More performance from virtualization?

I would be very interested in opinions on the idea below. Big thanks to everyone who has already shared their feedback with me. I’ve decided to write about the idea here in more detail to collect more feedback and hopefully learn more about the details of spreading workloads across virtual nodes.

Back in 2005, Intel released the first mainstream x86 CPUs with a set of virtualization extensions called VT-x, which, through wide adoption, soon became the de facto standard for desktop and server virtualization. It is not hard to imagine how this works: in essence, it is a set of instructions that give control over (launch, resume, stop) isolated contexts/domains on a single CPU. With the later addition of EPT and VT-d, guests also get hardware-assisted translation of their memory addresses and can gain direct access to PCI devices through the IOMMU.
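To see whether a given machine exposes these extensions, one option on Linux is to look at the CPU feature flags. The snippet below is only a minimal sketch: it assumes a Linux host where `/proc/cpuinfo` lists `vmx` (Intel VT-x), `svm` (AMD's equivalent) and `ept` among the flags. The function name and the flag table are made up for this illustration, and VT-d is not covered because it is not reported there.

```python
from pathlib import Path

# Flag names as Linux reports them in /proc/cpuinfo (assumed Linux host).
FLAGS_OF_INTEREST = {
    "vmx": "Intel VT-x",
    "svm": "AMD-V",
    "ept": "Extended Page Tables",
}


def virtualization_flags(cpuinfo_text):
    """Return the entries of FLAGS_OF_INTEREST present in cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # "flags" is the x86 feature list; newer kernels add a "vmx flags" line.
        if sep and key.strip() in ("flags", "vmx flags"):
            found.update(value.split())
    return {name: desc for name, desc in FLAGS_OF_INTEREST.items() if name in found}


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print(virtualization_flags(path.read_text()) or "no virtualization flags found")
    else:
        print("/proc/cpuinfo is not available on this platform")
```

An empty result does not necessarily mean the CPU lacks the feature; it may simply be disabled in the firmware settings, or hidden because the system is itself a guest.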

While there are many use cases for virtualization, such as isolation, simulation and resource use optimization, in this case I would like to focus purely on performance gained from direct hardware access.

Running a program on a host machine will, in a nutshell, allocate memory as required, execute operations on the CPU, perform I/O to move data between memory and the CPU, and optionally use persistent storage such as hard drives. In order to run, the program has to wait for the process scheduler and the I/O scheduler to allow execution, wait for interrupts (IRQs) if required, and only then spend CPU cycles on its operations. How these waits line up can be favourable or unfavourable, which affects performance.

Various techniques exist for optimizing the load to gain more performance, especially in setups where sufficient resources are available. Many programs spawn threads, workers or simultaneous jobs, which give better results (for many reasons) than running as a single, monolithic process. Nevertheless, all of these still go through the same path before the hardware actually processes low-level instructions: schedulers, cycles, etc.
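For reference, this is roughly what the conventional approach looks like: independent jobs started as separate OS processes, with a cap on how many run at once. It is a sketch only. The `run_jobs` helper and the toy commands are invented for this post, and every child process still goes through the normal kernel scheduler, which is exactly the path described above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_jobs(commands, jobs=2):
    """Run each command as its own OS process, at most `jobs` at a time."""

    def run_one(cmd):
        done = subprocess.run(cmd, capture_output=True, text=True)
        return done.returncode, done.stdout.strip()

    # The threads only wait on child processes; the actual work happens in
    # the children, which the kernel schedules like any other process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # pool.map returns results in the order the commands were given.
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Hypothetical stand-ins for build steps: each child squares a number.
    cmds = [[sys.executable, "-c", f"print({n} * {n})"] for n in range(4)]
    print(run_jobs(cmds, jobs=2))
```

Raising `jobs` lets more children run concurrently, but none of them gets any closer to the hardware than a regular process would.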

And this leads to the idea: could there be a new (?) technique for running multiple jobs that would utilize the VT extensions? I imagine it would not be applicable in all cases, and the decision should be left to the developer or maintainer of a given program. For example, I can’t imagine applying this to a workload with a lot of mutexes or other inter-process communication requirements. But take the make program, for example: could running multiple -j jobs in separate virtual contexts give them performance gains from direct I/O and, perhaps, better use of CPU cycles?