How a Hypervisor Actually Shares CPU and I/O: A Practical Guide for KVM, LXC and Proxmox
vCPUs are just threads, cgroups don't replace the scheduler, and on NVMe the right I/O scheduler is usually 'none'. A ground-up tour of how Linux shares physical resources between virtual machines and containers — and which knobs are actually worth turning for small-to-medium workloads.
If you run virtual machines or containers, you have probably set “4 cores” on a guest without thinking hard about what that number does. It is worth thinking hard about, because the mental model most people carry — that cores are carved up and handed out like slices of a pie — is wrong in ways that matter the moment a host gets busy.
This article walks through how a Linux host running KVM/QEMU and LXC actually shares physical CPU and disk between guests: what the scheduler does, where cgroups fit, which knobs Proxmox and Incus expose, and how to know when any of it is going wrong. The bias throughout is toward small-to-medium workloads — the kind a freelancer or a small team runs — where the right answer is usually “measure first, tune the exceptions, leave the defaults alone.”
A vCPU is just a thread
Start here, because everything else follows from it.
When QEMU boots a guest with 4 virtual CPUs, it does not reserve 4 physical cores. It spawns 4 ordinary threads on the host, one per vCPU. From the host kernel’s point of view, those threads are indistinguishable from any other process. They sit in the same run queue as your shell, your monitoring agent, and every other guest’s vCPU threads.
That single fact explains overcommitment. Because vCPUs are just threads, you can assign more of them than you have physical cores. An 8-core / 16-thread box can happily run three guests with 4 vCPUs each (12 vCPUs against 8 physical cores and 16 hardware threads) as long as they do not all demand CPU at the same instant. When they do, the scheduler interleaves them, and each guest spends part of the time it wanted to run waiting for a core instead. That waiting has a name, and we will come back to it.
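The arithmetic behind that example can be sketched in a few lines of Python. This is only an illustration: the function name is made up here, and the default of two hardware threads per core is an assumption that matches the 8-core / 16-thread box described above.

```python
def overcommit_ratio(vcpus_per_guest, physical_cores, threads_per_core=2):
    """Return (vCPUs per physical core, vCPUs per hardware thread)."""
    total_vcpus = sum(vcpus_per_guest)
    logical_cpus = physical_cores * threads_per_core
    return total_vcpus / physical_cores, total_vcpus / logical_cpus


# Three guests with 4 vCPUs each on an 8-core / 16-thread host.
per_core, per_thread = overcommit_ratio([4, 4, 4], physical_cores=8)
print(per_core, per_thread)  # 1.5 0.75
```

A ratio above 1.0 per physical core means the guests can, in the worst case, ask for more than the cores can deliver at once. Whether they ever do is a question for monitoring, not for this calculation.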
When a vCPU thread is scheduled onto a physical core, the CPU enters guest mode through the hardware virtualisation extensions (VT-x on Intel, AMD-V/SVM on AMD). The guest runs natively until something forces an exit back to the host — a timer interrupt, an I/O request, or the scheduler deciding the thread’s slice is up. An idle guest issues a halt, its thread sleeps, and the host hands the core to someone else. Nothing is wasted spinning.
Which scheduler is even running?
“Linux has several schedulers” is true but conflates three separate things:
The CPU (task) scheduler decides which thread runs on which core. For years this was CFS, the Completely Fair Scheduler. Since kernel 6.6 (late 2023) the default is EEVDF — Earliest Eligible Virtual Deadline First — a drop-in replacement that is fairer to latency-sensitive tasks. The cgroup knobs are unchanged, and people still loosely say “CFS,” but on any recent kernel you are running EEVDF and there is nothing to configure.
Pluggable schedulers via sched_ext are the genuinely new development. Merged in kernel 6.12 (late 2024), sched_ext lets you load scheduling policies as BPF programs at runtime without recompiling the kernel. Because vCPUs are threads, these can affect virtualised workloads — but they target specific profiles (gaming, latency-tuned setups) and are still maturing. For a production host running client guests, a custom BPF scheduler adds a moving part with little upside at low-to-medium volume. Know it exists; reach for it only against a measured problem.
The I/O scheduler is completely separate and governs disk request ordering. More on it shortly.
The part that surprises people: none of Proxmox, Incus or libvirt has a concept of “pick a CPU scheduler.” That is not their job. They set cgroup parameters and hand vCPU threads to whatever the host kernel happens to be running. Scheduler choice is a host-kernel decision, transparent to the virtualisation layer above it.
cgroups don’t sit “above” the scheduler — they parameterise it
This is the most common conceptual slip, so it is worth being precise.
A cgroup is a grouping and accounting mechanism. It says “these processes belong together, and here are the resource rules for the group.” But a cgroup does no scheduling itself. The scheduler — EEVDF for CPU, the blk-mq scheduler for disk — is still the thing making every decision. It simply reads the cgroup’s settings as input.
So cgroups and the scheduler operate at the same level, the kernel’s resource-management core, working together: the cgroup supplies group-level policy, the scheduler does the actual dispatch. On the CPU side:
- cpu.weight: proportional share, applied under contention only. The scheduler still picks; the weight tilts how much each group gets. (Proxmox cpuunits, Incus limits.cpu.priority.)
- cpu.max: a hard quota cap. The group cannot exceed it even when cores sit idle. (Proxmox cpulimit, Incus limits.cpu.allowance in its time-slice form, for example 25ms/100ms.)
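To make the hard cap concrete: in cgroup v2 the cpu.max file holds a quota and a period in microseconds, or the word max for no limit. The sketch below converts a "cores' worth" figure into that form. The helper name is hypothetical, and the 100 ms period is the kernel's default rather than something this article specifies; the values a given management layer actually writes are worth checking on your own host.

```python
def cpu_max_line(cores_worth, period_us=100_000):
    """Format a cgroup v2 cpu.max value for a cap of `cores_worth` cores.

    A cap of 0 (or less) is treated as "unlimited", mirroring how a
    cpulimit of 0 is described in this article.
    """
    if cores_worth <= 0:
        return f"max {period_us}"
    quota_us = int(round(cores_worth * period_us))
    return f"{quota_us} {period_us}"


print(cpu_max_line(3))    # 300000 100000
print(cpu_max_line(0.5))  # 50000 100000
print(cpu_max_line(0))    # max 100000
```

Read this way, a cap of 3 means 300 ms of CPU time in every 100 ms window, summed across all of the group's threads. It says nothing about which cores that time is spent on.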
The I/O side mirrors this, with one asymmetry worth remembering:
- io.max (hard throughput/IOPS caps) is enforced regardless of which scheduler is active. It is a throttle, not an arbitration decision.
- io.weight (proportional fairness) only does something if something actually arbitrates. Under the none scheduler, with no other proportional controller such as io.cost (blk-iocost) enabled, it does nothing, because none does no arbitration at all.
That asymmetry is practical, not academic. On fast NVMe you will almost certainly run the none scheduler — which means setting io.weight on your containers achieves precisely nothing. If you need hard isolation between noisy neighbours on that device, caps (io.max) are your dependable lever; weights are situational and depend on running a scheduler that honours them. The same logic holds for CPU: hard caps always bite, weights only matter when there is something to arbitrate.
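For reference, a cgroup v2 io.max entry is a device number followed by key=value pairs (rbps, wbps, riops, wiops). The sketch below only builds that string; it does not write to any cgroup. The helper name and the 259:0 device number are illustrative assumptions, so look up the real major:minor of your device before using anything like this.

```python
ALLOWED_KEYS = ("rbps", "wbps", "riops", "wiops")


def io_max_line(major, minor, **limits):
    """Build one io.max entry, e.g. '259:0 rbps=100000000 riops=2000'."""
    unknown = set(limits) - set(ALLOWED_KEYS)
    if unknown:
        raise ValueError(f"unsupported io.max keys: {sorted(unknown)}")
    # Emit keys in a fixed order so the output is predictable.
    parts = [f"{key}={limits[key]}" for key in ALLOWED_KEYS if key in limits]
    return f"{major}:{minor} " + " ".join(parts)


# Hypothetical NVMe namespace at 259:0: 100 MB/s reads, 2000 read IOPS.
print(io_max_line(259, 0, rbps=100_000_000, riops=2000))
# 259:0 rbps=100000000 riops=2000
```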
The I/O scheduler, and the trap of stacking it
Since kernel 5.0 Linux uses the multi-queue block layer (blk-mq) exclusively. The old single-queue schedulers are gone. The current choices:
- none: no reordering. The right answer for NVMe SSDs, whose controllers reorder far better than the kernel can. Software scheduling on top just adds latency and CPU overhead.
- mq-deadline: lightweight, deadline-based with a mild read bias. A solid default for SATA/SAS SSDs.
- kyber: simple, low-overhead, latency-targeting. Sits between none and the heavier options.
- bfq: proportional fairness between processes. Excellent for spinning disks and interactive desktops; its per-request overhead can hurt throughput on fast NVMe.
The rule of thumb: NVMe → none; SATA/SAS SSD → mq-deadline (or none); HDD → bfq (or mq-deadline).
The part that bites in virtualisation is stacking. A guest VM’s virtual disk is not a real device — the host does the actual physical I/O. So the guest’s scheduler should be none: pass requests straight down and let the host do the real scheduling against the hardware. Running bfq inside a guest on top of bfq on the host means double scheduling, wasted CPU and unpredictable latency.
Guest VM → none (defer downward, don't second-guess)
Host → none on NVMe (does the real ordering against hardware)
LXC/Incus containers do not have this problem. A container shares the host kernel — there is no separate guest block layer, so there is only the one host-level scheduler to set. One less thing to coordinate, and part of why containers are operationally simpler for databases.
Set the scheduler persistently per device class with a udev rule, which beats the elevator= boot parameter because it lets you target NVMe and rotational disks differently:
# /etc/udev/rules.d/60-ioschedulers.rules
ACTION=="add|change", KERNEL=="nvme[0-9]*", ATTR{queue/scheduler}="none"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="mq-deadline"
Check what is active (the value in brackets):
cat /sys/block/nvme0n1/queue/scheduler
# [none] mq-deadline kyber bfq
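If you want to check many devices at once, the bracketed value is easy to parse. The following is a minimal sketch: both function names are invented for illustration, and the recommendation helper simply encodes this article's rule of thumb, not any kernel policy.

```python
import re


def active_scheduler(sysfs_text):
    """Return the scheduler shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def recommended_scheduler(device_name, rotational):
    """Rule of thumb from this article: NVMe -> none, SSD -> mq-deadline, HDD -> bfq."""
    if device_name.startswith("nvme"):
        return "none"
    return "bfq" if rotational else "mq-deadline"


sample = "[none] mq-deadline kyber bfq\n"
print(active_scheduler(sample))                  # none
print(recommended_scheduler("nvme0n1", False))   # none
print(recommended_scheduler("sda", True))        # bfq
```

On a real host you would feed it the contents of /sys/block/<device>/queue/scheduler and /sys/block/<device>/queue/rotational and flag any device where the two answers differ.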
What about per-instance I/O settings?
For LXC/Incus, there is no separate scheduler to set per container. The none you see for an NVMe device from inside a container is simply the host's scheduler for the real device. Your per-instance lever is cgroup limits, not scheduler choice:
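# Cap read and write throughput on the instance's root disk
incus config device set <instance> root limits.read 100MB
incus config device set <instance> root limits.write 50MB
# A limit can also be given in IOPS; setting limits.read again replaces the byte cap above
incus config device set <instance> root limits.read 2000iops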
For QEMU/KVM, the impactful per-VM knobs are not the scheduler at all but QEMU’s caching and AIO mode. For databases on local NVMe or Ceph, the safe, durable starting point is cache=none (uses O_DIRECT, respects flushes, so the database’s durability guarantees hold), aio=io_uring or aio=native, and a dedicated iothread. The best cache/aio combination for a specific backend is worth benchmarking — Ceph RBD sometimes prefers io_uring where local block storage does well with native — rather than taking a default as gospel.
Proxmox CPU limits, with examples
Proxmox exposes three CPU knobs that are easy to confuse because they are genuinely independent:
- cores: how many vCPUs the guest sees. Topology, not a limit.
- cpulimit: a hard cap on total CPU time, in units of whole cores (cpu.max). 0 means unlimited.
- cpuunits: proportional weight under contention (cpu.weight). The default is 100 on cgroup v2 hosts (1024 on cgroup v1). Only matters when the host is oversubscribed and guests compete.
Put plainly: cores is what the guest sees; cpulimit is what it can actually consume; cpuunits is who wins when there is a fight.
On an 8-core / 16-thread host, a sensible mixed allocation:
| Guest | cores | cpulimit | cpuunits | Intent |
|---|---|---|---|---|
| DB VM | 4 | 3 | 2048 | Bursts across 4 cores; capped at 3 sustained; wins contention |
| Web VM | 4 | 4 | 1024 | Normal priority |
| App VM | 2 | 2 | 1024 | Modest |
| Batch | 4 | 2 | 50 | Spreads when idle; yields the moment anyone else needs CPU |
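That is 14 vCPUs assigned against 16 logical threads (and 8 physical cores), so the host is only modestly oversubscribed. When the host is idle each guest can burst up to its cpulimit; under contention cpuunits arbitrates, so the database is served first and the batch worker is squeezed first. The weights are relative, so what matters is the 2048 : 1024 : 1024 : 50 ratio rather than the absolute numbers. Exactly the behaviour you want.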
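One way to build intuition for how weights and caps interact is an idealised "water-filling" model: share the available CPU in proportion to weight, clamp anyone who would exceed their cap, and redistribute the remainder. The sketch below is that model only. It is not what the kernel literally computes (the real scheduler works per CPU and per runnable thread), and the 8 cores' worth of contended capacity is an assumption chosen for illustration.

```python
def fair_shares(guests, capacity):
    """Idealised weighted sharing with hard caps.

    guests:   {name: (weight, cap_in_cores)}
    capacity: cores' worth of CPU being fought over
    returns:  {name: cores' worth each guest ends up with}
    """
    shares = {}
    active = dict(guests)
    remaining = capacity
    while active:
        total_weight = sum(weight for weight, _ in active.values())
        # Guests whose proportional share would meet or exceed their cap.
        capped = {
            name for name, (weight, cap) in active.items()
            if remaining * weight / total_weight >= cap
        }
        if not capped:
            for name, (weight, _) in active.items():
                shares[name] = remaining * weight / total_weight
            break
        for name in capped:
            shares[name] = active[name][1]
            remaining -= active[name][1]
            del active[name]
    return shares


# (cpuunits, cpulimit) taken from the table above.
GUESTS = {"db": (2048, 3), "web": (1024, 4), "app": (1024, 2), "batch": (50, 2)}

for name, cores in fair_shares(GUESTS, capacity=8).items():
    print(f"{name:5s} {cores:.2f}")
```

With every guest demanding CPU at once, the database and app guests hit their caps, the web guest takes most of what is left, and the batch worker is left with a small fraction of a core. That is the "batch yields first" behaviour the table is aiming for.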
For a latency-sensitive guest you can pin it to specific physical cores with affinity: 0,1,2,3 and keep other guests off those cores. This is a manual partition that buys cache and SMT isolation, worth the effort only against a measured requirement. Verify the core/SMT-sibling mapping with lscpu -e first; the numbering of logical CPUs is not guaranteed to follow any particular pattern.
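To see which logical CPUs are SMT siblings before choosing an affinity list, the mapping can be grouped by core. The sketch below parses the comma-separated output of lscpu -p=CPU,CORE (the parsable sibling of lscpu -e). The function name and the sample text are made up for illustration, and the grouping assumes a single-socket host, since core IDs may repeat across sockets.

```python
from collections import defaultdict


def siblings_by_core(lscpu_parsable_text):
    """Group logical CPU numbers by core ID from `lscpu -p=CPU,CORE` output."""
    cores = defaultdict(list)
    for line in lscpu_parsable_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header lscpu prints
        cpu, core = line.split(",")[:2]
        cores[int(core)].append(int(cpu))
    return dict(cores)


# Hypothetical 2-core / 4-thread layout where siblings are NOT adjacent.
sample = "# CPU,Core\n0,0\n1,1\n2,0\n3,1\n"
print(siblings_by_core(sample))  # {0: [0, 2], 1: [1, 3]}
```

In this made-up layout, pinning a guest to CPUs 0 and 1 would spread it over two physical cores, while 0 and 2 would put both vCPUs on the same core. That is exactly the kind of difference the numbering can hide.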
What “consuming CPU” does to the neighbours
Here is the payoff of the vCPU-is-a-thread model. A busy process in one VM affects others only when there is contention for physical cores, and the effect is delay, never data interference.
No contention, no effect. If the host has spare cores when VM-A goes busy, VM-B’s threads still get cores whenever they want them. VM-A maxing out 4 cores on an otherwise idle 16-thread host has zero impact on VM-B. This is the normal state for low-to-medium workloads.
Under contention, they time-slice. When total demand exceeds capacity, EEVDF interleaves the threads, and VM-B experiences steal time — cycles it wanted but could not get because the core was running someone else. A 100 ms task might take 130 ms. Nothing breaks; it is just slower.
This is where the limits become concrete. cpuunits acts only during contention and decides who loses time first: a high-weight database is insulated from a busy neighbour not by reserving cores but by winning arbitration. cpulimit applies all the time and stops a guest from creating as much contention in the first place: a runaway or untrusted guest simply cannot exceed its cap.
What is never shared is correctness or security. A busy guest cannot read another guest’s memory (hardware-isolated via EPT/NPT page tables), cannot see its data, and cannot crash it. The interference is purely performance contention over a shared physical resource.
Two subtler effects deserve a mention. Cache and memory-bandwidth contention: even when cores are technically free, guests share L3 cache and the memory bus, so a guest thrashing memory can slow a neighbour without classic steal time appearing — the “noisy neighbour” effect, mitigated mainly by pinning. And SMT sharing: your 16 “cores” are 8 physical cores with 2 threads each; siblings share execution units, so 16 logical threads is not 16 cores of real throughput. This is why steal time can show up earlier than raw thread counts would suggest.
The toolkit, lightest to heaviest
- Nothing: unlimited, scheduler-fair. Best for trusted, mostly-idle guests. Maximum burst.
- cpuunits: express priority under contention without capping anyone.
- cpulimit: a hard ceiling on a specific greedy or untrusted guest.
- affinity (pinning): physical-core partitioning for cache/SMT isolation and latency guarantees.
The mistake is pre-emptively capping everything on a quiet host: you throw away free burst capacity that the scheduler’s fairness was handling perfectly well. Leave guests unlimited by default; limit the exceptions.
Monitoring: steal time first
If there is one number to watch, it is CPU steal. It is the direct signal that a host is genuinely oversubscribed under load — not merely dense on paper.
node_exporter exposes it as a mode of the core CPU counter:
node_cpu_seconds_total{mode="steal"}
It is a cumulative counter per logical CPU, so rate it and average across cores for a percentage:
avg(rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100
Where you measure matters. Steal is a guest’s view of cycles the hypervisor withheld, so node_exporter must run inside the VM for the metric to mean anything. On the Proxmox host it reads ~0 — the host does the stealing, it does not get stolen from. On the host you instead watch saturation, run-queue length, load average, and per-VM usage.
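For a quick check inside a guest without any exporter, the same number can be derived from two samples of the aggregate cpu line in /proc/stat, where steal is the eighth counter. The sketch below is a minimal illustration with invented function names and made-up sample values; it deliberately ignores the trailing guest fields, which the kernel already includes in user time.

```python
def parse_cpu_line(line):
    """Return the first eight counters (user .. steal) of a /proc/stat cpu line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:9]]


def steal_percent(before, after):
    """Percentage of CPU time reported as steal between two samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0


# Made-up samples: 1000 ticks elapsed in total, 100 of them counted as steal.
t0 = "cpu  100 0 50 800 10 0 5 35 0 0"
t1 = "cpu  160 0 70 1605 20 0 10 135 0 0"
print(steal_percent(t0, t1))  # 10.0
```

In practice you would read the first line of /proc/stat twice, a few seconds apart, inside the VM you are worried about.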
A reasonable alert, PromQL-compatible so it drops straight into vmalert + Alertmanager:
groups:
  - name: cpu_contention
    rules:
      - alert: HighCPUSteal
        expr: avg(rate(node_cpu_seconds_total{mode="steal"}[5m])) by (instance) * 100 > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High CPU steal on {{ $labels.instance }}"
          description: "Steal time has averaged over 10% for 15 minutes; the host may be oversubscribed."
There is no universal correct threshold — treat 10% as a starting point to tune. Brief steal is normal; sustained steal is the problem, which is why the for: 15m clause matters as much as the number. Rough guidance: a few percent steady-state is fine, sustained double digits warrants investigation, and 20%+ means guests are materially starved.
LXC is a different signal
Steal is not a clean per-container concept — a container shares the host kernel, so there is nothing to steal from in the VM sense. The meaningful metric is cgroup CPU throttling: how often and how long a container hit its cpu.max quota.
A caveat that catches people: throttling only registers when you have set a limit. An unlimited container cannot be throttled, so for those you fall back to host-level saturation as your indicator.
The ground truth lives in the cgroup filesystem, which is exporter-independent and always correct:
cat /sys/fs/cgroup/lxc/<id>/cpu.stat
# usage_usec, nr_periods, nr_throttled, throttled_usec
Climbing nr_throttled means that container is hitting its cap. (The exact path varies with Proxmox’s cgroup layout; confirm with find /sys/fs/cgroup -name cpu.stat | grep <id>.)
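Because these are cumulative counters, the useful figure is the change between two reads. A minimal sketch, assuming the cgroup v2 field names shown above; the function names and sample numbers are invented for illustration.

```python
def parse_cpu_stat(text):
    """Parse cgroup v2 cpu.stat content into a {field: int} dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(before, after):
    """Return (fraction of periods throttled, seconds spent throttled)."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    throttled_us = after["throttled_usec"] - before["throttled_usec"]
    ratio = throttled / periods if periods else 0.0
    return ratio, throttled_us / 1_000_000


# Made-up samples taken a minute apart.
first = parse_cpu_stat("nr_periods 1000\nnr_throttled 100\nthrottled_usec 2000000\n")
second = parse_cpu_stat("nr_periods 1600\nnr_throttled 250\nthrottled_usec 5000000\n")
print(throttle_summary(first, second))  # (0.25, 3.0)
```

A ratio of 0.25 here would mean the container ran into its quota in a quarter of the enforcement periods during the interval. That is a sign the cap, not the host, is the bottleneck.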
As for the exporters themselves: the Proxmox VE exporter reliably surfaces per-guest CPU usage (enough to spot which container is hot), though whether it exposes fine-grained throttling counters is worth checking against your own /metrics output rather than assuming. Incus has a built-in Prometheus endpoint at /1.0/metrics with per-instance CPU usage labelled by instance and project — confirm the exact metric names from the live endpoint. When in doubt, a small node_exporter textfile-collector script reading cpu.stat directly is the dependable fallback.
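A textfile-collector script of that kind might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the metric names, the id label, and the output path are invented, and the directory must match whatever your node_exporter is configured to read.

```python
import os
import tempfile

# Hypothetical metric names; pick ones that fit your own naming scheme.
METRICS = {
    "nr_periods": ("lxc_cpu_periods_total", "Enforcement periods elapsed."),
    "nr_throttled": ("lxc_cpu_throttled_periods_total", "Periods in which the quota was hit."),
    "throttled_usec": ("lxc_cpu_throttled_usec_total", "Microseconds spent throttled."),
}


def render(per_container):
    """Render {container_id: cpu.stat dict} in Prometheus text format."""
    lines = []
    for key, (name, help_text) in METRICS.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        for cid, stats in sorted(per_container.items()):
            if key in stats:
                lines.append(f'{name}{{id="{cid}"}} {stats[key]}')
    return "\n".join(lines) + "\n"


def write_atomically(path, content):
    """Write via a temp file and rename so a scrape never sees a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(content)
    os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
    os.replace(tmp_path, path)


sample = {"101": {"nr_periods": 1600, "nr_throttled": 250, "throttled_usec": 5000000}}
print(render(sample))
```

Run from cron or a systemd timer, with the dictionary filled from each container's cpu.stat, this gives you throttling counters that you can rate() in the same way as the steal metric.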
The through-line
Every section here rests on the same idea: physical resources are shared by a scheduler, on demand, and the management layers above it only ever supply policy. So:
- vCPUs are threads; cores are time-shared, not partitioned.
- cgroups parameterise the scheduler; they do not replace or sit above it.
- Hard caps always bite; weights only matter when the scheduler arbitrates (and not at all under none).
- On NVMe, none everywhere is correct, not something to fix.
- A busy guest slows neighbours only under real contention, and only ever by bounded delay; the isolation boundary holds.
- Monitor steal time inside VMs and throttling inside containers first, then tune the exceptions.
For small-to-medium workloads, the host is idle most of the time and the defaults are good. Resist the urge to micro-partition a machine that is coping fine. Set a cap to contain a guest you do not trust, a weight to protect the one you care about, and reach for pinning only when a measured problem demands it. Measure first; tune the exceptions; leave the rest alone.
Madalin