mm/vmstat.c · be5e015d107d5336f298b74ea5a4f0b1773bc6f9 · Kirill Smelkov / linux

vmstat: skip periodic vmstat update for isolated CPUs · be5e015d
Marcelo Tosatti authored Jun 07, 2023
Problem: The interruption caused by vmstat_update is undesirable
for certain applications.

With workloads that are running on isolated cpus with nohz full mode to
shield off any kernel interruption. For example, a VM running a
time sensitive application with a 50us maximum acceptable interruption
(use case: soft PLC).

oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...

The example above shows an additional 7us for the
        oslat -> kworker -> oslat

switches. In the case of a virtualized CPU, and the vmstat_update
interruption in the host (of a qemu-kvm vcpu), the latency penalty
observed in the guest is higher than 50us, violating the acceptable
latency threshold.

The isolated vCPU can perform operations that modify per-CPU page counters,
for example to complete I/O operations:

      CPU 11/KVM-9540    [001] dNh1.  2314.248584: mod_zone_page_state <-__folio_end_writeback
      CPU 11/KVM-9540    [001] dNh1.  2314.248585: <stack trace>
 => 0xffffffffc042b083
 => mod_zone_page_state
 => __folio_end_writeback
 => folio_end_writeback
 => iomap_finish_ioend
 => blk_mq_end_request_batch
 => nvme_irq
 => __handle_irq_event_percpu
 => handle_irq_event
 => handle_edge_irq
 => __common_interrupt
 => common_interrupt
 => asm_common_interrupt
 => vmx_do_interrupt_nmi_irqoff
 => vmx_handle_exit_irqoff
 => vcpu_enter_guest
 => vcpu_run
 => kvm_arch_vcpu_ioctl_run
 => kvm_vcpu_ioctl
 => __x64_sys_ioctl
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

In kernel users of vmstat counters either require the precise value and
they are using zone_page_state_snapshot interface or they can live with an
imprecision as the regular flushing can happen at arbitrary time and
cumulative error can grow (see calculate_normal_threshold).

From that POV the regular flushing can be postponed for CPUs that have
been isolated from the kernel interference without critical infrastructure
ever noticing.  Skip regular flushing from vmstat_shepherd for all
isolated CPUs to avoid interference with the isolated workload.

Suggested by Michal Hocko.

Link: https://lkml.kernel.org/r/ZIDoV/zxFKVmQl7W@tpadSigned-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
be5e015d
vmstat.c 55.5 KB
Replace vmstat.c