    vmstat: skip periodic vmstat update for isolated CPUs · be5e015d
    Marcelo Tosatti authored
    Problem: The interruption caused by vmstat_update is undesirable
    for certain applications.
    
    This affects workloads running on isolated CPUs in nohz_full mode, which
    is meant to shield them from any kernel interruption. An example is a VM
    running a time-sensitive application with a 50us maximum acceptable
    interruption (use case: soft PLC).
    
    oslat   1094.456862: sys_mlock(start: 7f7ed0000b60, len: 1000)
    oslat   1094.456971: workqueue_queue_work: ... function=vmstat_update ...
    oslat   1094.456974: sched_switch: prev_comm=oslat ... ==> next_comm=kworker/5:1 ...
    kworker 1094.456978: sched_switch: prev_comm=kworker/5:1 ==> next_comm=oslat ...
    
    The example above shows an additional 7us for the
            oslat -> kworker -> oslat
    
    switches. When the CPU is virtualized and vmstat_update interrupts a
    qemu-kvm vcpu thread in the host, the latency penalty observed in the
    guest is higher than 50us, violating the acceptable latency threshold.
    
    The isolated vCPU can perform operations that modify per-CPU page counters,
    for example to complete I/O operations:
    
          CPU 11/KVM-9540    [001] dNh1.  2314.248584: mod_zone_page_state <-__folio_end_writeback
          CPU 11/KVM-9540    [001] dNh1.  2314.248585: <stack trace>
     => 0xffffffffc042b083
     => mod_zone_page_state
     => __folio_end_writeback
     => folio_end_writeback
     => iomap_finish_ioend
     => blk_mq_end_request_batch
     => nvme_irq
     => __handle_irq_event_percpu
     => handle_irq_event
     => handle_edge_irq
     => __common_interrupt
     => common_interrupt
     => asm_common_interrupt
     => vmx_do_interrupt_nmi_irqoff
     => vmx_handle_exit_irqoff
     => vcpu_enter_guest
     => vcpu_run
     => kvm_arch_vcpu_ioctl_run
     => kvm_vcpu_ioctl
     => __x64_sys_ioctl
     => do_syscall_64
     => entry_SYSCALL_64_after_hwframe
    
    In-kernel users of vmstat counters either require the precise value, in
    which case they use the zone_page_state_snapshot interface, or they can
    live with imprecision, since regular flushing can happen at an arbitrary
    time and cumulative error can grow (see calculate_normal_threshold).
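
    The two kinds of reader described above can be illustrated with a small
    userspace model. This is only a sketch, not the kernel implementation:
    struct counter_model, NR_CPUS and the model_* functions are hypothetical
    names, and a single global value plus one pending delta per CPU stands in
    for the real per-zone, per-CPU data.

```c
#include <assert.h>

#define NR_CPUS 4	/* assumption: small fixed CPU count for the model */

/* Hypothetical model: a global value plus one unflushed delta per CPU. */
struct counter_model {
	long global;		/* what a cheap reader sees */
	long diff[NR_CPUS];	/* per-CPU deltas not yet folded into global */
};

/* Update path: only the local CPU's delta is touched. */
static void model_mod(struct counter_model *c, int cpu, long delta)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	c->diff[cpu] += delta;
}

/* Cheap read: may lag behind by whatever is still pending per CPU. */
static long model_read(const struct counter_model *c)
{
	return c->global;
}

/* Precise read: global value plus every pending per-CPU delta. */
static long model_read_snapshot(const struct counter_model *c)
{
	long sum = c->global;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->diff[cpu];
	return sum;
}
```

    In this model, a reader that calls model_read_snapshot() gets the same
    answer whether or not a flush has run, which is why delaying the flush
    only affects readers that already tolerate imprecision.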
    
    From that point of view, regular flushing can be postponed for CPUs that
    have been isolated from kernel interference without critical
    infrastructure ever noticing.  Skip regular flushing from vmstat_shepherd
    for all isolated CPUs to avoid interference with the isolated workload.
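
    The skip described above can be sketched as a userspace model. This is
    not the patch itself: struct cpu_model, struct vmstat_model and
    model_shepherd() are hypothetical, the isolated flag stands in for
    whatever isolation check the kernel performs, and the model folds deltas
    directly instead of queueing per-CPU work.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4	/* assumption: small fixed CPU count for the model */

struct cpu_model {
	bool isolated;	/* stand-in for the kernel's isolation check */
	long diff;	/* pending per-CPU delta */
};

struct vmstat_model {
	long global;
	struct cpu_model cpu[NR_CPUS];
};

/*
 * One shepherd pass: fold pending deltas into the global value,
 * but never touch a CPU marked isolated. Returns the number of
 * CPUs that were flushed.
 */
static int model_shepherd(struct vmstat_model *m)
{
	int flushed = 0;

	assert(m);
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		struct cpu_model *c = &m->cpu[cpu];

		if (c->isolated)
			continue;	/* leave the isolated workload alone */
		if (c->diff == 0)
			continue;	/* nothing pending on this CPU */

		m->global += c->diff;
		c->diff = 0;
		flushed++;
	}
	return flushed;
}
```

    In this model an isolated CPU's delta simply stays pending after each
    pass, so the global value drifts by that amount until something other
    than the periodic pass folds it in.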
    
    Suggested by Michal Hocko.
    
    Link: https://lkml.kernel.org/r/ZIDoV/zxFKVmQl7W@tpad
    Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Frederic Weisbecker <frederic@kernel.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>