    KVM: x86/pmu: Explicitly check NMI from guest to reduce false positives · 812d4323
    Like Xu authored
    Explicitly check that the source of external interrupt is indeed an NMI
    in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples
    (host samples labelled as guest samples) generated by perf/core NMI mode
    if an NMI arrives after VM-Exit, but before kvm_after_interrupt():
    
     # test: perf-record + cpu-cycles:HP (which collects host-only precise samples)
     # Symbol                                   Overhead       sys       usr  guest sys  guest usr
     # .......................................  ........  ........  ........  .........  .........
     #
     # Before:
       [g] entry_SYSCALL_64                       24.63%     0.00%     0.00%     24.63%      0.00%
       [g] syscall_return_via_sysret              23.23%     0.00%     0.00%     23.23%      0.00%
       [g] files_lookup_fd_raw                     6.35%     0.00%     0.00%      6.35%      0.00%
     # After:
       [k] perf_adjust_freq_unthr_context         57.23%    57.23%     0.00%      0.00%      0.00%
       [k] __vmx_vcpu_run                          4.09%     4.09%     0.00%      0.00%      0.00%
       [k] vmx_update_host_rsp                     3.17%     3.17%     0.00%      0.00%      0.00%
    
    In the above case, perf records the samples labelled '[g]'. The RIPs behind
    these bogus samples are queried by perf_instruction_pointer() after it
    determines whether the CPU is in GUEST state or not, and here's the issue:
    
    If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and
    at least one PMU counter is enabled on host, kvm_arch_pmi_in_guest() will
    remain true (KVM_HANDLING_IRQ is set) until kvm_after_interrupt().
    
    During this window, if a PMI occurs on host (since KVM's host-side code
    is being executed), control flow is transferred, via the host NMI
    context, to perf/core to generate performance samples, and thus
    perf_instruction_pointer() and perf_guest_get_ip() are called.
    
    Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may
    cause perf/core to mistakenly assume that the source RIP of the host NMI
    belongs to the guest world and use perf_guest_get_ip() to get the RIP of
    a vCPU that has already exited due to a non-NMI interrupt.
    
    Erroneous samples are then recorded and presented to the end-user via
    perf-report. Such false positives can be eliminated by explicitly
    checking that the interrupt being handled is KVM_HANDLING_NMI.
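
    The difference between the old and new checks can be modelled in plain
    user-space C. This is a minimal sketch, not the kernel macro: the enum,
    both function names, and the in_nmi parameter (standing in for the
    kernel's in_nmi()) are assumptions made for illustration.

```c
#include <stdbool.h>

/* Illustrative stand-in for KVM's record of which interrupt type it is
 * currently handling on behalf of the guest (names are hypothetical). */
enum intr_type {
    HANDLING_NONE = 0,
    HANDLING_IRQ,
    HANDLING_NMI,
};

/* Old check: any interrupt being handled for the guest counts as "in guest". */
static bool pmi_in_guest_old(enum intr_type handling)
{
    return handling != HANDLING_NONE;
}

/* New check: the current context must also agree with the recorded type,
 * i.e. we are in NMI context if and only if KVM recorded an NMI. */
static bool pmi_in_guest_new(enum intr_type handling, bool in_nmi)
{
    return handling != HANDLING_NONE &&
           in_nmi == (handling == HANDLING_NMI);
}
```

    In this model, a host NMI that arrives while an IRQ-induced exit is being
    handled is (HANDLING_IRQ, in_nmi = true): the old check reports "guest",
    the new one does not.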
    
    Note that when VM-Exit is indeed triggered by a PMI, it is still possible
    for another PMI to be generated on host before KVM_HANDLING_NMI is cleared.
    Also, for perf/core timer mode, false positives are still possible since
    perf/core does not always use those non-NMI sources of interrupts.
    
    For events that are host-only, perf/core can and should eliminate false
    positives by checking event->attr.exclude_guest, i.e. events that are
    configured to exclude KVM guests should never fire in the guest.
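
    A rough sketch of what such a host-only filter could look like, written
    as standalone C. This is not existing perf/core code: struct sample_ctx,
    its fields, and attribute_to_guest() are hypothetical names; only
    exclude_guest mirrors the real event->attr.exclude_guest bit.

```c
#include <stdbool.h>

/* Hypothetical per-sample context for the sketch. */
struct sample_ctx {
    bool exclude_guest;   /* mirrors event->attr.exclude_guest */
    bool kvm_says_guest;  /* what the hypervisor's in-guest check reported */
};

/* Decide whether a sample should be attributed to the guest. */
static bool attribute_to_guest(const struct sample_ctx *ctx)
{
    /* A host-only event can never have fired in the guest, so ignore
     * whatever the hypervisor reported. */
    if (ctx->exclude_guest)
        return false;

    /* Otherwise fall back to the hypervisor's answer. */
    return ctx->kvm_says_guest;
}
```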
    
    Events that are configured to count host and guest are trickier, perhaps
    impossible to handle with 100% accuracy?  And regardless of what accuracy
    is provided by perf/core, improving KVM's accuracy is cheap and easy, with
    no real downsides.
    
    Fixes: dd60d217 ("KVM: x86: Fix perf timer mode IP reporting")
    Signed-off-by: Like Xu <likexu@tencent.com>
    Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
    Link: https://lore.kernel.org/r/20231206032054.55070-1-likexu@tencent.com
    [sean: massage changelog, squash !!in_nmi() fixup from Like]
    Signed-off-by: Sean Christopherson <seanjc@google.com>