    KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls · 1b6043e8
    Sean Christopherson authored
    When zapping a TDP MMU root, perform the zap in two passes to avoid
    zapping an entire top-level SPTE while holding RCU, which can induce RCU
    stalls.  In the first pass, zap SPTEs at PG_LEVEL_1G, and then
    zap top-level entries in the second pass.
    
    With 4-level paging, zapping a PGD that is fully populated with 4KiB leaf
    SPTEs takes up to ~7 or so seconds (time varies based on kernel config,
    number of (v)CPUs, etc...).  With 5-level paging, that time can balloon
    well into hundreds of seconds.
    
    Before remote TLB flushes were omitted, the problem was even worse as
    waiting for all active vCPUs to respond to the IPI introduced significant
    overhead for VMs with large numbers of vCPUs.
    
    By zapping 1GiB SPTEs (both shadow pages and hugepages) in the first
    pass, the amount of work that is done without dropping RCU protection is
    strictly bounded, with the worst case latency for a single operation
    being less than 100ms.
    
    Zapping at 1GiB in the first pass is not arbitrary.  First and foremost,
    KVM relies on being able to zap 1GiB shadow pages in a single shot when
    replacing a shadow page with a hugepage.  Zapping a 1GiB shadow page
    that is fully populated with 4KiB dirty SPTEs also triggers the worst
    case latency due to writing back the struct page accessed/dirty bits for
    each 4KiB page, i.e. the two-pass approach is guaranteed to work so long
    as KVM can cleanly zap a 1GiB shadow page.
    
      rcu: INFO: rcu_sched self-detected stall on CPU
      rcu:     52-....: (20999 ticks this GP) idle=7be/1/0x4000000000000000 softirq=15759/15759 fqs=5058
       (t=21016 jiffies g=66453 q=238577)
      NMI backtrace for cpu 52
      Call Trace:
       ...
       mark_page_accessed+0x266/0x2f0
       kvm_set_pfn_accessed+0x31/0x40
       handle_removed_tdp_mmu_page+0x259/0x2e0
       __handle_changed_spte+0x223/0x2c0
       handle_removed_tdp_mmu_page+0x1c1/0x2e0
       __handle_changed_spte+0x223/0x2c0
       handle_removed_tdp_mmu_page+0x1c1/0x2e0
       __handle_changed_spte+0x223/0x2c0
       zap_gfn_range+0x141/0x3b0
       kvm_tdp_mmu_zap_invalidated_roots+0xc8/0x130
       kvm_mmu_zap_all_fast+0x121/0x190
       kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
       kvm_page_track_flush_slot+0x5c/0x80
       kvm_arch_flush_shadow_memslot+0xe/0x10
       kvm_set_memslot+0x172/0x4e0
       __kvm_set_memory_region+0x337/0x590
       kvm_vm_ioctl+0x49c/0xf80
    Reported-by: David Matlack <dmatlack@google.com>
    Cc: Ben Gardon <bgardon@google.com>
    Cc: Mingwei Zhang <mizhang@google.com>
    Signed-off-by: Sean Christopherson <seanjc@google.com>
    Reviewed-by: Ben Gardon <bgardon@google.com>
    Message-Id: <20220226001546.360188-22-seanjc@google.com>
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>