1. 04 Aug, 2021 1 commit
    • KVM: X86: MMU: Tune PTE_LIST_EXT to be bigger · dc1cff96
      Peter Xu authored
      Currently each rmap array element contains only 3 entries.  However, for
      EPT=N there can be many guest pages with tens or even hundreds of rmap
      entries.
      
      A typical distribution for a 6G guest (even when idle) shows this in the
      rmap count statistics:
      
      Rmap_Count:     0       1       2-3     4-7     8-15    16-31   32-63   64-127  128-255 256-511 512-1023
      Level=4K:       3089171 49005   14016   1363    235     212     15      7       0       0       0
      Level=2M:       5951    227     0       0       0       0       0       0       0       0       0
      Level=1G:       32      0       0       0       0       0       0       0       0       0       0
      
      If we fork a few more times, some pages will accumulate even larger rmap
      counts.
      
      This patch makes PTE_LIST_EXT bigger so it is more efficient for the
      general EPT=N case: we follow list references less often and the loops
      over PTE_LIST_EXT become slightly more efficient, while the array stays
      small enough that little space is wasted when it is not full.
      
      It should not affect EPT=Y, since with EPT each page normally has zero or
      one rmap entry, so no array is allocated at all.
      
      With a test case that forks 500 children and recycles them
      ("./rmap_fork 500" [1]), this patch speeds up fork time by about 29%.
      
          Before: 473.90 (+-5.93%)
          After:  366.10 (+-4.94%)
      
      [1] https://github.com/xzpeter/clibs/commit/825436f825453de2ea5aaee4bdb1c92281efe5b3
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-6-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 03 Aug, 2021 3 commits
    • KVM: const-ify all relevant uses of struct kvm_memory_slot · 269e9552
      Hamza Mahfooz authored
      As alluded to in commit f36f3f28 ("KVM: add "new" argument to
      kvm_arch_commit_memory_region"), a number of other places where struct
      kvm_memory_slot is used need to be refactored to preserve the
      "const"-ness of struct kvm_memory_slot across the board.
      Signed-off-by: Hamza Mahfooz <someguy@effective-light.com>
      Message-Id: <20210713023338.57108-1-someguy@effective-light.com>
      [Do not touch body of slot_rmap_walk_init. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Don't take mmu_lock for range invalidation unless necessary · 071064f1
      Paolo Bonzini authored
      Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
      that are unrelated to KVM.  This is possible now that memslot updates are
      blocked from range_start() to range_end(); that ensures that lock elision
      happens in both or none, and therefore that mmu_notifier_count updates
      (which must occur while holding mmu_lock for write) are always paired
      across start->end.
      
      Based on patches originally written by Ben Gardon.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Block memslot updates across range_start() and range_end() · 52ac8b35
      Paolo Bonzini authored
      We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
      notifications that are unrelated to KVM.  Because mmu_notifier_count
      must be modified while holding mmu_lock for write, and must always
      be paired across start->end to stay balanced, lock elision must
      happen in both or none.  Therefore, in preparation for this change,
      this patch prevents memslot updates across range_start() and range_end().
      
      Note, technically flag-only memslot updates could be allowed in parallel,
      but stalling a memslot update for a relatively short amount of time is
      not a scalability issue, and this is all more than complex enough.
      
      A long note on the locking: a previous version of this patch used an
      rwsem to block memslot updates while the MMU notifiers run, but that
      resulted in the following deadlock involving the pseudo-lock tagged as
      "mmu_notifier_invalidate_range_start".
      
         ======================================================
         WARNING: possible circular locking dependency detected
         5.12.0-rc3+ #6 Tainted: G           OE
         ------------------------------------------------------
         qemu-system-x86/3069 is trying to acquire lock:
         ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190
      
         but task is already holding lock:
         ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]
      
         which lock already depends on the new lock.
      
      This corresponds to the following MMU notifier logic:
      
          invalidate_range_start
            take pseudo lock
            down_read()           (*)
            release pseudo lock
          invalidate_range_end
            take pseudo lock      (**)
            up_read()
            release pseudo lock
      
      At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
      at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.
      
      This could cause a deadlock (ignoring for a second that the pseudo lock
      is not a lock):
      
      - invalidate_range_start waits on down_read(), because the rwsem is
      held by install_new_memslots
      
      - install_new_memslots waits on down_write(), because the rwsem is
      held till (another) invalidate_range_end finishes
      
      - invalidate_range_end waits on the pseudo lock, held by
      invalidate_range_start.
      
      Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
      it would change the *shared* rwsem readers into *shared recursive*
      readers), so open-code the wait using a readers count and a
      spinlock.  This also allows handling blockable and non-blockable
      critical sections in the same way.
      
      Losing the rwsem fairness does theoretically allow MMU notifiers to
      block install_new_memslots forever.  Note that mm/mmu_notifier.c's own
      retry scheme in mmu_interval_read_begin also uses wait/wake_up
      and is likewise not fair.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 02 Aug, 2021 36 commits