1. 12 May, 2022 3 commits
    • KVM: x86/mmu: Don't attempt fast page fault just because EPT is in use · 54275f74
      Sean Christopherson authored
      Check for A/D bits being disabled instead of the access tracking mask
      being non-zero when deciding whether or not to attempt to fix a page
      fault via the fast path.  Originally, the access tracking mask was
      non-zero if and only if A/D bits were disabled by _KVM_ (including not
      being supported by hardware), but that hasn't been true since nVMX was
      fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause
      KVM to not use A/D bits while running L2 despite KVM using them while
      running L1.
      
      In other words, don't attempt the fast path just because EPT is enabled.
      
      Note, attempting the fast path for all !PRESENT faults can "fix" a very,
      _VERY_ tiny percentage of faults out of mmu_lock by detecting that the
      fault is spurious, i.e. has been fixed by a different vCPU, but again the
      odds of that happening are vanishingly small.  E.g. booting an 8-vCPU VM
      gets less than 10 successes out of 30k+ faults, and that's likely one of
      the more favorable scenarios.  Disabling dirty logging can likely lead to
      a rash of collisions between vCPUs for some workloads that operate on a
      common set of pages, but penalizing _all_ !PRESENT faults for that one
      case is unlikely to be a net positive, not to mention that that problem
      is best solved by not zapping in the first place.
      
      The number of spurious faults does scale with the number of vCPUs, e.g. a
      255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast
      path (again out of 30k), but that's all of 0.2% of faults.  Using legacy
      shadow paging does get more spurious faults, and a few more detected out
      of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring
      faults that are reflected into the guest), i.e. the extra detections are
      purely due to the sheer number of faults observed.
      
      On the other hand, getting a "negative" in the fast path takes in the
      neighborhood of 150-250 cycles.  So while it is tempting to keep/extend
      the current behavior, such a change needs to come with hard numbers
      showing that it's actually a win in the grand scheme, or any scheme for
      that matter.
      
      Fixes: 995f00a6 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
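The decision described in the commit above can be sketched as a small standalone predicate. This is a simplified illustration and not the kernel code: `struct fault_info`, its fields, and the `ad_bits_enabled` parameter are hypothetical stand-ins, and the handling of present faults (only writes are candidates) is an assumption about the surrounding logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced fault descriptor (not KVM's own structure). */
struct fault_info {
	bool present;	/* the faulting SPTE was present */
	bool write;	/* the access was a write */
};

/*
 * Decide whether a fault is worth attempting outside of mmu_lock.
 * ad_bits_enabled is an assumed input: whether A/D bits are in use for
 * the page tables that took the fault.
 */
static bool fault_can_be_fast(const struct fault_info *f, bool ad_bits_enabled)
{
	/*
	 * A !PRESENT fault is only fixable in the fast path if it may be an
	 * access-tracking fault, i.e. if A/D bits are disabled.
	 */
	if (!f->present)
		return !ad_bits_enabled;

	/* Assumption: for present faults, only write faults are candidates. */
	return f->write;
}
```

The point of the sketch is the first branch: the test is "are A/D bits disabled", not "is some access tracking mask non-zero".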
    • KVM: VMX: clean up pi_wakeup_handler · 91ab933f
      Li RongQing authored
      Passing per_cpu() to list_for_each_entry() causes the macro to be
      evaluated N+1 times for N sleeping vCPUs.  This is a very small
      inefficiency, and the code is cleaner if the address of the per-CPU
      variable is loaded earlier.  Do this for both the list and the spinlock.
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Message-Id: <1649244302-6777-1-git-send-email-lirongqing@baidu.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
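The repeated evaluation mentioned above is a general property of loop macros that use their head argument in the loop condition. The following userspace sketch shows the effect with a counting stand-in for per_cpu(); `lookup_head()`, `for_each_node()` and the list type are hypothetical names, and the exact evaluation count depends on how the macro is written.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };

static struct node head;	/* head of a circular singly linked list */
static int lookups;		/* how many times the lookup ran */

/* Stand-in for per_cpu(): every use performs a (counted) lookup. */
static struct node *lookup_head(void)
{
	lookups++;
	return &head;
}

/* Same shape as a kernel list iterator: h is used in the condition. */
#define for_each_node(pos, h) \
	for ((pos) = (h)->next; (pos) != (h); (pos) = (pos)->next)

/* Link n nodes behind the head and close the circle. */
static void build_list(struct node *nodes, size_t n)
{
	struct node *prev = &head;

	for (size_t i = 0; i < n; i++) {
		prev->next = &nodes[i];
		prev = &nodes[i];
	}
	prev->next = &head;
}

/* Lookup expression passed straight into the macro. */
static int walk_inline(void)
{
	struct node *pos;

	lookups = 0;
	for_each_node(pos, lookup_head())
		;
	return lookups;
}

/* Lookup hoisted out of the loop, as the patch does. */
static int walk_hoisted(void)
{
	struct node *pos, *h;

	lookups = 0;
	h = lookup_head();
	for_each_node(pos, h)
		;
	return lookups;
}
```

With the lookup hoisted, it runs exactly once regardless of the list length; passed inline, it runs again on every check of the loop condition.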
    • KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicness · 33fbe6be
      Maxim Levitsky authored
      This shows up as a TDP MMU leak when running nested.  A non-working cmpxchg
      on L0 makes L1 install two different shadow pages under the same SPTE, and
      one of them is leaked.
      
      Fixes: 1c2361f6 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
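Why a broken compare-and-exchange leads to a leak can be seen from what a working one guarantees. The sketch below uses C11 atomics as a stand-in for the emulated guest cmpxchg; `slot` and `install_if_unchanged()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical slot standing in for the entry that two installers race on. */
static _Atomic uint64_t slot;

/*
 * Install new_val only if the slot still holds expected.  On failure the
 * caller knows it lost the race and must not treat new_val as installed,
 * so it can free whatever new_val points at.
 */
static bool install_if_unchanged(uint64_t expected, uint64_t new_val)
{
	return atomic_compare_exchange_strong(&slot, &expected, new_val);
}
```

If both racing callers were told they succeeded, which is what a non-atomic cmpxchg can do, each would believe its page is installed and one page would never be freed.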
  2. 03 May, 2022 7 commits
    • Merge branch 'kvm-amd-pmu-fixes' into HEAD · 99132883
      Paolo Bonzini authored
    • kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU · 5a1bde46
      Sandipan Das authored
      On some x86 processors, CPUID leaf 0xA provides information
      on Architectural Performance Monitoring features. It
      advertises a PMU version which Qemu uses to determine the
      availability of additional MSRs to manage the PMCs.
      
      Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for
      the same, the kernel constructs return values based on the
      x86_pmu_capability irrespective of the vendor.
      
      This leaf and the additional MSRs are not supported on AMD
      and Hygon processors. If AMD PerfMonV2 is detected, the PMU
      version is set to 2 and guest startup breaks because of an
      attempt to access a non-existent MSR. Return zeros to avoid
      this.
      
      Fixes: a6c06ed1 ("KVM: Expose the architectural performance monitoring CPUID leaf")
      Reported-by: Vasant Hegde <vasant.hegde@amd.com>
      Signed-off-by: Sandipan Das <sandipan.das@amd.com>
      Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
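The "return zeros" behaviour described above can be sketched as follows. This is an illustration only: `struct cpuid_regs`, `fill_leaf_0xa()` and its parameters are hypothetical, and the real leaf carries more fields than the version number shown here (the version ID lives in EAX bits 7:0).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the four output registers of one CPUID leaf. */
struct cpuid_regs {
	uint32_t eax, ebx, ecx, edx;
};

/*
 * Build the guest-visible leaf 0xA.  has_arch_pmu and pmu_version are
 * assumed inputs describing the host.
 */
static void fill_leaf_0xa(struct cpuid_regs *r, bool has_arch_pmu,
			  uint8_t pmu_version)
{
	r->eax = r->ebx = r->ecx = r->edx = 0;

	/* No architectural PMU (e.g. AMD, Hygon): report an all-zero leaf. */
	if (!has_arch_pmu)
		return;

	r->eax = pmu_version;	/* EAX[7:0]: architectural PMU version */
}
```

The key point is that the decision is based on whether the host has an architectural PMU, not on whether some PMU version number happens to be non-zero.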
    • KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id · 5eb84932
      Kyle Huey authored
      Zen renumbered some of the performance counters that correspond to the
      well known events in perf_hw_id. This code in KVM was never updated for
      that, so guests that attempt to use counters on Zen that correspond to the
      pre-Zen perf_hw_id values will silently receive the wrong values.
      
      This has been observed in the wild with rr[0] when running in Zen 3
      guests. rr uses the retired conditional branch counter 00d1 which is
      incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.
      
      [0] https://rr-project.org/
      
      Signed-off-by: Kyle Huey <me@kylehuey.com>
      Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
      Cc: stable@vger.kernel.org
      [Check guest family, not host. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
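A family-dependent event lookup of the kind this commit describes might look like the sketch below. Only the pre-Zen meaning of event 0xd1 is taken from the commit message; the family 17h table entry, the type names and `lookup_hw_event()` are illustrative assumptions and should not be read as the real event encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum hw_event { HW_UNKNOWN, HW_STALLED_CYCLES_BACKEND };

struct event_map {
	uint8_t eventsel;
	uint8_t unit_mask;
	enum hw_event id;
};

/* Pre-Zen: 0xd1 maps to "stalled cycles, backend" (per the commit message). */
static const struct event_map legacy_map[] = {
	{ 0xd1, 0x00, HW_STALLED_CYCLES_BACKEND },
};

/* Family 17h and later: placeholder entry for illustration only. */
static const struct event_map f17h_map[] = {
	{ 0x87, 0x01, HW_STALLED_CYCLES_BACKEND },
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Pick the table by *guest* family, then search it. */
static enum hw_event lookup_hw_event(unsigned int guest_family,
				     uint8_t eventsel, uint8_t unit_mask)
{
	const struct event_map *map = legacy_map;
	size_t n = ARRAY_LEN(legacy_map);

	if (guest_family >= 0x17) {
		map = f17h_map;
		n = ARRAY_LEN(f17h_map);
	}

	for (size_t i = 0; i < n; i++) {
		if (map[i].eventsel == eventsel && map[i].unit_mask == unit_mask)
			return map[i].id;
	}
	return HW_UNKNOWN;
}
```

With a per-family table, event 0xd1 on a Zen 3 guest (family 19h) is no longer mistaken for a generic stalled-cycles event.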
    • Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEAD · 6ea6581f
      Paolo Bonzini authored
      The TDP MMU can drop A/D bits (and W bits), even if mmu_lock is held for
      write, as volatile SPTEs can be written by other tasks/vCPUs outside of
      mmu_lock.
      
      Attempting to prove that bug exposed another notable goof, which has been
      lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
      as volatile, even though KVM never clears WRITABLE outside of MMU lock.
      As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
      update writable SPTEs.
      
      The fix does not seem to have an easily measurable effect on performance;
      page faults are so slow that wasting even a few hundred cycles is dwarfed
      by the base cost.
    • KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits · ba3a6120
      Sean Christopherson authored
      Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
      if mmu_lock is held for write, as volatile SPTEs can be written by other
      tasks/vCPUs outside of mmu_lock.  If a vCPU uses the to-be-modified SPTE
      to write a page, the CPU can cache the translation as WRITABLE in the TLB
      despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
      Accessed/Dirty bits and not properly tag the backing page.
      
      Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
      non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
      KVM doesn't consume the Accessed bit of non-leaf SPTEs.
      
      Dropping the Dirty and/or Writable bits is most problematic for dirty
      logging, as doing so can result in a missed TLB flush and eventually a
      missed dirty page.  In the unlikely event that the only dirty page(s) is
      a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
      (based on the Dirty or Writable bit depending on the method) and so not
      update the SPTE and ultimately not flush.  If the SPTE is cached in the
      TLB as writable before it is clobbered, the guest can continue writing
      the associated page without ever taking a write-protect fault.
      
      For most (all?) file-backed memory, dropping the Dirty bit is a non-issue.
      The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
      bit is effectively ignored because the primary MMU will mark that page
      dirty when the write-protection is lifted, e.g. when KVM faults the page
      back in for write.
      
      The Accessed bit is a complete non-issue.  Aside from being unused for
      non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
      Accessed bit may be dropped anyways.
      
      Lastly, the Writable bit is also problematic as an extension of the Dirty
      bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
      !DIRTY && WRITABLE.  If KVM fixes an MMU-writable, but !WRITABLE, SPTE
      out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
      the SPTE being !WRITABLE when it is checked by KVM.  But that all depends
      on the Dirty bit being problematic in the first place.
      
      Fixes: 2f2fad08 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Venkatesh Srinivas <venkateshs@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
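The difference between a plain write and an atomic exchange can be shown with C11 atomics. This is a userspace sketch, not KVM code: the Dirty bit position and both helper names are hypothetical, and the "hardware sets Dirty" step has to be simulated by the caller between taking a snapshot and writing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SPTE_DIRTY (1ull << 9)	/* hypothetical bit position */

/* Atomically replace the SPTE and return what was really there. */
static uint64_t write_spte_atomic(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	return atomic_exchange(sptep, new_spte);
}

/*
 * Non-atomic variant: overwrites the SPTE and trusts an earlier snapshot
 * as the "old" value.  Any bit set after the snapshot was taken is lost.
 */
static uint64_t write_spte_plain(_Atomic uint64_t *sptep, uint64_t snapshot,
				 uint64_t new_spte)
{
	atomic_store(sptep, new_spte);
	return snapshot;
}
```

If Dirty is set after the snapshot but before the write, only the exchange-based variant reports it, which is what lets the caller propagate the bit to the backing page.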
    • KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() · 54eb3ef5
      Sean Christopherson authored
      Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
      and into its callers.  Well, caller, since only one of its two callers
      doesn't already do the shadow-present check.
      
      Opportunistically move the helper to spte.c/h so that it can be used by
      the TDP MMU, which is also the primary motivation for the shadow-present
      change.  Unlike the legacy MMU, the TDP MMU uses a single path for clearing
      leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
      MMU will need to check is_last_spte() prior to calling
      spte_has_volatile_bits(), and calling is_last_spte() without first
      calling is_shadow_present_spte() is at best odd, and at worst a violation
      of KVM's loosely defined SPTE rules.
      
      Note, mmu_spte_clear_track_bits() could likely skip the write entirely
      for SPTEs that are not shadow-present.  Leave that cleanup for a future
      patch to avoid introducing a functional change, and because the
      shadow-present check can likely be moved further up the stack, e.g.
      drop_large_spte() appears to be the only path that doesn't already
      explicitly check for a shadow-present SPTE.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) · 706c9c55
      Sean Christopherson authored
      Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
      being volatile (unless they're volatile for other reasons, e.g. A/D bits).
      KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
      out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
      cleared just because the SPTE is MMU-writable.
      
      Rename the wrapper of MMU-writable to be more literal; the previous name,
      spte_can_locklessly_be_made_writable(), is wrong and misleading.
      
      Fixes: c7ba5b48 ("KVM: MMU: fast path of handling guest page fault")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
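The rule in this commit can be sketched as a predicate over a 64-bit SPTE. The bit positions and the name `spte_is_volatile()` are hypothetical, and the sketch assumes A/D bits are enabled and ignores access-tracked SPTEs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this sketch. */
#define SPTE_WRITABLE		(1ull << 1)
#define SPTE_ACCESSED		(1ull << 8)
#define SPTE_DIRTY		(1ull << 9)
#define SPTE_MMU_WRITABLE	(1ull << 58)

/* Can this SPTE change underneath a holder of mmu_lock? */
static bool spte_is_volatile(uint64_t spte)
{
	/*
	 * WRITABLE can be *set* outside of mmu_lock, but only on SPTEs that
	 * are MMU-writable and not yet writable.  It is never *cleared*
	 * there, so an already-writable SPTE is stable in this respect.
	 */
	if (!(spte & SPTE_WRITABLE) && (spte & SPTE_MMU_WRITABLE))
		return true;

	/* The CPU may still set Accessed ... */
	if (!(spte & SPTE_ACCESSED))
		return true;

	/* ... or Dirty, if the SPTE is writable and not yet dirty. */
	if ((spte & SPTE_WRITABLE) && !(spte & SPTE_DIRTY))
		return true;

	return false;
}
```

A fully writable, accessed and dirty SPTE is therefore not volatile, even though it is MMU-writable.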
  3. 02 May, 2022 3 commits
  4. 29 Apr, 2022 27 commits
    • KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guest · 84e5ffd0
      Lai Jiangshan authored
      When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is
      allocated with role.level = 5 and the guest pagetable's root gfn.
      
      And root_sp->spt[0] is also allocated with the same gfn and the same
      role except role.level = 4.  Luckily, they are different shadow
      pages, but only root_sp->spt[0] is the real translation of the guest
      pagetable.
      
      Here comes a problem:
      
      If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice versa)
      and uses the same gfn as the root page for nested NPT before and after
      switching gCR4_LA57, the host (hCR4_LA57=1) might use the same root_sp
      for the guest even though the guest switched gCR4_LA57.  The guest will
      see unexpected pages mapped, and L2 may exploit the bug and hurt L1.
      Luckily, the problem can't hurt L0.
      
      And three special cases need to be handled:
      
      The root_sp should be like role.direct=1 sometimes: its contents are
      not backed by gptes, root_sp->gfns is meaningless.  (For a normal high
      level sp in shadow paging, sp->gfns is often unused and kept zero, but
      it could be relevant and meaningful if sp->gfns is used because they
      are backed by concrete gptes.)
      
      For such root_sp in the case, root_sp is just a portal to contribute
      root_sp->spt[0], and root_sp->gfns should not be used and
      root_sp->spt[0] should not be dropped if gpte[0] of the guest root
      pagetable is changed.
      
      Such a root_sp should not be accounted either.
      
      So add role.passthrough to distinguish the shadow pages in the hash
      when gCR4_LA57 is toggled and fix above special cases by using it in
      kvm_mmu_page_{get|set}_gfn() and sp_has_gptes().
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
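How an extra role bit keeps two otherwise identical shadow pages apart can be sketched as follows. The union below is a heavily reduced, hypothetical version of a page role, and `same_hash_entry()` is an illustrative name; only `role.passthrough`, `role.direct`, `role.level` and `sp_has_gptes()` are named in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced page role; the real one has many more fields. */
union page_role {
	uint32_t word;
	struct {
		uint32_t level:4;
		uint32_t direct:1;
		uint32_t passthrough:1;
	};
};

struct shadow_page {
	uint64_t gfn;
	union page_role role;
};

/* Two pages are the same hash entry only if gfn and the full role match. */
static bool same_hash_entry(const struct shadow_page *a,
			    const struct shadow_page *b)
{
	return a->gfn == b->gfn && a->role.word == b->role.word;
}

/* A passthrough root is not backed by guest PTEs, just like a direct page. */
static bool sp_has_gptes(const struct shadow_page *sp)
{
	return !sp->role.direct && !sp->role.passthrough;
}
```

With the passthrough bit in the role, a level-5 root created for a 4-level guest no longer compares equal to a genuine level-5 guest root at the same gfn.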
    • KVM: X86/MMU: Add sp_has_gptes() · 767d8d8d
      Lai Jiangshan authored
      Add sp_has_gptes(), which currently equals !sp->role.direct.
      
      A shadow page that has gptes needs to be write-protected and accounted,
      and needs to respond to kvm_mmu_pte_write().
      
      Use it in these places to replace !sp->role.direct and rename
      for_each_gfn_indirect_valid_sp.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220420131204.2850-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Introduce trace point for the slow-path of avic_kic_target_vcpus · 9f084f7c
      Suravee Suthikulpanit authored
      This can help identify potential performance issues when handling
      AVIC incomplete IPIs due to the vCPU not running.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220420154954.19305-3-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Use target APIC ID to complete AVIC IRQs when possible · 7223fd2d
      Suravee Suthikulpanit authored
      Currently, an AVIC-enabled VM suffers from a performance bottleneck
      when scaling to a large number of vCPUs for I/O intensive workloads.
      
      In such a case, a vCPU often executes the halt instruction to enter an idle
      state while waiting for interrupts, at which point KVM de-schedules the vCPU
      from the physical CPU.
      
      When AVIC HW tries to deliver an interrupt to the halted vCPU, it results
      in an AVIC incomplete IPI #vmexit to notify KVM to reschedule the target
      vCPU into the running state.
      
      Investigation has shown that the main hotspot is kvm_apic_match_dest()
      in the following call stack, where it tries to find the target vCPUs
      corresponding to the information in the ICRH/ICRL registers.
      
        - handle_exit
          - svm_invoke_exit_handler
            - avic_incomplete_ipi_interception
              - kvm_apic_match_dest
      
      However, AVIC provides hints in the #vmexit info, which can be used to
      retrieve the destination guest physical APIC ID.
      
      In addition, since QEMU defines guest physical APIC ID to be the same as
      vCPU ID, it can be used to quickly identify the target vCPU to deliver IPI,
      and avoid the overhead from searching through all vCPUs to match the target
      vCPU.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220420154954.19305-2-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
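The optimisation described in this commit amounts to trying a direct index before falling back to a scan. The sketch below is a userspace illustration with hypothetical names (`struct vcpu`, `find_target()`, `NR_VCPUS`); it assumes, as the commit message says QEMU arranges, that the APIC ID usually equals the vCPU index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NR_VCPUS 8

struct vcpu {
	uint32_t apic_id;
};

static struct vcpu vcpus[NR_VCPUS];

/* Give every vCPU an APIC ID equal to its index. */
static void init_identity_ids(void)
{
	for (uint32_t i = 0; i < NR_VCPUS; i++)
		vcpus[i].apic_id = i;
}

/* Slow path: compare the destination against every vCPU. */
static struct vcpu *find_by_scan(uint32_t dest)
{
	for (size_t i = 0; i < NR_VCPUS; i++) {
		if (vcpus[i].apic_id == dest)
			return &vcpus[i];
	}
	return NULL;
}

/* Fast path: use the APIC ID as an index, verify it, else fall back. */
static struct vcpu *find_target(uint32_t dest)
{
	if (dest < NR_VCPUS && vcpus[dest].apic_id == dest)
		return &vcpus[dest];
	return find_by_scan(dest);
}
```

The verification step keeps the fast path correct even when a VMM assigns APIC IDs that differ from the vCPU index; in that case the lookup simply degrades to the scan.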
    • KVM: x86/mmu: replace direct_map with root_role.direct · 347a0d0d
      Paolo Bonzini authored
      direct_map is always equal to the direct field of the root page's role:
      
      - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is
      copied from cpu_role.base.direct
      
      - for TDP, it is always true and root_role.direct is also always true
      
      - for shadow TDP, it is always false and root_role.direct is also always
      false
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: replace root_level with cpu_role.base.level · 4d25502a
      Paolo Bonzini authored
      Remove another duplicate field of struct kvm_mmu.  This time it's
      the root level for page table walking; the separate field is
      always initialized as cpu_role.base.level, so its users can look
      up the CPU mode directly instead.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: replace shadow_root_level with root_role.level · a972e29c
      Paolo Bonzini authored
      root_role.level is always the same value as shadow_level:
      
      - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu
      
      - it's the level argument when going through kvm_init_shadow_ept_mmu
      
      - it's assigned directly from new_role.base.level when going
        through shadow_mmu_init_context
      
      Remove the duplication and get the level directly from the role.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: pull CPU mode computation to kvm_init_mmu · a7f1de9b
      Paolo Bonzini authored
      Do not lead init_kvm_*mmu into the temptation of poking
      into struct kvm_mmu_role_regs, by passing to it directly
      the CPU mode.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: simplify and/or inline computation of shadow MMU roles · 56b321f9
      Paolo Bonzini authored
      Shadow MMUs compute their role from cpu_role.base, simply by adjusting
      the root level.  It's one line of code, so do not place it in a separate
      function.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove redundant bits from extended role · faf72962
      Paolo Bonzini authored
      Before the separation of the CPU and the MMU role, CR0.PG was not
      available in the base MMU role, because two-dimensional paging always
      used direct=1 in the MMU role.  However, now that the raw role is
      snapshotted in mmu->cpu_role, the value of CR0.PG always matches both
      !cpu_role.base.direct and cpu_role.base.level > 0.  There is no need to
      store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg
      accessor by hand that takes care of the conversion.  Use cpu_role.base.level
      since the future of the direct field is unclear.
      
      Likewise, CR4.PAE is now always present in the CPU role as
      !cpu_role.base.has_4_byte_gpte.  The inversion makes certain tests on
      the MMU role easier, and is easily hidden by the is_cr4_pae accessor
      when operating on the CPU role.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
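The two hand-written accessors mentioned above derive CR0.PG and CR4.PAE from fields that are already in the CPU role. A minimal sketch, using a hypothetical reduced `struct cpu_role_base` with only the two fields named in the commit message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced CPU role: only the fields used below. */
struct cpu_role_base {
	uint8_t level;		/* 0 when paging is disabled */
	bool has_4_byte_gpte;	/* true for 32-bit non-PAE paging */
};

/* CR0.PG is set exactly when there is a guest page table to walk. */
static bool is_cr0_pg(const struct cpu_role_base *role)
{
	return role->level > 0;
}

/* CR4.PAE is stored with inverted polarity. */
static bool is_cr4_pae(const struct cpu_role_base *role)
{
	return !role->has_4_byte_gpte;
}
```

Because both values can be recomputed from the base role, storing them again in the extended role would be redundant.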
    • KVM: x86/mmu: rename kvm_mmu_role union · 7a7ae829
      Paolo Bonzini authored
      It is quite confusing that the "full" union is called kvm_mmu_role
      but is used for the "cpu_role" field of struct kvm_mmu.  Rename it
      to kvm_cpu_role.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove extended bits from mmu_role, rename field · 7a458f0e
      Paolo Bonzini authored
      mmu_role represents the role of the root of the page tables.
      It does not need any extended bits, as those govern only KVM's
      page table walking; the is_* functions used for page table
      walking always use the CPU role.
      
      ext.valid is not present anymore in the MMU role, but an
      all-zero MMU role is impossible because the level field is
      never zero in the MMU role.  So just zap the whole mmu_role
      in order to force invalidation after CPUID is updated.
      
      While making this change, which requires touching almost every
      occurrence of "mmu_role", rename it to "root_role".
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: store shadow EFER.NX in the MMU role · 362505de
      Paolo Bonzini authored
      Now that the MMU role is separate from the CPU role, it can be a
      truthful description of the format of the shadow pages.  This includes
      whether the shadow pages use the NX bit; so force the efer_nx field
      of the MMU role when TDP is disabled, and remove the hardcoding in
      the callers of reset_shadow_zero_bits_mask.
      
      In fact, the initialization of reserved SPTE bits can now be made common
      to shadow paging and shadow NPT; move it to shadow_mmu_init_context.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: cleanup computation of MMU roles for shadow paging · f417e145
      Paolo Bonzini authored
      Pass the already-computed CPU role, instead of redoing it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional paging · 2ba67677
      Paolo Bonzini authored
      Inline kvm_calc_mmu_role_common into its sole caller, and simplify it
      by removing the computation of unnecessary bits.
      
      Extended bits are unnecessary because page walking uses the CPU role,
      and EFER.NX/CR0.WP can be set to one unconditionally---matching the
      format of shadow pages rather than the format of guest pages.
      
      The MMU role for two dimensional paging does still depend on the CPU role,
      even if only barely so, due to SMM and guest mode; for consistency,
      pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying
      the vcpu with is_smm or is_guest_mode.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove kvm_calc_shadow_root_page_role_common · 19b5dcc3
      Paolo Bonzini authored
      kvm_calc_shadow_root_page_role_common is the same as
      kvm_calc_cpu_role except for the level, which is overwritten
      afterwards in kvm_calc_shadow_mmu_root_page_role
      and kvm_calc_shadow_npt_root_page_role.
      
      role.base.direct is already set correctly for the CPU role,
      and CR0.PG=1 is required for VMRUN so it will also be
      correct for nested NPT.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove ept_ad field · ec283cb1
      Paolo Bonzini authored
      The ept_ad field is used during page walk to determine if the guest PTEs
      have accessed and dirty bits.  In the MMU role, the ad_disabled
      bit represents whether the *shadow* PTEs have the bits, so it
      would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just
      !mmu->mmu_role.base.ad_disabled.
      
      However, the similar field in the CPU mode, ad_disabled, is initialized
      correctly: to the opposite value of ept_ad for shadow EPT, and zero
      for non-EPT guest paging modes (which always have A/D bits).  It is
      therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode,
      like other page-format fields; it just has to be inverted to account
      for the different polarity.
      
      In fact, now that the CPU mode is distinct from the MMU roles, it would
      even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and
      use !mmu->cpu_role.base.ad_disabled instead.  I am not doing this because
      the macro has a small effect in terms of dead code elimination:
      
         text	   data	    bss	    dec	    hex
       103544	  16665	    112	 120321	  1d601    # as of this patch
       103746	  16665	    112	 120523	  1d6cb    # without PT_HAVE_ACCESSED_DIRTY
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not recompute root level from kvm_mmu_role_regs · 60f3cb60
      Paolo Bonzini authored
      The root_level can be found in the cpu_role (in fact the field
      is superfluous and could be removed, but one thing at a time).
      Since there is only one usage left of role_regs_to_root_level,
      inline it into kvm_calc_cpu_role.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: split cpu_role from mmu_role · e5ed0fb0
      Paolo Bonzini authored
      Snapshot the state of the processor registers that govern page walk into
      a new field of struct kvm_mmu.  This is a more natural representation
      than having it *mostly* in mmu_role but not exclusively; the delta
      right now is represented in other fields, such as root_level.
      
      The nested MMU now has only the CPU role; and in fact the new function
      kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role,
      except that it has role.base.direct equal to !CR0.PG.  For a walk-only
      MMU, "direct" has no meaning, but we set it to !CR0.PG so that
      role.ext.cr0_pg can go away in a future patch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove "bool base_only" arguments · b8980508
      Paolo Bonzini authored
      The argument is always false now that kvm_mmu_calc_root_page_role has
      been removed.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Clean up and document nested #PF workaround · 6819af75
      Sean Christopherson authored
      Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround
      with an explicit, common workaround in kvm_inject_emulated_page_fault().
      Aside from being a hack, the current approach is brittle and incomplete,
      e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(),
      and nVMX fails to apply the workaround when VMX is intercepting #PF due
      to allow_smaller_maxphyaddr=1.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: rephrase unclear comment · 25cc0565
      Paolo Bonzini authored
      If accessed bits are not supported there simply isn't any distinction
      between accessed and non-accessed gPTEs, so the comment does not make
      much sense.  Rephrase it in terms of what happens if accessed bits
      *are* supported.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: pull computation of kvm_mmu_role_regs to kvm_init_mmu · 39e7e2bf
      Paolo Bonzini authored
      The init_kvm_*mmu functions, with the exception of shadow NPT,
      do not need to know the full values of CR0/CR4/EFER; they only
      need to know the bits that make up the "role".  This cleanup
      however will take quite a few incremental steps.  As a start,
      pull the common computation of the struct kvm_mmu_role_regs
      into their caller: all of them extract the struct from the vcpu
      as the very first step.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: constify uses of struct kvm_mmu_role_regs · 82ffa13f
      Paolo Bonzini authored
      struct kvm_mmu_role_regs is computed just once and then accessed.  Use
      const to make this clearer, even though the const fields of struct
      kvm_mmu_role_regs already prevent (or at least make it harder to attempt)
      modifying the contents of the struct.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: nested EPT cannot be used in SMM · daed87b8
      Paolo Bonzini authored
      The role.base.smm flag is always zero when setting up shadow EPT, so
      do not bother copying it over from vcpu->arch.root_mmu.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled · 8b9e74bf
      Sean Christopherson authored
      Clear enable_mmio_caching if hardware can't support MMIO caching and use
      the dedicated flag to detect if MMIO caching is enabled instead of
      assuming shadow_mmio_value==0 means MMIO caching is disabled.  TDX will
      use a zero value even when caching is enabled, and is_mmio_spte() isn't
      so hot that it needs to avoid an extra memory access, i.e. there's no
      reason to be super clever.  And the clever approach may not even be more
      performant, e.g. gcc-11 lands the extra check on a non-zero value inline,
      but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra
      uops for non-MMIO SPTEs.
      
      Cc: Isaku Yamahata <isaku.yamahata@intel.com>
      Cc: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220420002747.3287931-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
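The check described in this commit can be sketched as below. `enable_mmio_caching`, `shadow_mmio_value` and `is_mmio_spte()` are named in the commit message; `shadow_mmio_mask` and `configure_mmio()` are assumptions added to make the sketch self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool enable_mmio_caching;
static uint64_t shadow_mmio_mask;	/* assumed companion to the value */
static uint64_t shadow_mmio_value;

/* Hypothetical setup helper for this sketch. */
static void configure_mmio(bool enabled, uint64_t mask, uint64_t value)
{
	enable_mmio_caching = enabled;
	shadow_mmio_mask = mask;
	shadow_mmio_value = value;
}

/*
 * An SPTE is an MMIO SPTE only if caching is enabled *and* the masked bits
 * match.  A zero shadow_mmio_value is a legal pattern, so it cannot double
 * as the "caching disabled" indicator.
 */
static bool is_mmio_spte(uint64_t spte)
{
	return enable_mmio_caching &&
	       (spte & shadow_mmio_mask) == shadow_mmio_value;
}
```

With a dedicated flag, a configuration that uses a zero MMIO value still detects MMIO SPTEs, and a disabled configuration never does.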
    • KVM: x86/mmu: Check for host MMIO exclusion from mem encrypt iff necessary · 65936229
      Sean Christopherson authored
      When determining whether or not a SPTE needs to have SME/SEV's memory
      encryption flag set, do the moderately expensive host MMIO pfn check if
      and only if the memory encryption mask is non-zero.
      
      Note, KVM could further optimize the host MMIO checks by making a single
      call to kvm_is_mmio_pfn(), but the tdp_enabled path (for EPT's memtype
      handling) will likely be split out to a separate flow[*].  At that point,
      a better approach would be to shove the call to kvm_is_mmio_pfn() into
      VMX code so that AMD+NPT without SME doesn't get hit with an unnecessary
      lookup.
      
      [*] https://lkml.kernel.org/r/20220321224358.1305530-3-bgardon@google.com
      
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220415004909.2216670-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
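The "iff necessary" ordering in this commit is a short-circuit: test the cheap mask first and only then pay for the pfn check. In the sketch below, `is_host_mmio_pfn()` is a hypothetical stand-in for the real (moderately expensive) check, with an arbitrary address cutoff and a call counter so the effect is observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static int mmio_checks;	/* how often the expensive check ran */

/* Hypothetical stand-in for the host MMIO pfn check; cutoff is arbitrary. */
static bool is_host_mmio_pfn(uint64_t pfn)
{
	mmio_checks++;
	return pfn >= 0xF0000;
}

/*
 * Set the memory encryption bit(s) on an SPTE unless the pfn is host MMIO.
 * When me_mask is zero the && short-circuits and the pfn check is skipped.
 */
static uint64_t apply_mem_encrypt(uint64_t spte, uint64_t pfn, uint64_t me_mask)
{
	if (me_mask && !is_host_mmio_pfn(pfn))
		spte |= me_mask;
	return spte;
}
```

On hosts without memory encryption the mask is zero, so the pfn check never runs.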