1. 17 May, 2022 1 commit
  2. 15 May, 2022 2 commits
    • KVM: arm64: Don't hypercall before EL2 init · 2e403167
      Quentin Perret authored
      Will reported the following splat when running with Protected KVM
      enabled:
      
      [    2.427181] ------------[ cut here ]------------
      [    2.427668] WARNING: CPU: 3 PID: 1 at arch/arm64/kvm/mmu.c:489 __create_hyp_private_mapping+0x118/0x1ac
      [    2.428424] Modules linked in:
      [    2.429040] CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc2-00084-g8635adc4efc7 #1
      [    2.429589] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
      [    2.430286] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [    2.430734] pc : __create_hyp_private_mapping+0x118/0x1ac
      [    2.431091] lr : create_hyp_exec_mappings+0x40/0x80
      [    2.431377] sp : ffff80000803baf0
      [    2.431597] x29: ffff80000803bb00 x28: 0000000000000000 x27: 0000000000000000
      [    2.432156] x26: 0000000000000000 x25: 0000000000000000 x24: 0000000000000000
      [    2.432561] x23: ffffcd96c343b000 x22: 0000000000000000 x21: ffff80000803bb40
      [    2.433004] x20: 0000000000000004 x19: 0000000000001800 x18: 0000000000000000
      [    2.433343] x17: 0003e68cf7efdd70 x16: 0000000000000004 x15: fffffc81f602a2c8
      [    2.434053] x14: ffffdf8380000000 x13: ffffcd9573200000 x12: ffffcd96c343b000
      [    2.434401] x11: 0000000000000004 x10: ffffcd96c1738000 x9 : 0000000000000004
      [    2.434812] x8 : ffff80000803bb40 x7 : 7f7f7f7f7f7f7f7f x6 : 544f422effff306b
      [    2.435136] x5 : 000000008020001e x4 : ffff207d80a88c00 x3 : 0000000000000005
      [    2.435480] x2 : 0000000000001800 x1 : 000000014f4ab800 x0 : 000000000badca11
      [    2.436149] Call trace:
      [    2.436600]  __create_hyp_private_mapping+0x118/0x1ac
      [    2.437576]  create_hyp_exec_mappings+0x40/0x80
      [    2.438180]  kvm_init_vector_slots+0x180/0x194
      [    2.458941]  kvm_arch_init+0x80/0x274
      [    2.459220]  kvm_init+0x48/0x354
      [    2.459416]  arm_init+0x20/0x2c
      [    2.459601]  do_one_initcall+0xbc/0x238
      [    2.459809]  do_initcall_level+0x94/0xb4
      [    2.460043]  do_initcalls+0x54/0x94
      [    2.460228]  do_basic_setup+0x1c/0x28
      [    2.460407]  kernel_init_freeable+0x110/0x178
      [    2.460610]  kernel_init+0x20/0x1a0
      [    2.460817]  ret_from_fork+0x10/0x20
      [    2.461274] ---[ end trace 0000000000000000 ]---
      
      Indeed, the Protected KVM mode promotes __create_hyp_private_mapping()
      to a hypercall as EL1 no longer has access to the hypervisor's stage-1
      page-table. However, the call from kvm_init_vector_slots() happens after
      pKVM has been initialized on the primary CPU, but before it has been
      initialized on secondaries. As such, if the KVM initcall procedure is
      migrated from one CPU to another in this window, the hypercall may end up
      running on a CPU for which EL2 has not been initialized.
      
      Fortunately, the pKVM hypervisor doesn't rely on the host to re-map the
      vectors in the private range, so the hypercall in question is in fact
      superfluous. Skip it when pKVM is enabled.
      Reported-by: Will Deacon <will@kernel.org>
      Signed-off-by: Quentin Perret <qperret@google.com>
      [maz: simplified the checks slightly]
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220513092607.35233-1-qperret@google.com
      2e403167
    • KVM: arm64: vgic-v3: Consistently populate ID_AA64PFR0_EL1.GIC · 5163373a
      Marc Zyngier authored
      When adding support for the slightly wonky Apple M1, we had to
      populate ID_AA64PFR0_EL1.GIC==1 to present something to the guest,
      as the HW itself doesn't advertise the feature.
      
      However, we gated this on the in-kernel irqchip being created.
      This causes some trouble for QEMU, which snapshots the state of
      the registers before creating a virtual GIC, and then tries to
      restore these registers once the GIC has been created.  Obviously,
      between the two stages, ID_AA64PFR0_EL1.GIC has changed value,
      and the write fails.
      
      The fix is to actually emulate the HW, and always populate the
      field if the HW is capable of it.
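
The rule above can be sketched in a few lines of C. This is only an illustration, not the kernel's code: the helper name `guest_pfr0()` and the `hw_has_gicv3` flag are assumptions, and the GIC field is taken to be bits [27:24] of ID_AA64PFR0_EL1. The point is that the result depends only on what the hardware can do, never on whether the irqchip has been created yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ID_AA64PFR0_EL1.GIC is assumed to occupy bits [27:24]. */
#define PFR0_GIC_SHIFT 24
#define PFR0_GIC_MASK  (0xfull << PFR0_GIC_SHIFT)

/*
 * Hypothetical helper: compute the guest-visible register value.
 * No "irqchip created" input exists, so a value saved before the
 * virtual GIC is created can be restored unchanged afterwards.
 */
static uint64_t guest_pfr0(uint64_t host_val, bool hw_has_gicv3)
{
    uint64_t val = host_val;

    if (hw_has_gicv3) {
        val &= ~PFR0_GIC_MASK;
        val |= 1ull << PFR0_GIC_SHIFT;   /* advertise GIC == 1 */
    }
    return val;
}
```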
      
      Fixes: 562e530f ("KVM: arm64: Force ID_AA64PFR0_EL1.GIC=1 when exposing a virtual GICv3")
      Cc: stable@vger.kernel.org
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Reported-by: Peter Maydell <peter.maydell@linaro.org>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Link: https://lore.kernel.org/r/20220503211424.3375263-1-maz@kernel.org
      5163373a
  3. 12 May, 2022 1 commit
    • KVM: x86/mmu: Update number of zapped pages even if page list is stable · b28cb0cd
      Sean Christopherson authored
      When zapping obsolete pages, update the running count of zapped pages
      regardless of whether or not the list has become unstable due to zapping
      a shadow page with its own child shadow pages.  If the VM is backed by
      mostly 4KiB pages, KVM can zap an absurd number of SPTEs without bumping
      the batch count and thus without yielding.  In the worst case scenario,
      this can cause a soft lockup.
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020]
         RIP: 0010:workingset_activation+0x19/0x130
         mark_page_accessed+0x266/0x2e0
         kvm_set_pfn_accessed+0x31/0x40
         mmu_spte_clear_track_bits+0x136/0x1c0
         drop_spte+0x1a/0xc0
         mmu_page_zap_pte+0xef/0x120
         __kvm_mmu_prepare_zap_page+0x205/0x5e0
         kvm_mmu_zap_all_fast+0xd7/0x190
         kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10
         kvm_page_track_flush_slot+0x5c/0x80
         kvm_arch_flush_shadow_memslot+0xe/0x10
         kvm_set_memslot+0x1a8/0x5d0
         __kvm_set_memory_region+0x337/0x590
         kvm_vm_ioctl+0xb08/0x1040
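
A user-space sketch of the accounting problem, under loud assumptions: `zap_page()`, `zap_all()`, `struct page_desc` and the batch size of 10 are all hypothetical stand-ins, not KVM's functions. The `count_always` flag switches between the old behaviour (count only when the list became unstable) and the fixed one (always count), and the return value is how often the loop would have had a chance to yield.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_ZAP_PAGES 10 /* hypothetical batch size for this sketch */

/* Hypothetical page descriptor: children are zapped along with the parent. */
struct page_desc {
    unsigned int nr_children;
};

/*
 * Stand-in for the zap helper: stores the number of pages zapped in
 * *nr_zapped and returns true only if child pages were zapped too,
 * i.e. if the list became unstable.
 */
static bool zap_page(const struct page_desc *p, unsigned int *nr_zapped)
{
    *nr_zapped = 1 + p->nr_children;
    return p->nr_children != 0;
}

/* Returns how many times the loop reached a yield point. */
static unsigned int zap_all(const struct page_desc *pages, size_t n,
                            bool count_always)
{
    unsigned int batch = 0, yields = 0, nr_zapped;

    for (size_t i = 0; i < n; i++) {
        if (batch >= BATCH_ZAP_PAGES) {
            yields++;   /* a real implementation would reschedule here */
            batch = 0;
        }
        bool unstable = zap_page(&pages[i], &nr_zapped);
        /* Old behaviour: the count only moved when the list was unstable. */
        if (count_always || unstable)
            batch += nr_zapped;
    }
    return yields;
}
```

With a list made purely of leaf pages, the old accounting never reaches a yield point no matter how long the list is, which matches the soft lockup shown in the trace.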
      
      Fixes: fbb158cb ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""")
      Reported-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220511145122.3133334-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b28cb0cd
  4. 06 May, 2022 2 commits
    • KVM: VMX: Exit to userspace if vCPU has injected exception and invalid state · 053d2290
      Sean Christopherson authored
      Exit to userspace with an emulation error if KVM encounters an injected
      exception with invalid guest state, in addition to the existing check of
      bailing if there's a pending exception (KVM doesn't support emulating
      exceptions except when emulating real mode via vm86).
      
      In theory, KVM should never get to such a situation as KVM is supposed to
      exit to userspace before injecting an exception with invalid guest state.
      But in practice, userspace can intervene and manually inject an exception
      and/or stuff registers to force invalid guest state while a previously
      injected exception is awaiting reinjection.
      
      Fixes: fc4fad79 ("KVM: VMX: Reject KVM_RUN if emulation is required with pending exception")
      Reported-by: syzbot+cfafed3bb76d3e37581b@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220502221850.131873-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      053d2290
    • KVM: SEV: Mark nested locking of vcpu->lock · 0c2c7c06
      Peter Gonda authored
      svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all
      source and target vcpu->locks. Unfortunately there is an 8 subclass
      limit, so a new subclass cannot be used for each vCPU. Instead maintain
      ownership of the first vcpu's mutex.dep_map using a role specific
      subclass: source vs target. Release the other vcpu's mutex.dep_maps.
      
      Fixes: b5663931 ("KVM: SEV: Add support for SEV intra host migration")
      Reported-by: John Sperbeck <jsperbeck@google.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Peter Gonda <pgonda@google.com>
      
      Message-Id: <20220502165807.529624-1-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c2c7c06
  5. 03 May, 2022 7 commits
    • Merge branch 'kvm-amd-pmu-fixes' into HEAD · 04144108
      Paolo Bonzini authored
      04144108
    • kvm: x86/cpuid: Only provide CPUID leaf 0xA if host has architectural PMU · 5a1bde46
      Sandipan Das authored
      On some x86 processors, CPUID leaf 0xA provides information
      on Architectural Performance Monitoring features. It
      advertises a PMU version which Qemu uses to determine the
      availability of additional MSRs to manage the PMCs.
      
      Upon receiving a KVM_GET_SUPPORTED_CPUID ioctl request for
      the same, the kernel constructs return values based on the
      x86_pmu_capability irrespective of the vendor.
      
      This leaf and the additional MSRs are not supported on AMD
      and Hygon processors. If AMD PerfMonV2 is detected, the PMU
      version is set to 2 and guest startup breaks because of an
      attempt to access a non-existent MSR. Return zeros to avoid
      this.
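
A rough sketch of the "return zeros" behaviour, assuming a hypothetical `struct pmu_cap` with an `architectural` flag that is false on AMD and Hygon hosts. The EAX packing follows the usual leaf 0xA layout (version in bits 7:0, counter count in 15:8, counter width in 23:16, event mask length in 31:24); the helper name `build_leaf_0xa()` is made up for illustration and the other output registers are left at zero for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical summary of the host PMU, not the kernel's structure. */
struct pmu_cap {
    bool    architectural;  /* host really implements CPUID leaf 0xA */
    uint8_t version;
    uint8_t num_counters;
    uint8_t counter_width;
    uint8_t mask_length;
};

static struct cpuid_regs build_leaf_0xa(const struct pmu_cap *cap)
{
    struct cpuid_regs r;

    memset(&r, 0, sizeof(r));
    if (!cap->architectural)
        return r;           /* no architectural PMU: report all zeros */

    r.eax = (uint32_t)cap->version |
            ((uint32_t)cap->num_counters << 8) |
            ((uint32_t)cap->counter_width << 16) |
            ((uint32_t)cap->mask_length << 24);
    return r;
}
```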
      
      Fixes: a6c06ed1 ("KVM: Expose the architectural performance monitoring CPUID leaf")
      Reported-by: Vasant Hegde <vasant.hegde@amd.com>
      Signed-off-by: Sandipan Das <sandipan.das@amd.com>
      Message-Id: <3fef83d9c2b2f7516e8ff50d60851f29a4bcb716.1651058600.git.sandipan.das@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5a1bde46
    • KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id · 5eb84932
      Kyle Huey authored
      Zen renumbered some of the performance counters that correspond to the
      well known events in perf_hw_id. This code in KVM was never updated for
      that, so guests that attempt to use counters on Zen that correspond to the
      pre-Zen perf_hw_id values will silently receive the wrong values.
      
      This has been observed in the wild with rr[0] when running in Zen 3
      guests. rr uses the retired conditional branch counter 00d1 which is
      incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.
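
The misidentification can be illustrated with a pair of lookup tables selected by guest family. Everything here is a sketch: the enum, the table names and `lookup_hw_id()` are hypothetical, the pre-Zen entry comes from the description above, and the single family 17h entry is an assumption included only so the second table is not empty.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for perf_hw_id values; only what the sketch needs. */
enum hw_id { HW_ID_NONE = -1, HW_STALLED_CYCLES_BACKEND = 0 };

struct event_map {
    uint8_t    eventsel;
    uint8_t    unit_mask;
    enum hw_id id;
};

/* Pre-Zen numbering: 0xd1/0x00 meant "stalled cycles, backend". */
static const struct event_map pre_zen_map[] = {
    { 0xd1, 0x00, HW_STALLED_CYCLES_BACKEND },
};

/* Family 17h and later. This entry is an assumption for illustration. */
static const struct event_map zen_map[] = {
    { 0x87, 0x01, HW_STALLED_CYCLES_BACKEND },
};

static enum hw_id lookup_hw_id(unsigned int guest_family,
                               uint8_t eventsel, uint8_t unit_mask)
{
    const struct event_map *map = pre_zen_map;
    size_t n = sizeof(pre_zen_map) / sizeof(pre_zen_map[0]);

    /* Pick the table from the guest's family, not the host's. */
    if (guest_family >= 0x17) {
        map = zen_map;
        n = sizeof(zen_map) / sizeof(zen_map[0]);
    }

    for (size_t i = 0; i < n; i++)
        if (map[i].eventsel == eventsel && map[i].unit_mask == unit_mask)
            return map[i].id;

    return HW_ID_NONE;  /* not a well-known event: treat as a raw event */
}
```

With the family check in place, event 0xd1 on a Zen 3 guest no longer matches a generic event and would be programmed as a raw event instead.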
      
      [0] https://rr-project.org/
      
      Signed-off-by: Kyle Huey <me@kylehuey.com>
      Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
      Cc: stable@vger.kernel.org
      [Check guest family, not host. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5eb84932
    • Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEAD · 4f510c8b
      Paolo Bonzini authored
      The TDP MMU can drop A/D bits (and W bits) even when mmu_lock is held
      for write, because volatile SPTEs can be written by other tasks/vCPUs
      outside of mmu_lock.
      
      Attempting to prove that bug exposed another notable goof, which has been
      lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
      as volatile, even though KVM never clears WRITABLE outside of MMU lock.
      As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
      update writable SPTEs.
      
      The fix does not seem to have an easily-measurable effect on performance;
      page faults are so slow that wasting even a few hundred cycles is dwarfed
      by the base cost.
      4f510c8b
    • KVM: x86/mmu: Use atomic XCHG to write TDP MMU SPTEs with volatile bits · ba3a6120
      Sean Christopherson authored
      Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even
      if mmu_lock is held for write, as volatile SPTEs can be written by other
      tasks/vCPUs outside of mmu_lock.  If a vCPU uses the to-be-modified SPTE
      to write a page, the CPU can cache the translation as WRITABLE in the TLB
      despite it being seen by KVM as !WRITABLE, and/or KVM can clobber the
      Accessed/Dirty bits and not properly tag the backing page.
      
      Exempt non-leaf SPTEs from atomic updates as KVM itself doesn't modify
      non-leaf SPTEs without holding mmu_lock, they do not have Dirty bits, and
      KVM doesn't consume the Accessed bit of non-leaf SPTEs.
      
      Dropping the Dirty and/or Writable bits is most problematic for dirty
      logging, as doing so can result in a missed TLB flush and eventually a
      missed dirty page.  In the unlikely event that the only dirty page(s) is
      a clobbered SPTE, clear_dirty_gfn_range() will see the SPTE as not dirty
      (based on the Dirty or Writable bit depending on the method) and so not
      update the SPTE and ultimately not flush.  If the SPTE is cached in the
      TLB as writable before it is clobbered, the guest can continue writing
      the associated page without ever taking a write-protect fault.
      
      For most (all?) file-backed memory, dropping the Dirty bit is a non-issue.
      The primary MMU write-protects its PTEs on writeback, i.e. KVM's dirty
      bit is effectively ignored because the primary MMU will mark that page
      dirty when the write-protection is lifted, e.g. when KVM faults the page
      back in for write.
      
      The Accessed bit is a complete non-issue.  Aside from being unused for
      non-leaf SPTEs, KVM doesn't do a TLB flush when aging SPTEs, i.e. the
      Accessed bit may be dropped anyways.
      
      Lastly, the Writable bit is also problematic as an extension of the Dirty
      bit, as KVM (correctly) treats the Dirty bit as volatile iff the SPTE is
      !DIRTY && WRITABLE.  If KVM fixes an MMU-writable, but !WRITABLE, SPTE
      out of mmu_lock, then it can allow the CPU to set the Dirty bit despite
      the SPTE being !WRITABLE when it is checked by KVM.  But that all depends
      on the Dirty bit being problematic in the first place.
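
The difference between a plain store and an exchange can be shown with C11 atomics in user space. This is a sketch under stated assumptions, not KVM code: the bit positions and the helpers `spte_write_atomic()` and `clear_writable()` are hypothetical. The key property is that the exchange returns the value that was really in the entry, so a Dirty bit set after the caller took its snapshot is still observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen for the sketch only. */
#define SPTE_WRITABLE (1ull << 1)
#define SPTE_ACCESSED (1ull << 8)
#define SPTE_DIRTY    (1ull << 9)

/*
 * Replace the entry and return what was actually there, so bits set
 * concurrently (by hardware or another vCPU) are not silently lost.
 */
static uint64_t spte_write_atomic(_Atomic uint64_t *sptep, uint64_t new_spte)
{
    return atomic_exchange(sptep, new_spte);
}

/*
 * Write-protect an entry based on an earlier snapshot. Returns nonzero
 * if the backing page must be marked dirty, judged from the exchanged
 * value rather than from the possibly stale snapshot.
 */
static int clear_writable(_Atomic uint64_t *sptep, uint64_t snapshot)
{
    uint64_t old = spte_write_atomic(sptep, snapshot & ~SPTE_WRITABLE);

    return (old & SPTE_DIRTY) != 0;
}
```

A plain store of `snapshot & ~SPTE_WRITABLE` would leave the caller with only the stale snapshot to inspect, which is how the Dirty bit gets dropped.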
      
      Fixes: 2f2fad08 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Venkatesh Srinivas <venkateshs@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ba3a6120
    • KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits() · 54eb3ef5
      Sean Christopherson authored
      Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
      and into its callers.  Well, caller, since only one of its two callers
      doesn't already do the shadow-present check.
      
      Opportunistically move the helper to spte.c/h so that it can be used by
      the TDP MMU, which is also the primary motivation for the shadow-present
      change.  Unlike the legacy MMU, the TDP MMU uses a single path for clearing
      leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
      MMU will need to check is_last_spte() prior to calling
      spte_has_volatile_bits(), and calling is_last_spte() without first
      calling is_shadow_present_pte() is at best odd, and at worst a violation
      of KVM's loosely defined SPTE rules.
      
      Note, mmu_spte_clear_track_bits() could likely skip the write entirely
      for SPTEs that are not shadow-present.  Leave that cleanup for a future
      patch to avoid introducing a functional change, and because the
      shadow-present check can likely be moved further up the stack, e.g.
      drop_large_spte() appears to be the only path that doesn't already
      explicitly check for a shadow-present SPTE.
      
      No functional change intended.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54eb3ef5
    • KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D) · 706c9c55
      Sean Christopherson authored
      Don't treat SPTEs that are truly writable, i.e. writable in hardware, as
      being volatile (unless they're volatile for other reasons, e.g. A/D bits).
      KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit
      out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get
      cleared just because the SPTE is MMU-writable.
      
      Rename the wrapper of MMU-writable to be more literal, the previous name
      of spte_can_locklessly_be_made_writable() is wrong and misleading.
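
One way to picture the resulting rule is a small predicate over a 64-bit entry. This is a simplified sketch, not the kernel's spte_has_volatile_bits(): the bit layout and the name `has_volatile_bits()` are assumptions, and it presumes hardware A/D bits are in use and the entry is shadow-present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for the sketch only. */
#define SPTE_WRITABLE     (1ull << 1)
#define SPTE_ACCESSED     (1ull << 8)
#define SPTE_DIRTY        (1ull << 9)
#define SPTE_MMU_WRITABLE (1ull << 58)

static bool has_volatile_bits(uint64_t spte)
{
    /* WRITABLE can be set outside the lock, but only if still clear. */
    if (!(spte & SPTE_WRITABLE) && (spte & SPTE_MMU_WRITABLE))
        return true;

    /* Hardware may set Accessed at any time. */
    if (!(spte & SPTE_ACCESSED))
        return true;

    /* Hardware may set Dirty, but only through a writable mapping. */
    if ((spte & SPTE_WRITABLE) && !(spte & SPTE_DIRTY))
        return true;

    return false;
}
```

Note that an entry which is already writable, accessed and dirty is not volatile even though it is MMU-writable, since nothing left in it can change outside the lock.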
      
      Fixes: c7ba5b48 ("KVM: MMU: fast path of handling guest page fault")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220423034752.1161007-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      706c9c55
  6. 29 Apr, 2022 7 commits
    • KVM: x86: work around QEMU issue with synthetic CPUID leaves · f751d8ea
      Paolo Bonzini authored
      Synthesizing AMD leaves up to 0x80000021 caused problems with QEMU,
      which assumes the *host* CPUID[0x80000000].EAX is greater than or equal
      to what KVM_GET_SUPPORTED_CPUID reports.
      
      This causes QEMU to issue bogus host CPUIDs when preparing the input
      to KVM_SET_CPUID2.  It can even get into an infinite loop, which is
      only terminated by an abort():
      
         cpuid_data is full, no space for cpuid(eax:0x8000001d,ecx:0x3e)
      
      To work around this, only synthesize those leaves if 0x8000001d exists
      on the host.  The synthetic 0x80000021 leaf is mostly useful on Zen2,
      which satisfies the condition.
      
      Fixes: f144c49e ("KVM: x86: synthesize CPUID leaf 0x80000021h if useful")
      Reported-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f751d8ea
    • Revert "x86/mm: Introduce lookup_address_in_mm()" · 643d95aa
      Sean Christopherson authored
      Drop lookup_address_in_mm() now that KVM is providing its own variant
      of lookup_address_in_pgd() that is safe for use with user addresses, e.g.
      guards against page tables being torn down.  A variant that provides a
      non-init mm is inherently dangerous and flawed, as the only reason to use
      an mm other than init_mm is to walk a userspace mapping, and
      lookup_address_in_pgd() does not play nice with userspace mappings, e.g.
      doesn't disable IRQs to block TLB shootdowns and doesn't use READ_ONCE()
      to ensure an upper level entry isn't converted to a huge page between
      checking the PAGE_SIZE bit and grabbing the address of the next level
      down.
      
      This reverts commit 13c72c06.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <YmwIi3bXr/1yhYV/@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      643d95aa
    • Merge branch 'kvm-fixes-for-5.18-rc5' into HEAD · 73331c5d
      Paolo Bonzini authored
      Fixes for (relatively) old bugs, to be merged in both the -rc and next
      development trees:
      
      * Fix potential races when walking host page table
      
      * Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT
      
      * Fix shadow page table leak when KVM runs nested
      73331c5d
    • KVM: x86/mmu: fix potential races when walking host page table · 44187235
      Mingwei Zhang authored
      KVM uses lookup_address_in_mm() to detect the hugepage size that the host
      uses to map a pfn.  The function suffers from several issues:
      
       - no usage of READ_ONCE(*). This allows multiple dereferences of the same
         page table entry. The resulting TOCTOU problem may cause KVM to
         incorrectly treat a newly generated leaf entry as a nonleaf one, and
         dereference the content by using its pfn value.
      
       - the information returned does not match what KVM needs; for non-present
         entries it returns the level at which the walk was terminated, as long
         as the entry is not 'none'.  KVM needs level information of only 'present'
         entries, otherwise it may regard a non-present PXE entry as a present
         large page mapping.
      
       - the function is not safe for mappings that can be torn down, because it
         does not disable IRQs and because it returns a PTE pointer which is never
         safe to dereference after the function returns.
      
      So implement the logic for walking host page tables directly in KVM, and
      stop using lookup_address_in_mm().
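
The first two points can be sketched as a walk that snapshots each entry exactly once and reports a large-page level only for present entries. This is a heavily simplified user-space illustration: `host_mapping_level()`, the `READ_ONCE_U64` macro and the level constants are hypothetical, the two entries are passed in already resolved instead of being derived from each other, and the IRQ disabling that protects a real walk is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Force a single read of the entry; later checks use the local copy. */
#define READ_ONCE_U64(p) (*(const volatile uint64_t *)(p))

#define ENTRY_PRESENT (1ull << 0)
#define ENTRY_HUGE    (1ull << 7)

enum { LEVEL_4K = 1, LEVEL_2M = 2, LEVEL_1G = 3 };

static int host_mapping_level(const uint64_t *pudp, const uint64_t *pmdp)
{
    uint64_t pud, pmd;

    pud = READ_ONCE_U64(pudp);   /* one read, then only the copy is used */
    if (!(pud & ENTRY_PRESENT))
        return LEVEL_4K;         /* never report a level for non-present */
    if (pud & ENTRY_HUGE)
        return LEVEL_1G;

    pmd = READ_ONCE_U64(pmdp);
    if (!(pmd & ENTRY_PRESENT))
        return LEVEL_4K;
    if (pmd & ENTRY_HUGE)
        return LEVEL_2M;

    return LEVEL_4K;
}
```

Because the function returns a level rather than a pointer into the page tables, nothing the caller holds can dangle after the walk ends.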
      
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220429031757.2042406-1-mizhang@google.com>
      [Inline in host_pfn_mapping_level, ensure no semantic change for its
       callers. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      44187235
    • KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENT · d495f942
      Paolo Bonzini authored
      When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags
      member that at the time was unused.  Unfortunately this extensibility
      mechanism has several issues:
      
      - x86 is not writing the member, so it would not be possible to use it
        on x86 except for new events
      
      - the member is not aligned to 64 bits, so the definition of the
        uAPI struct is incorrect for 32- on 64-bit userspace.  This is a
        problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately
        usage of flags was only introduced in 5.18.
      
      Since padding has to be introduced, place a new field in there
      that tells if the flags field is valid.  To allow further extensibility,
      in fact, change flags to an array of 16 values, and store how many
      of the values are valid.  The availability of the new ndata field
      is tied to a system capability; all architectures are changed to
      fill in the field.
      
      To avoid breaking compilation of userspace that was using the flags
      field, provide a userspace-only union to overlap flags with data[0].
      The new field is placed at the same offset for both 32- and 64-bit
      userspace.
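
The layout described above can be approximated as follows. The struct name `system_event_sketch` and the helper `legacy_flags()` are hypothetical, and the real uAPI definition should be taken from the kernel headers. The compile-time checks show the property that matters: with a 32-bit count filling the former padding, the 64-bit payload sits at offset 8 regardless of whether userspace is 32- or 64-bit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct system_event_sketch {
    uint32_t type;
    uint32_t ndata;         /* number of valid entries in data[] */
    union {
        uint64_t flags;     /* legacy name, overlaps data[0] */
        uint64_t data[16];
    };
};

/* The payload starts at the same offset for every ABI. */
_Static_assert(offsetof(struct system_event_sketch, data) == 8,
               "payload must follow the two 32-bit fields");
_Static_assert(offsetof(struct system_event_sketch, flags) == 8,
               "flags must alias data[0]");

/* Old userspace reading "flags" sees data[0], if the kernel filled it. */
static uint64_t legacy_flags(const struct system_event_sketch *ev)
{
    return ev->ndata > 0 ? ev->data[0] : 0;
}
```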
      
      Cc: Will Deacon <will@kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220422103013.34832-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d495f942
    • KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson authored
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
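
The inclusive/exclusive distinction is easy to get wrong, so here is a small sketch of how an inclusive maximum might be used by both kinds of caller. The helper names (`max_gfn()`, `memslot_fits()`, `mmio_gfn_ok()`) and the explicit `maxphyaddr_bits` parameter are assumptions for illustration; overflow of `base_gfn + npages` is deliberately not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

typedef uint64_t gfn_t;

/* Highest legal gfn, inclusive, for the given physical address width. */
static gfn_t max_gfn(unsigned int maxphyaddr_bits)
{
    return (1ull << (maxphyaddr_bits - PAGE_SHIFT)) - 1;
}

/* Memslot-style check: the last gfn of the range must not exceed the max. */
static bool memslot_fits(gfn_t base_gfn, uint64_t npages,
                         unsigned int maxphyaddr_bits)
{
    return npages != 0 &&
           base_gfn + npages - 1 <= max_gfn(maxphyaddr_bits);
}

/* MMIO-style check: a single gfn compared against the inclusive max. */
static bool mmio_gfn_ok(gfn_t gfn, unsigned int maxphyaddr_bits)
{
    return gfn <= max_gfn(maxphyaddr_bits);
}
```

Callers that want an exclusive bound can add one to the returned value, which stays representable because the maximum width discussed here is 52 bits.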
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      86931ff7
    • Merge tag 'kvmarm-fixes-5.18-2' of... · 484c22df
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 5.18, take #2
      
      - Take care of faults occurring between the PARange and
        IPA range by injecting an exception
      
      - Fix S2 faults taken from a host EL0 in protected mode
      
      - Work around Oops caused by a PMU access from a 32-bit
        guest when PMU has been created. This is a temporary
        bodge until we fix it for good.
      484c22df
  7. 27 Apr, 2022 3 commits
    • KVM: arm64: Inject exception on out-of-IPA-range translation fault · 85ea6b1e
      Marc Zyngier authored
      When taking a translation fault for an IPA that is outside of
      the range defined by the hypervisor (between the HW PARange and
      the IPA range), we stupidly treat it as an IO and forward the access
      to userspace. Of course, userspace can't do much with it, and things
      end badly.
      
      Arguably, the guest is braindead, but we should at least catch the
      case and inject an exception.
      
      Check the faulting IPA against:
      - the sanitised PARange: inject an address size fault
      - the IPA size: inject an abort
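
The two checks listed above can be sketched as a small classifier. This is an illustration only: the enum, the function name `classify_fault()` and the way the two limits are passed in as bit widths are assumptions, not the kernel's interface.

```c
#include <assert.h>
#include <stdint.h>

enum fault_action {
    HANDLE_NORMALLY,    /* inside the IPA range: memslot or IO handling */
    INJECT_SIZE_FAULT,  /* beyond the sanitised PARange */
    INJECT_ABORT,       /* between the IPA range and the PARange */
};

static enum fault_action classify_fault(uint64_t ipa,
                                        unsigned int parange_bits,
                                        unsigned int ipa_bits)
{
    /* Check the larger limit first so the two cases stay distinct. */
    if (ipa >= (1ull << parange_bits))
        return INJECT_SIZE_FAULT;

    if (ipa >= (1ull << ipa_bits))
        return INJECT_ABORT;

    return HANDLE_NORMALLY;
}
```

For example, with a 48-bit PARange and a 40-bit IPA space, an access at 1 << 41 is answered with an abort instead of being forwarded to userspace as IO.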
      Reported-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      85ea6b1e
    • KVM/arm64: Don't emulate a PMU for 32-bit guests if feature not set · 8f6379e2
      Alexandru Elisei authored
      kvm->arch.arm_pmu is set when userspace attempts to set the first PMU
      attribute. As certain attributes are mandatory, arm_pmu ends up always
      being set to a valid arm_pmu, otherwise KVM will refuse to run the VCPU.
      However, this only happens if the VCPU has the PMU feature. If the VCPU
      doesn't have the feature bit set, kvm->arch.arm_pmu will be left
      uninitialized and equal to NULL.
      
      KVM doesn't do ID register emulation for 32-bit guests and accesses to the
      PMU registers aren't gated by the pmu_visibility() function. This is done
      to prevent injecting unexpected undefined exceptions in guests which have
      detected the presence of a hardware PMU. But even though the VCPU feature
      is missing, KVM still attempts to emulate certain aspects of the PMU when
      PMU registers are accessed. This leads to a NULL pointer dereference like
      this one, which happens on an odroid-c4 board when running the
      kvm-unit-tests pmu-cycle-counter test with kvmtool and without the PMU
      feature being set:
      
      [  454.402699] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000150
      [  454.405865] Mem abort info:
      [  454.408596]   ESR = 0x96000004
      [  454.411638]   EC = 0x25: DABT (current EL), IL = 32 bits
      [  454.416901]   SET = 0, FnV = 0
      [  454.419909]   EA = 0, S1PTW = 0
      [  454.423010]   FSC = 0x04: level 0 translation fault
      [  454.427841] Data abort info:
      [  454.430687]   ISV = 0, ISS = 0x00000004
      [  454.434484]   CM = 0, WnR = 0
      [  454.437404] user pgtable: 4k pages, 48-bit VAs, pgdp=000000000c924000
      [  454.443800] [0000000000000150] pgd=0000000000000000, p4d=0000000000000000
      [  454.450528] Internal error: Oops: 96000004 [#1] PREEMPT SMP
      [  454.456036] Modules linked in:
      [  454.459053] CPU: 1 PID: 267 Comm: kvm-vcpu-0 Not tainted 5.18.0-rc4 #113
      [  454.465697] Hardware name: Hardkernel ODROID-C4 (DT)
      [  454.470612] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [  454.477512] pc : kvm_pmu_event_mask.isra.0+0x14/0x74
      [  454.482427] lr : kvm_pmu_set_counter_event_type+0x2c/0x80
      [  454.487775] sp : ffff80000a9839c0
      [  454.491050] x29: ffff80000a9839c0 x28: ffff000000a83a00 x27: 0000000000000000
      [  454.498127] x26: 0000000000000000 x25: 0000000000000000 x24: ffff00000a510000
      [  454.505198] x23: ffff000000a83a00 x22: ffff000003b01000 x21: 0000000000000000
      [  454.512271] x20: 000000000000001f x19: 00000000000003ff x18: 0000000000000000
      [  454.519343] x17: 000000008003fe98 x16: 0000000000000000 x15: 0000000000000000
      [  454.526416] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
      [  454.533489] x11: 000000008003fdbc x10: 0000000000009d20 x9 : 000000000000001b
      [  454.540561] x8 : 0000000000000000 x7 : 0000000000000d00 x6 : 0000000000009d00
      [  454.547633] x5 : 0000000000000037 x4 : 0000000000009d00 x3 : 0d09000000000000
      [  454.554705] x2 : 000000000000001f x1 : 0000000000000000 x0 : 0000000000000000
      [  454.561779] Call trace:
      [  454.564191]  kvm_pmu_event_mask.isra.0+0x14/0x74
      [  454.568764]  kvm_pmu_set_counter_event_type+0x2c/0x80
      [  454.573766]  access_pmu_evtyper+0x128/0x170
      [  454.577905]  perform_access+0x34/0x80
      [  454.581527]  kvm_handle_cp_32+0x13c/0x160
      [  454.585495]  kvm_handle_cp15_32+0x1c/0x30
      [  454.589462]  handle_exit+0x70/0x180
      [  454.592912]  kvm_arch_vcpu_ioctl_run+0x1c4/0x5e0
      [  454.597485]  kvm_vcpu_ioctl+0x23c/0x940
      [  454.601280]  __arm64_sys_ioctl+0xa8/0xf0
      [  454.605160]  invoke_syscall+0x48/0x114
      [  454.608869]  el0_svc_common.constprop.0+0xd4/0xfc
      [  454.613527]  do_el0_svc+0x28/0x90
      [  454.616803]  el0_svc+0x34/0xb0
      [  454.619822]  el0t_64_sync_handler+0xa4/0x130
      [  454.624049]  el0t_64_sync+0x18c/0x190
      [  454.627675] Code: a9be7bfd 910003fd f9000bf3 52807ff3 (b9415001)
      [  454.633714] ---[ end trace 0000000000000000 ]---
      
      In this particular case, Linux hasn't detected the presence of a hardware
      PMU because the PMU node is missing from the DTB, so userspace would have
      been unable to set the VCPU PMU feature even if it attempted it. What
      happens is that the 32-bit guest reads ID_DFR0, which advertises the
      presence of the PMU, and when it tries to program a counter, it triggers
      the NULL pointer dereference because kvm->arch.arm_pmu is NULL.
      
      kvm->arch.arm_pmu was introduced by commit 46b18782 ("KVM: arm64:
      Keep a per-VM pointer to the default PMU"). Until that commit, this
      error would be triggered instead:
      
      [   73.388140] ------------[ cut here ]------------
      [   73.388189] Unknown PMU version 0
      [   73.390420] WARNING: CPU: 1 PID: 264 at arch/arm64/kvm/pmu-emul.c:36 kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.399821] Modules linked in:
      [   73.402835] CPU: 1 PID: 264 Comm: kvm-vcpu-0 Not tainted 5.17.0 #114
      [   73.409132] Hardware name: Hardkernel ODROID-C4 (DT)
      [   73.414048] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
      [   73.420948] pc : kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.425863] lr : kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.430779] sp : ffff80000a8db9b0
      [   73.434055] x29: ffff80000a8db9b0 x28: ffff000000dbaac0 x27: 0000000000000000
      [   73.441131] x26: ffff000000dbaac0 x25: 00000000c600000d x24: 0000000000180720
      [   73.448203] x23: ffff800009ffbe10 x22: ffff00000b612000 x21: 0000000000000000
      [   73.455276] x20: 000000000000001f x19: 0000000000000000 x18: ffffffffffffffff
      [   73.462348] x17: 000000008003fe98 x16: 0000000000000000 x15: 0720072007200720
      [   73.469420] x14: 0720072007200720 x13: ffff800009d32488 x12: 00000000000004e6
      [   73.476493] x11: 00000000000001a2 x10: ffff800009d32488 x9 : ffff800009d32488
      [   73.483565] x8 : 00000000ffffefff x7 : ffff800009d8a488 x6 : ffff800009d8a488
      [   73.490638] x5 : ffff0000f461a9d8 x4 : 0000000000000000 x3 : 0000000000000001
      [   73.497710] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff000000dbaac0
      [   73.504784] Call trace:
      [   73.507195]  kvm_pmu_event_mask.isra.0+0x6c/0x74
      [   73.511768]  kvm_pmu_set_counter_event_type+0x2c/0x80
      [   73.516770]  access_pmu_evtyper+0x128/0x16c
      [   73.520910]  perform_access+0x34/0x80
      [   73.524532]  kvm_handle_cp_32+0x13c/0x160
      [   73.528500]  kvm_handle_cp15_32+0x1c/0x30
      [   73.532467]  handle_exit+0x70/0x180
      [   73.535917]  kvm_arch_vcpu_ioctl_run+0x20c/0x6e0
      [   73.540489]  kvm_vcpu_ioctl+0x2b8/0x9e0
      [   73.544283]  __arm64_sys_ioctl+0xa8/0xf0
      [   73.548165]  invoke_syscall+0x48/0x114
      [   73.551874]  el0_svc_common.constprop.0+0xd4/0xfc
      [   73.556531]  do_el0_svc+0x28/0x90
      [   73.559808]  el0_svc+0x28/0x80
      [   73.562826]  el0t_64_sync_handler+0xa4/0x130
      [   73.567054]  el0t_64_sync+0x1a0/0x1a4
      [   73.570676] ---[ end trace 0000000000000000 ]---
      [   73.575382] kvm: pmu event creation failed -2
      
      The root cause remains the same: kvm->arch.pmuver was never set to
      something sensible because the VCPU feature itself was never set.
      
      The odroid-c4 is somewhat of a special case, because Linux doesn't probe
      the PMU. But the above errors can easily be reproduced on any hardware,
      with or without a PMU driver, as long as userspace doesn't set the PMU
      feature.
      
      Work around the fact that KVM advertises a PMU even when the VCPU feature
      is not set by gating all PMU emulation on the feature. The guest can still
      access the registers without KVM injecting an undefined exception.
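
The gating described above can be modelled in user space. This is only a sketch, not the kernel change itself: `struct vcpu_model`, `struct pmu_model`, and `set_counter_event_type()` are hypothetical stand-ins for the VCPU, the per-VM PMU pointer, and the emulation path that crashed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-VM PMU description. */
struct pmu_model {
	uint64_t event_mask;
};

/* Hypothetical stand-in for the VCPU state relevant to the bug. */
struct vcpu_model {
	bool has_pmu_feature;            /* set by userspace, or not */
	const struct pmu_model *arm_pmu; /* NULL when the feature is unset */
};

/*
 * Emulate a write to an event type register. Without the feature the
 * access is silently ignored: no undefined exception is injected and
 * the NULL arm_pmu pointer is never dereferenced.
 */
static uint64_t set_counter_event_type(const struct vcpu_model *vcpu,
				       uint64_t data)
{
	if (!vcpu->has_pmu_feature)
		return 0;

	return data & vcpu->arm_pmu->event_mask;
}
```

With the early return in place, a guest that trusts ID_DFR0 and programs a counter gets a no-op instead of taking the host down.
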
      Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220425145530.723858-1-alexandru.elisei@arm.com
      8f6379e2
    • Will Deacon's avatar
      KVM: arm64: Handle host stage-2 faults from 32-bit EL0 · 2a50fc5f
      Will Deacon authored
      When pKVM is enabled, host memory accesses are translated by an identity
      mapping at stage-2, which is populated lazily in response to synchronous
      exceptions from 64-bit EL1 and EL0.
      
      Extend this handling to cover exceptions originating from 32-bit EL0 as
      well. Although these are very unlikely to occur in practice, as the
      kernel typically ensures that user pages are initialised before mapping
      them in, drivers could still map previously untouched device pages into
      userspace and expect things to work rather than panic the system.
      
      Cc: Quentin Perret <qperret@google.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Signed-off-by: Will Deacon <will@kernel.org>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
      Link: https://lore.kernel.org/r/20220427171332.13635-1-will@kernel.org
      2a50fc5f
  8. 21 Apr, 2022 17 commits
    • Paolo Bonzini's avatar
      kvm: selftests: introduce and use more page size-related constants · e852be8b
      Paolo Bonzini authored
      Clean up code that was hardcoding masks for various fields,
      now that the masks are included in processor.h.
      
      For more cleanup, define PAGE_SIZE and PAGE_MASK just like in Linux.
      PAGE_SIZE in particular was defined by several tests.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e852be8b
    • Paolo Bonzini's avatar
      kvm: selftests: do not use bitfields larger than 32-bits for PTEs · f18b4aeb
      Paolo Bonzini authored
      Red Hat's QE team reported test failure on access_tracking_perf_test:
      
      Testing guest mode: PA-bits:ANY, VA-bits:48,  4K pages
      guest physical test memory offset: 0x3fffbffff000
      
      Populating memory             : 0.684014577s
      Writing to populated memory   : 0.006230175s
      Reading from populated memory : 0.004557805s
      ==== Test Assertion Failure ====
        lib/kvm_util.c:1411: false
        pid=125806 tid=125809 errno=4 - Interrupted system call
           1  0x0000000000402f7c: addr_gpa2hva at kvm_util.c:1411
           2   (inlined by) addr_gpa2hva at kvm_util.c:1405
           3  0x0000000000401f52: lookup_pfn at access_tracking_perf_test.c:98
           4   (inlined by) mark_vcpu_memory_idle at access_tracking_perf_test.c:152
           5   (inlined by) vcpu_thread_main at access_tracking_perf_test.c:232
           6  0x00007fefe9ff81ce: ?? ??:0
           7  0x00007fefe9c64d82: ?? ??:0
        No vm physical memory at 0xffbffff000
      
      I can easily reproduce it with an Intel(R) Xeon(R) CPU E5-2630 with 46 bits
      PA.
      
      It turns out that the address translation for clearing idle page tracking
      returned a wrong result; addr_gva2gpa()'s last step, which is based on
      "pte[index[0]].pfn", did the calculation with 40 bits length and the
      high 12 bits got truncated.  In the above case the GPA to be returned
      should be 0x3fffbffff000 for GVA 0xc0000000, but it got truncated into
      0xffbffff000 and the subsequent gpa2hva lookup failed.
      
      The width of operations on bit fields greater than 32-bit is
      implementation defined, and differs between GCC (which uses the bitfield
      precision) and clang (which uses 64-bit arithmetic), so this is a
      potential minefield.  Remove the bit fields and use manual masking
      instead.
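
The truncation can be reproduced with plain 64-bit arithmetic. The sketch below is illustrative only: the macro and function names are hypothetical, and the mask assumes the usual x86-64 layout with the frame address in PTE bits 51:12. `pte_to_gpa_truncated()` models the buggy result by keeping only 40 bits of the shifted PFN.

```c
#include <stdint.h>

#define PAGE_SHIFT    12
#define PAGE_SIZE     (1ULL << PAGE_SHIFT)
#define PAGE_MASK     (~(PAGE_SIZE - 1))
/* Assumed layout: physical frame address in PTE bits 51:12. */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Manual masking: every step is done in full 64-bit arithmetic. */
static uint64_t pte_to_gpa(uint64_t pte, uint64_t gva)
{
	return (pte & PTE_ADDR_MASK) | (gva & ~PAGE_MASK);
}

/* Model of the bug: the shifted PFN is cut down to 40 bits. */
static uint64_t pte_to_gpa_truncated(uint64_t pte, uint64_t gva)
{
	uint64_t pfn = (pte & PTE_ADDR_MASK) >> PAGE_SHIFT;
	uint64_t addr = (pfn << PAGE_SHIFT) & ((1ULL << 40) - 1);

	return addr | (gva & ~PAGE_MASK);
}
```

For a PTE pointing at 0x3fffbffff000 and GVA 0xc0000000, the masked version returns 0x3fffbffff000 while the truncated model returns 0xffbffff000, the address from the failed assertion.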
      
      Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2075036
      Reported-by: Nana Liu <nanliu@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Tested-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f18b4aeb
    • Mingwei Zhang's avatar
      KVM: SEV: add cache flush to solve SEV cache incoherency issues · 683412cc
      Mingwei Zhang authored
      Flush the CPU caches when memory is reclaimed from an SEV guest (where
      reclaim also includes it being unmapped from KVM's memslots).  Due to lack
      of coherency for SEV encrypted memory, failure to flush results in silent
      data corruption if userspace is malicious/broken and doesn't ensure SEV
      guest memory is properly pinned and unpinned.
      
      Cache coherency is not enforced across the VM boundary in SEV (AMD APM
      vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
      VM guests have to be explicitly flushed on the host side. If a memory page
      containing dirty confidential cachelines was released by VM and reallocated
      to another user, the cachelines may corrupt the new user at a later time.
      
      KVM takes a shortcut by assuming all confidential memory remains pinned
      until the end of VM lifetime. Therefore, KVM does not flush cache at
      mmu_notifier invalidation events. Because of this incorrect assumption and
      the lack of cache flushing, malicious userspace can crash the host kernel
      by creating a malicious VM that continuously allocates and releases
      unpinned confidential memory pages while the VM is running.
      
      Add cache flush operations to mmu_notifier operations to ensure that any
      physical memory leaving the guest VM gets flushed. In particular, hook
      mmu_notifier_invalidate_range_start and mmu_notifier_release events and
      flush the cache accordingly. The hook runs after releasing the mmu lock to
      avoid contention with other vCPUs.
      
      Cc: stable@vger.kernel.org
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reported-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-4-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      683412cc
    • Mingwei Zhang's avatar
      KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs · d45829b3
      Mingwei Zhang authored
      Use clflush_cache_range() to flush the confidential memory when
      SME_COHERENT is supported on AMD CPUs. A cache flush is still needed since
      SME_COHERENT only supports cache invalidation on the CPU side. All
      confidential cache lines are still incoherent with DMA devices.
      
      Cc: stable@vger.kernel.org
      
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-3-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d45829b3
    • Sean Christopherson's avatar
      KVM: SVM: Simplify and harden helper to flush SEV guest page(s) · 4bbef7e8
      Sean Christopherson authored
      Rework sev_flush_guest_memory() to explicitly handle only a single page,
      and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails.  Per-page
      flushing is currently used only to flush the VMSA, and in its current
      form, the helper is completely broken with respect to flushing actual
      guest memory, i.e. won't work correctly for an arbitrary memory range.
      
      VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
      walks, i.e. will fault if the address is not present in the host page
      tables or does not have the correct permissions.  Current AMD CPUs also
      do not honor SMAP overrides (undocumented in kernel versions of the APM),
      so passing in a userspace address is completely out of the question.  In
      other words, KVM would need to manually walk the host page tables to get
      the pfn, ensure the pfn is stable, and then use the direct map to invoke
      VM_PAGE_FLUSH.  And the latter might not even work, e.g. if userspace is
      particularly evil/clever and backs the guest with Secret Memory (which
      unmaps memory from the direct map).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reported-by: Mingwei Zhang <mizhang@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-2-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4bbef7e8
    • Thomas Huth's avatar
      KVM: selftests: Silence compiler warning in the kvm_page_table_test · 266a19a0
      Thomas Huth authored
      When compiling kvm_page_table_test.c, I get this compiler warning
      with gcc 11.2:
      
      kvm_page_table_test.c: In function 'pre_init_before_test':
      ../../../../tools/include/linux/kernel.h:44:24: warning: comparison of
       distinct pointer types lacks a cast
         44 |         (void) (&_max1 == &_max2);              \
            |                        ^~
      kvm_page_table_test.c:281:21: note: in expansion of macro 'max'
        281 |         alignment = max(0x100000, alignment);
            |                     ^~~
      
      Fix it by adjusting the type of the absolute value.
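
A small GNU C sketch of why the warning fires and how a typed constant avoids it. The macro mirrors the type-checking style quoted in the warning; `pick_alignment()` and the `unsigned long` type of `alignment` are assumptions made for illustration.

```c
/*
 * Type-checked max() in the style shown in the warning (GNU C).
 * Comparing the addresses of the two temporaries makes the compiler
 * complain when the operand types differ.
 */
#define max(x, y) ({				\
	__typeof__(x) _max1 = (x);		\
	__typeof__(y) _max2 = (y);		\
	(void) (&_max1 == &_max2);		\
	_max1 > _max2 ? _max1 : _max2; })

/* Hypothetical helper; alignment is assumed to be unsigned long. */
static unsigned long pick_alignment(unsigned long alignment)
{
	/*
	 * A bare 0x100000 is an int, so &_max1 and &_max2 would have
	 * distinct pointer types. The UL suffix makes both operands
	 * unsigned long and the comparison well-typed.
	 */
	return max(0x100000UL, alignment);
}
```
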
      Signed-off-by: Thomas Huth <thuth@redhat.com>
      Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Message-Id: <20220414103031.565037-1-thuth@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      266a19a0
    • Like Xu's avatar
      KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog · 75189d1d
      Like Xu authored
      NMI-watchdog is one of the favorite features of kernel developers,
      but it does not work in AMD guest even with vPMU enabled and worse,
      the system misrepresents this capability via /proc.
      
      This is a PMC emulation error. KVM does not pass the latest valid
      value to perf_event in time when guest NMI-watchdog is running, thus
      the perf_event corresponding to the watchdog counter will enter the
      old state at some point after the first guest NMI injection, forcing
      the hardware register PMC0 to be constantly written to 0x800000000001.
      
      Meanwhile, the running counter should accurately reflect its new value
      based on the latest coordinated pmc->counter (from vPMC's point of view)
      rather than the value written directly by the guest.
      
      Fixes: 168d918f ("KVM: x86: Adjust counter sample period after a wrmsr")
      Reported-by: Dongli Cao <caodongli@kingsoft.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Reviewed-by: Yanan Wang <wangyanan55@huawei.com>
      Tested-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20220409015226.38619-1-likexu@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      75189d1d
    • Wanpeng Li's avatar
      x86/kvm: Preserve BSP MSR_KVM_POLL_CONTROL across suspend/resume · 0361bdfd
      Wanpeng Li authored
      MSR_KVM_POLL_CONTROL is cleared on reset, thus reverting guests to
      host-side polling after suspend/resume.  Non-bootstrap CPUs are
      restored correctly by the haltpoll driver because they are hot-unplugged
      during suspend and hot-plugged during resume; however, the BSP
      is not hotpluggable and remains in host-side polling mode after
      the guest resumes.  This makes the guest pay for the cost of vmexits
      every time the guest enters idle.
      
      Fix it by recording BSP's haltpoll state and resuming it during guest
      resume.
      
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650267752-46796-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0361bdfd
    • Tom Rix's avatar
      KVM: SPDX style and spelling fixes · a413a625
      Tom Rix authored
      SPDX comments use /* */ style comments in headers and
      // style comments in .c files.  Also fix two spelling mistakes.
      Signed-off-by: Tom Rix <trix@redhat.com>
      Message-Id: <20220410153840.55506-1-trix@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a413a625
    • Sean Christopherson's avatar
      KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled · 0047fb33
      Sean Christopherson authored
      Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
      disabled at the module level to avoid having to acquire the mutex and
      potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
      never be lifted, so piling on more inhibits is unnecessary.
      
      Fixes: cae72dcc ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active")
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0047fb33
    • Sean Christopherson's avatar
      KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race · 423ecfea
      Sean Christopherson authored
      Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
      in-kernel local APIC and APICv enabled at the module level.  Consuming
      kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
      race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
      before the vCPU is fully onlined, i.e. it won't get the request made to
      "all" vCPUs.  If APICv is globally inhibited between setting apicv_active
      and onlining the vCPU, the vCPU will end up running with APICv enabled
      and trigger KVM's sanity check.
      
      Mark APICv as active during vCPU creation if APICv is enabled at the
      module level, both to be optimistic about its final state, e.g. to avoid
      additional VMWRITEs on VMX, and because there are likely bugs lurking
      since KVM checks apicv_active in multiple vCPU creation paths.  While
      keeping the current behavior of consuming kvm_apicv_activated() is
      arguably safer from a regression perspective, force apicv_active so that
      vCPU creation runs with deterministic state and so that if there are bugs,
      they are found sooner than later, i.e. not when some crazy race condition
      is hit.
      
        WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Modules linked in:
        CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
        RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
        Call Trace:
         <TASK>
         vcpu_run arch/x86/kvm/x86.c:10039 [inline]
         kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234
         kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:874 [inline]
         __se_sys_ioctl fs/ioctl.c:860 [inline]
         __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860
         do_syscall_x64 arch/x86/entry/common.c:50 [inline]
         do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a
      call to KVM_SET_GUEST_DEBUG.
      
        r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0)
        r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
        ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async)
        r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async)
        r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002)
        ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5})
        ioctl$KVM_RUN(r2, 0xae80, 0x0)
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 8df14af4 ("kvm: x86: Add support for dynamic APICv activation")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      423ecfea
    • Sean Christopherson's avatar
      KVM: nVMX: Defer APICv updates while L2 is active until L1 is active · 7c69661e
      Sean Christopherson authored
      Defer APICv updates that occur while L2 is active until nested VM-Exit,
      i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
      is active and (a) stomps all over vmcs02 and (b) neglects to ever update
      vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
      APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
      becomes uninhibited while L2 is active, KVM will set various APICv controls
      in vmcs02 and trigger a failed VM-Entry.  The kicker is that, unless
      running with nested_early_check=1, KVM blames L1 and chaos ensues.
      
      In all cases, ignoring vmcs02 and always deferring the inhibition change
      to vmcs01 is correct (or at least acceptable).  The ABSENT and DISABLE
      inhibitions cannot truly change while L2 is active (see below).
      
      IRQ_BLOCKING can change, but it is firmly a best effort debug feature.
      Furthermore, only L2's APIC is accelerated/virtualized to the full extent
      possible, e.g. even if L1 passes through its APIC to L2, normal MMIO/MSR
      interception will apply to the virtual APIC managed by KVM.
      The exception is the SELF_IPI register when x2APIC is enabled, but that's
      an acceptable hole.
      
      Lastly, Hyper-V's Auto EOI can technically be toggled if L1 exposes the
      MSRs to L2, but for that to work in any sane capacity, L1 would need to
      pass through IRQs to L2 as well, and IRQs must be intercepted to enable
      virtual interrupt delivery.  I.e. exposing Auto EOI to L2 and enabling
      VID for L2 are, for all intents and purposes, mutually exclusive.
      
      Lack of dynamic toggling is also why this scenario is all but impossible
      to encounter in KVM's current form.  But a future patch will pend an
      APICv update request _during_ vCPU creation to plug a race where a vCPU
      that's being created doesn't get included in the "all vCPUs request"
      because it's not yet visible to other vCPUs.  If userspace restores L2
      after VM creation (hello, KVM selftests), the first KVM_RUN will occur
      while L2 is active and thus service the APICv update request made during
      VM creation.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220420013732.3308816-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c69661e
    • Sean Christopherson's avatar
      KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled · 80f0497c
      Sean Christopherson authored
      Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
      module param.  A recent refactoring to add a wrapper for setting/clearing
      inhibits unintentionally changed the flag, probably due to a copy+paste
      goof.
      
      Fixes: 4f4c4a3e ("KVM: x86: Trace all APICv inhibit changes and capture overall status")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220420013732.3308816-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      80f0497c
    • Sean Christopherson's avatar
      KVM: Initialize debugfs_dentry when a VM is created to avoid NULL deref · 5c697c36
      Sean Christopherson authored
      Initialize debugfs_dentry to its semi-magical -ENOENT value when the VM
      is created.  KVM's teardown when VM creation fails is kludgy and calls
      kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never
      attempted kvm_create_vm_debugfs().  Because debugfs_dentry is zero
      initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer.
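
Why a zeroed pointer slips past the guard can be shown with a user-space model of the kernel's error-pointer convention. This is a sketch under assumptions: `err_ptr()`, `is_err()`, and `teardown_touches_dentry()` are hypothetical lowercase stand-ins, and `MAX_ERRNO` is taken to be 4095 as in the kernel's err.h.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno as a pointer in the top page of the address space. */
static void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Only the last MAX_ERRNO addresses count as errors; NULL does not. */
static bool is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/*
 * Teardown bails out early only when the dentry is an error pointer.
 * Returns true when it would go on to dereference the dentry.
 */
static bool teardown_touches_dentry(const void *debugfs_dentry)
{
	return !is_err(debugfs_dentry);
}
```

A zero-initialized field is NULL, which is not an error pointer, so teardown proceeds and dereferences it. Starting from `err_ptr(-ENOENT)` makes the same guard take the early exit.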
      
        BUG: kernel NULL pointer dereference, address: 0000000000000018
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:__dentry_path+0x7b/0x130
        Call Trace:
         <TASK>
         dentry_path_raw+0x42/0x70
         kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm]
         kvm_put_kvm+0x63/0x2b0 [kvm]
         kvm_dev_ioctl+0x43a/0x920 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x31/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        Modules linked in: kvm_intel kvm irqbypass
      
      Fixes: a44a4cc1 ("KVM: Don't create VM debugfs files outside of the VM directory")
      Cc: stable@vger.kernel.org
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Oliver Upton <oupton@google.com>
      Reported-by: syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20220415004622.2207751-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5c697c36
    • Sean Christopherson's avatar
      KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused · 2031f287
      Sean Christopherson authored
      Add wrappers to acquire/release KVM's SRCU lock when stashing the index
      in vcpu->srcu_idx, along with rudimentary detection of illegal usage,
      e.g. re-acquiring SRCU and thus overwriting vcpu->srcu_idx.  Because the
      SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
      unnoticed for quite some time and only cause problems when the nested
      lock happens to get a different index.
      
      Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
      likely yell so loudly that it will bring the kernel to its knees.
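
The detection idea can be sketched in user space with a depth counter next to the stashed index. Everything here is a hypothetical model rather than the kernel helpers: `struct vcpu_model`, `fake_srcu_read_lock()`, and the `warned` flag are stand-ins, and the fake lock alternates its index on every call purely to make an overwrite visible.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the vCPU fields involved. */
struct vcpu_model {
	int srcu_idx;   /* index returned by the most recent read-lock */
	int srcu_depth; /* debug-only nesting counter */
	bool warned;    /* report only once, in the spirit of a ONCE warning */
};

/* Stand-in for the SRCU read-lock: returns 0 or 1, alternating per call. */
static int fake_srcu_read_lock(int *epoch)
{
	return (*epoch)++ & 1;
}

static void warn_once(struct vcpu_model *v, const char *what)
{
	if (!v->warned)
		fprintf(stderr, "illegal vCPU srcu_idx usage: %s\n", what);
	v->warned = true;
}

static void vcpu_srcu_read_lock(struct vcpu_model *v, int *epoch)
{
	/* A non-zero depth means the stashed index is about to be lost. */
	if (v->srcu_depth++ != 0)
		warn_once(v, "nested lock overwrites srcu_idx");
	v->srcu_idx = fake_srcu_read_lock(epoch);
}

static void vcpu_srcu_read_unlock(struct vcpu_model *v)
{
	if (--v->srcu_depth != 0)
		warn_once(v, "unbalanced unlock");
}
```

The counter catches the nested acquisition immediately, even in runs where both acquisitions would have returned the same index and the overwrite would otherwise go unnoticed.
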
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Fabiano Rosas <farosas@linux.ibm.com>
      Message-Id: <20220415004343.2203171-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2031f287
    • Sean Christopherson's avatar
      KVM: RISC-V: Use kvm_vcpu.srcu_idx, drop RISC-V's unnecessary copy · fdd6f6ac
      Sean Christopherson authored
      Use the generic kvm_vcpu's srcu_idx instead of using an identical field
      in RISC-V's version of kvm_vcpu_arch.  Generic KVM very intentionally
      does not touch vcpu->srcu_idx, i.e. there's zero chance of running afoul
      of common code.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220415004343.2203171-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fdd6f6ac
    • Sean Christopherson's avatar
      KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io() · 2d089356
      Sean Christopherson authored
      Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the
      lock in kvm_arch_vcpu_ioctl_run().  More importantly, don't overwrite
      vcpu->srcu_idx.  If the index acquired by complete_emulated_io() differs
      from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively
      leak a lock and hang if/when synchronize_srcu() is invoked for the
      relevant grace period.
      
      Fixes: 8d25b7be ("KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220415004343.2203171-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2d089356