1. 03 Jun, 2023 2 commits
    • Michal Luczaj's avatar
      KVM: selftests: Add test for race in kvm_recalculate_apic_map() · 47d2804b
      Michal Luczaj authored
      Keep switching between LAPIC_MODE_X2APIC and LAPIC_MODE_DISABLED during
      APIC map construction to hunt for TOCTOU bugs in KVM.  KVM's optimized map
      recalc makes multiple passes over the list of vCPUs, and the calculations
      ignore vCPU's whose APIC is hardware-disabled, i.e. there's a window where
      toggling LAPIC_MODE_DISABLED is quite interesting.
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Co-developed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20230602233250.1014316-4-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      47d2804b
    • Sean Christopherson's avatar
      KVM: x86: Bail from kvm_recalculate_phys_map() if x2APIC ID is out-of-bounds · 4364b287
      Sean Christopherson authored
      Bail from kvm_recalculate_phys_map() and disable the optimized map if the
      target vCPU's x2APIC ID is out-of-bounds, i.e. if the vCPU was added
      and/or enabled its local APIC after the map was allocated.  This fixes an
      out-of-bounds access bug in the !x2apic_format path where KVM would write
      beyond the end of phys_map.
      
      Check the x2APIC ID regardless of whether or not x2APIC is enabled,
      as KVM's hardcodes x2APIC ID to be the vCPU ID, i.e. it can't change, and
      the map allocation in kvm_recalculate_apic_map() doesn't check for x2APIC
      being enabled, i.e. the check won't get false postivies.
      
      Note, this also affects the x2apic_format path, which previously just
      ignored the "x2apic_id > new->max_apic_id" case.  That too is arguably a
      bug fix, as ignoring the vCPU meant that KVM would not send interrupts to
      the vCPU until the next map recalculation.  In practice, that "bug" is
      likely benign as a newly present vCPU/APIC would immediately trigger a
      recalc.  But, there's no functional downside to disabling the map, and
      a future patch will gracefully handle the -E2BIG case by retrying instead
      of simply disabling the optimized map.
      
      Opportunistically add a sanity check on the xAPIC ID size, along with a
      comment explaining why the xAPIC ID is guaranteed to be "good".
      Reported-by: default avatarMichal Luczaj <mhal@rbox.co>
      Fixes: 5b84b029 ("KVM: x86: Honor architectural behavior for aliased 8-bit APIC IDs")
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20230602233250.1014316-2-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      4364b287
  2. 02 Jun, 2023 3 commits
    • Sean Christopherson's avatar
      KVM: x86: Account fastpath-only VM-Exits in vCPU stats · 8b703a49
      Sean Christopherson authored
      Increment vcpu->stat.exits when handling a fastpath VM-Exit without
      going through any part of the "slow" path.  Not bumping the exits stat
      can result in wildly misleading exit counts, e.g. if the primary reason
      the guest is exiting is to program the TSC deadline timer.
      
      Fixes: 404d5d7b ("KVM: X86: Introduce more exit_fastpath_completion enum values")
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20230602011920.787844-2-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      8b703a49
    • Maciej S. Szmigiero's avatar
      KVM: SVM: vNMI pending bit is V_NMI_PENDING_MASK not V_NMI_BLOCKING_MASK · b2ce8997
      Maciej S. Szmigiero authored
      While testing Hyper-V enabled Windows Server 2019 guests on Zen4 hardware
      I noticed that with vCPU count large enough (> 16) they sometimes froze at
      boot.
      With vCPU count of 64 they never booted successfully - suggesting some kind
      of a race condition.
      
      Since adding "vnmi=0" module parameter made these guests boot successfully
      it was clear that the problem is most likely (v)NMI-related.
      
      Running kvm-unit-tests quickly showed failing NMI-related tests cases, like
      "multiple nmi" and "pending nmi" from apic-split, x2apic and xapic tests
      and the NMI parts of eventinj test.
      
      The issue was that once one NMI was being serviced no other NMI was allowed
      to be set pending (NMI limit = 0), which was traced to
      svm_is_vnmi_pending() wrongly testing for the "NMI blocked" flag rather
      than for the "NMI pending" flag.
      
      Fix this by testing for the right flag in svm_is_vnmi_pending().
      Once this is done, the NMI-related kvm-unit-tests pass successfully and
      the Windows guest no longer freezes at boot.
      
      Fixes: fa4c027a ("KVM: x86: Add support for SVM's Virtual NMI")
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/be4ca192eb0c1e69a210db3009ca984e6a54ae69.1684495380.git.maciej.szmigiero@oracle.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      b2ce8997
    • Sean Christopherson's avatar
      KVM: x86/mmu: Grab memslot for correct address space in NX recovery worker · 817fa998
      Sean Christopherson authored
      Factor in the address space (non-SMM vs. SMM) of the target shadow page
      when recovering potential NX huge pages, otherwise KVM will retrieve the
      wrong memslot when zapping shadow pages that were created for SMM.  The
      bug most visibly manifests as a WARN on the memslot being non-NULL, but
      the worst case scenario is that KVM could unaccount the shadow page
      without ensuring KVM won't install a huge page, i.e. if the non-SMM slot
      is being dirty logged, but the SMM slot is not.
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 3911 at arch/x86/kvm/mmu/mmu.c:7015
       kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
       CPU: 1 PID: 3911 Comm: kvm-nx-lpage-re
       RIP: 0010:kvm_nx_huge_page_recovery_worker+0x38c/0x3d0 [kvm]
       RSP: 0018:ffff99b284f0be68 EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff99b284edd000 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
       RBP: ffff9271397024e0 R08: 0000000000000000 R09: ffff927139702450
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff99b284f0be98
       R13: 0000000000000000 R14: ffff9270991fcd80 R15: 0000000000000003
       FS:  0000000000000000(0000) GS:ffff927f9f640000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f0aacad3ae0 CR3: 000000088fc2c005 CR4: 00000000003726e0
       Call Trace:
        <TASK>
      __pfx_kvm_nx_huge_page_recovery_worker+0x10/0x10 [kvm]
        kvm_vm_worker_thread+0x106/0x1c0 [kvm]
        kthread+0xd9/0x100
        ret_from_fork+0x2c/0x50
        </TASK>
       ---[ end trace 0000000000000000 ]---
      
      This bug was exposed by commit edbdb43f ("KVM: x86: Preserve TDP MMU
      roots until they are explicitly invalidated"), which allowed KVM to retain
      SMM TDP MMU roots effectively indefinitely.  Before commit edbdb43f,
      KVM would zap all SMM TDP MMU roots and thus all SMM TDP MMU shadow pages
      once all vCPUs exited SMM, which made the window where this bug (recovering
      an SMM NX huge page) could be encountered quite tiny.  To hit the bug, the
      NX recovery thread would have to run while at least one vCPU was in SMM.
      Most VMs typically only use SMM during boot, and so the problematic shadow
      pages were gone by the time the NX recovery thread ran.
      
      Now that KVM preserves TDP MMU roots until they are explicitly invalidated
      (e.g. by a memslot deletion), the window to trigger the bug is effectively
      never closed because most VMMs don't delete memslots after boot (except
      for a handful of special scenarios).
      
      Fixes: eb298605 ("KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages")
      Reported-by: default avatarFabio Coatti <fabio.coatti@gmail.com>
      Closes: https://lore.kernel.org/all/CADpTngX9LESCdHVu_2mQkNGena_Ng2CphWNwsRGSMxzDsTjU2A@mail.gmail.com
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20230602010137.784664-1-seanjc@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      817fa998
  3. 21 May, 2023 3 commits
  4. 19 May, 2023 4 commits
    • Michal Luczaj's avatar
      KVM: Fix vcpu_array[0] races · afb2acb2
      Michal Luczaj authored
      In kvm_vm_ioctl_create_vcpu(), add vcpu to vcpu_array iff it's safe to
      access vcpu via kvm_get_vcpu() and kvm_for_each_vcpu(), i.e. when there's
      no failure path requiring vcpu removal and destruction. Such order is
      important because vcpu_array accessors may end up referencing vcpu at
      vcpu_array[0] even before online_vcpus is set to 1.
      
      When online_vcpus=0, any call to kvm_get_vcpu() goes through
      array_index_nospec() and ends with an attempt to xa_load(vcpu_array, 0):
      
      	int num_vcpus = atomic_read(&kvm->online_vcpus);
      	i = array_index_nospec(i, num_vcpus);
      	return xa_load(&kvm->vcpu_array, i);
      
      Similarly, when online_vcpus=0, a kvm_for_each_vcpu() does not iterate over
      an "empty" range, but actually [0, ULONG_MAX]:
      
      	xa_for_each_range(&kvm->vcpu_array, idx, vcpup, 0, \
      			  (atomic_read(&kvm->online_vcpus) - 1))
      
      In both cases, such online_vcpus=0 edge case, even if leading to
      unnecessary calls to XArray API, should not be an issue; requesting
      unpopulated indexes/ranges is handled by xa_load() and xa_for_each_range().
      
      However, this means that when the first vCPU is created and inserted in
      vcpu_array *and* before online_vcpus is incremented, code calling
      kvm_get_vcpu()/kvm_for_each_vcpu() already has access to that first vCPU.
      
      This should not pose a problem assuming that once a vcpu is stored in
      vcpu_array, it will remain there, but that's not the case:
      kvm_vm_ioctl_create_vcpu() first inserts to vcpu_array, then requests a
      file descriptor. If create_vcpu_fd() fails, newly inserted vcpu is removed
      from the vcpu_array, then destroyed:
      
      	vcpu->vcpu_idx = atomic_read(&kvm->online_vcpus);
      	r = xa_insert(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL_ACCOUNT);
      	kvm_get_kvm(kvm);
      	r = create_vcpu_fd(vcpu);
      	if (r < 0) {
      		xa_erase(&kvm->vcpu_array, vcpu->vcpu_idx);
      		kvm_put_kvm_no_destroy(kvm);
      		goto unlock_vcpu_destroy;
      	}
      	atomic_inc(&kvm->online_vcpus);
      
      This results in a possible race condition when a reference to a vcpu is
      acquired (via kvm_get_vcpu() or kvm_for_each_vcpu()) moments before said
      vcpu is destroyed.
      Signed-off-by: default avatarMichal Luczaj <mhal@rbox.co>
      Message-Id: <20230510140410.1093987-2-mhal@rbox.co>
      Cc: stable@vger.kernel.org
      Fixes: c5b07754 ("KVM: Convert the kvm->vcpus array to a xarray", 2021-12-08)
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      afb2acb2
    • Jacob Xu's avatar
      KVM: VMX: Fix header file dependency of asm/vmx.h · 3367eeab
      Jacob Xu authored
      Include a definition of WARN_ON_ONCE() before using it.
      
      Fixes: bb1fcc70 ("KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT")
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarJacob Xu <jacobhxu@google.com>
      [reworded commit message; changed <asm/bug.h> to <linux/bug.h>]
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20220225012959.1554168-1-jmattson@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      3367eeab
    • Sean Christopherson's avatar
      KVM: Don't enable hardware after a restart/shutdown is initiated · e0ceec22
      Sean Christopherson authored
      Reject hardware enabling, i.e. VM creation, if a restart/shutdown has
      been initiated to avoid re-enabling hardware between kvm_reboot() and
      machine_{halt,power_off,restart}().  The restart case is especially
      problematic (for x86) as enabling VMX (or clearing GIF in KVM_RUN on
      SVM) blocks INIT, which results in the restart/reboot hanging as BIOS
      is unable to wake and rendezvous with APs.
      
      Note, this bug, and the original issue that motivated the addition of
      kvm_reboot(), is effectively limited to a forced reboot, e.g. `reboot -f`.
      In a "normal" reboot, userspace will gracefully teardown userspace before
      triggering the kernel reboot (modulo bugs, errors, etc), i.e. any process
      that might do ioctl(KVM_CREATE_VM) is long gone.
      
      Fixes: 8e1c1815 ("KVM: VMX: Disable VMX when system shutdown")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Acked-by: default avatarMarc Zyngier <maz@kernel.org>
      Message-Id: <20230512233127.804012-3-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      e0ceec22
    • Sean Christopherson's avatar
      KVM: Use syscore_ops instead of reboot_notifier to hook restart/shutdown · 6735150b
      Sean Christopherson authored
      Use syscore_ops.shutdown to disable hardware virtualization during a
      reboot instead of using the dedicated reboot_notifier so that KVM disables
      virtualization _after_ system_state has been updated.  This will allow
      fixing a race in KVM's handling of a forced reboot where KVM can end up
      enabling hardware virtualization between kernel_restart_prepare() and
      machine_restart().
      
      Rename KVM's hook to match the syscore op to avoid any possible confusion
      from wiring up a "reboot" helper to a "shutdown" hook (neither "shutdown
      nor "reboot" is completely accurate as the hook handles both).
      
      Opportunistically rewrite kvm_shutdown()'s comment to make it less VMX
      specific, and to explain why kvm_rebooting exists.
      
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Oliver Upton <oliver.upton@linux.dev>
      Cc: James Morse <james.morse@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Zenghui Yu <yuzenghui@huawei.com>
      Cc: kvmarm@lists.linux.dev
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>
      Cc: Anup Patel <anup@brainfault.org>
      Cc: Atish Patra <atishp@atishpatra.org>
      Cc: kvm-riscv@lists.infradead.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Acked-by: default avatarMarc Zyngier <maz@kernel.org>
      Message-Id: <20230512233127.804012-2-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      6735150b
  5. 17 May, 2023 1 commit
    • Paolo Bonzini's avatar
      Merge tag 'kvmarm-fixes-6.4-1' of... · 39428f6e
      Paolo Bonzini authored
      Merge tag 'kvmarm-fixes-6.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
      
      KVM/arm64 fixes for 6.4, take #1
      
      - Plug a race in the stage-2 mapping code where the IPA and the PA
        would end up being out of sync
      
      - Make better use of the bitmap API (bitmap_zero, bitmap_zalloc...)
      
      - FP/SVE/SME documentation update, in the hope that this field
        becomes clearer...
      
      - Add workaround for the usual Apple SEIS brokenness
      
      - Random comment fixes
      39428f6e
  6. 14 May, 2023 13 commits
  7. 13 May, 2023 14 commits