1. 02 Dec, 2022 7 commits
  2. 01 Dec, 2022 23 commits
      KVM: selftests: Define and use a custom static assert in lib headers · 0c326523
      Sean Christopherson authored
      Define and use kvm_static_assert() in the common KVM selftests headers to
      provide deterministic behavior, and to allow creating static asserts
      without dummy messages.
      
      The kernel's static_assert() makes the message param optional, and on the
      surface, tools/include/linux/build_bug.h appears to follow suit.  However,
      glibc may override static_assert() and redefine it as a direct alias of
      _Static_assert(), which makes the message parameter mandatory.  This leads
      to non-deterministic behavior as KVM selftests code that utilizes
      static_assert() without a custom message may or may not compile depending on
      the order of includes.  E.g. recently added asserts in
      x86_64/processor.h fail on some systems with errors like
      
        In file included from lib/memstress.c:11:0:
        include/x86_64/processor.h: In function ‘this_cpu_has_p’:
        include/x86_64/processor.h:193:34: error: expected ‘,’ before ‘)’ token
          static_assert(low_bit < high_bit);     \
                                          ^
      due to _Static_assert() expecting a comma before a message.  The "message
      optional" version of static_assert() uses macro magic to strip away the
      comma when presented with an empty __VA_ARGS__:
      
        #ifndef static_assert
        #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
        #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
        #endif // static_assert
      
      and effectively generates "_Static_assert(expr, #expr)".
      
      The incompatible version of static_assert() gets defined by this snippet
      in /usr/include/assert.h:
      
        #if defined __USE_ISOC11 && !defined __cplusplus
        # undef static_assert
        # define static_assert _Static_assert
        #endif
      
      which yields "_Static_assert(expr)" and thus fails as above.
      
      KVM selftests don't actually care about using C11, but __USE_ISOC11 gets
      defined because of _GNU_SOURCE, which many tests do #define.  _GNU_SOURCE
      triggers a massive pile of defines in /usr/include/features.h, including
      _ISOC11_SOURCE:
      
        /* If _GNU_SOURCE was defined by the user, turn on all the other features.  */
        #ifdef _GNU_SOURCE
        # undef  _ISOC95_SOURCE
        # define _ISOC95_SOURCE 1
        # undef  _ISOC99_SOURCE
        # define _ISOC99_SOURCE 1
        # undef  _ISOC11_SOURCE
        # define _ISOC11_SOURCE 1
        # undef  _POSIX_SOURCE
        # define _POSIX_SOURCE  1
        # undef  _POSIX_C_SOURCE
        # define _POSIX_C_SOURCE        200809L
        # undef  _XOPEN_SOURCE
        # define _XOPEN_SOURCE  700
        # undef  _XOPEN_SOURCE_EXTENDED
        # define _XOPEN_SOURCE_EXTENDED 1
        # undef  _LARGEFILE64_SOURCE
        # define _LARGEFILE64_SOURCE    1
        # undef  _DEFAULT_SOURCE
        # define _DEFAULT_SOURCE        1
        # undef  _ATFILE_SOURCE
        # define _ATFILE_SOURCE 1
        #endif
      
      which further down in /usr/include/features.h leads to:
      
        /* This is to enable the ISO C11 extension.  */
        #if (defined _ISOC11_SOURCE \
             || (defined __STDC_VERSION__ && __STDC_VERSION__ >= 201112L))
        # define __USE_ISOC11   1
        #endif
      
      To make matters worse, /usr/include/assert.h doesn't guard against
      multiple inclusion by turning itself into a nop, but instead #undefs a
      few macros and continues on.  As a result, it's all but impossible to
      ensure the "message optional" version of static_assert() will actually be
      used, e.g. explicitly including assert.h and #undef'ing static_assert()
      doesn't work as a later inclusion of assert.h will again redefine its
      version.
      
        #ifdef  _ASSERT_H
      
        # undef _ASSERT_H
        # undef assert
        # undef __ASSERT_VOID_CAST
      
        # ifdef __USE_GNU
        #  undef assert_perror
        # endif
      
        #endif /* assert.h      */
      
        #define _ASSERT_H       1
        #include <features.h>
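
A minimal sketch of the approach described above, assuming a GNU-compatible compiler (gcc or clang), since `##__VA_ARGS__` is a GNU extension. The macro bodies mirror the "message optional" snippet quoted earlier, only under the private kvm_static_assert() name; the `high_nibble()` helper and its `LOW_BIT`/`HIGH_BIT` constants are hypothetical and exist only to show the assert being used without a message.

```c
#include <assert.h>	/* glibc may alias static_assert to _Static_assert here */
#include <stdint.h>

/*
 * Message-optional static assert under a private name, so that no libc
 * header can redefine it.  With no message, ##__VA_ARGS__ swallows the
 * preceding comma and the stringified expression becomes the message.
 */
#define __kvm_static_assert(expr, msg, ...) _Static_assert(expr, msg)
#define kvm_static_assert(expr, ...) \
	__kvm_static_assert(expr, ##__VA_ARGS__, #expr)

/* Both forms compile regardless of include order. */
kvm_static_assert(sizeof(uint32_t) == 4);
kvm_static_assert(sizeof(uint64_t) == 8, "uint64_t must be 8 bytes");

/* Hypothetical bit range, checked at build time without a dummy message. */
#define LOW_BIT		4
#define HIGH_BIT	7

static inline unsigned int high_nibble(unsigned int val)
{
	kvm_static_assert(LOW_BIT < HIGH_BIT);

	return (val >> LOW_BIT) & ((1u << (HIGH_BIT - LOW_BIT + 1)) - 1);
}
```

Because the macro name is not one that assert.h touches, a later inclusion of assert.h cannot change how these asserts expand.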
      
      Fixes: fcba483e ("KVM: selftests: Sanity check input to ioctls() at build time")
      Fixes: ee379553 ("KVM: selftests: Refactor X86_FEATURE_* framework to prep for X86_PROPERTY_*")
      Fixes: 53a7dc0f ("KVM: selftests: Add X86_PROPERTY_* framework to retrieve CPUID values")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221122013309.1872347-1-seanjc@google.com
      KVM: selftests: Do kvm_cpu_has() checks before creating VM+vCPU · 553d1652
      Sean Christopherson authored
      Move the AMX test's kvm_cpu_has() checks before creating the VM+vCPU, as
      there are no dependencies between the two operations.  Opportunistically
      add a comment to call out that enabling off-by-default XSAVE-managed
      features must be done before KVM_GET_SUPPORTED_CPUID is cached.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-5-seanjc@google.com
      KVM: selftests: Disallow "get supported CPUID" before REQ_XCOMP_GUEST_PERM · cd5f3d21
      Sean Christopherson authored
      Disallow using kvm_get_supported_cpuid() and thus caching KVM's supported
      CPUID info before enabling XSAVE-managed features that are off-by-default
      and must be enabled by ARCH_REQ_XCOMP_GUEST_PERM.  Caching the supported
      CPUID before all XSAVE features are enabled can result in false negatives
      due to testing features that were cached before they were enabled.
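
The stale-cache problem described above can be illustrated with a toy model. This is only a sketch: every name in it (`get_supported_cpuid()`, `request_permission()`, `cpu_has_amx()`, `reset_model()`) is hypothetical and merely stands in for the selftests helpers, and the guard returns false where the real selftest would presumably assert.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for what the kernel would report if asked right now. */
static bool amx_permission_granted;

struct cpuid_snapshot {
	bool has_amx;
};

static struct cpuid_snapshot cache;
static struct cpuid_snapshot *cached;	/* NULL until the first query */

/* Query once, then keep returning the cached result. */
static const struct cpuid_snapshot *get_supported_cpuid(void)
{
	if (!cached) {
		cache.has_amx = amx_permission_granted;
		cached = &cache;
	}
	return cached;
}

static bool cpu_has_amx(void)
{
	return get_supported_cpuid()->has_amx;
}

/*
 * Enable the off-by-default feature.  Refuse if the supported CPUID was
 * already cached, because the cache would then hold a stale "no".
 */
static bool request_permission(void)
{
	if (cached)
		return false;

	amx_permission_granted = true;
	return true;
}

/* Test-only helper: return the model to its initial state. */
static void reset_model(void)
{
	amx_permission_granted = false;
	cached = NULL;
}
```

In this model, calling `cpu_has_amx()` before `request_permission()` locks in a negative answer, which is the false negative the commit guards against.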
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-4-seanjc@google.com
      KVM: selftests: Move __vm_xsave_require_permission() below CPUID helpers · 2ceade1d
      Sean Christopherson authored
      Move __vm_xsave_require_permission() below the CPUID helpers so that a
      future change can reference the cached result of KVM_GET_SUPPORTED_CPUID
      while keeping the definition of the variable close to its intended user,
      kvm_get_supported_cpuid().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-3-seanjc@google.com
      KVM: selftests: Move XFD CPUID checking out of __vm_xsave_require_permission() · 18eee7bf
      Lei Wang authored
      Move the kvm_cpu_has() check on X86_FEATURE_XFD out of the helper to
      enable off-by-default XSAVE-managed features and into the one test that
      currently requires XFD (XFeature Disable) support.  kvm_cpu_has() uses
      kvm_get_supported_cpuid() and thus caches KVM_GET_SUPPORTED_CPUID, and so
      using kvm_cpu_has() before ARCH_REQ_XCOMP_GUEST_PERM effectively results
      in the test caching stale values, e.g. subsequent checks on AMX_TILE will
      get false negatives.
      
      Although off-by-default features are nonsensical without XFD, checking
      for XFD virtualization prior to enabling such features isn't strictly
      required.
      Signed-off-by: Lei Wang <lei4.wang@intel.com>
      Fixes: 7fbb653e ("KVM: selftests: Check KVM's supported CPUID, not host CPUID, for XFD")
      Link: https://lore.kernel.org/r/20221125023839.315207-1-lei4.wang@intel.com
      [sean: add Fixes, reword changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-2-seanjc@google.com
      KVM: selftests: Restore assert for non-nested VMs in access tracking test · 8fcee042
      Sean Christopherson authored
      Restore the assert (on x86-64) that <10% of pages are still idle when NOT
      running as a nested VM in the access tracking test.  The original assert
      was converted to a "warning" to avoid false failures when running the
      test in a VM, but the non-nested case does not suffer from the same
      "infinite TLB size" issue.
      
      Using the HYPERVISOR flag isn't infallible as VMMs aren't strictly
      required to enumerate the "feature" in CPUID, but practically speaking
      anyone that is running KVM selftests in VMs is going to be using a VMM
      and hypervisor that sets the HYPERVISOR flag.
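
The resulting policy can be sketched as a small decision function. The names here (`check_idle_pages()`, `enum idle_verdict`, the `running_in_vm` parameter) are hypothetical; in the test itself the nested case is said to be detected via the HYPERVISOR CPUID flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum idle_verdict {
	IDLE_OK,	/* fewer than 10% of pages are still idle */
	IDLE_WARN,	/* too many idle pages, but running in a VM */
	IDLE_FAIL,	/* too many idle pages on bare metal: assert */
};

/*
 * Too many idle pages means ">= 10% of the total", not "< 10%".  The
 * failure is downgraded to a warning only when running as a nested VM.
 */
static enum idle_verdict check_idle_pages(uint64_t still_idle, uint64_t pages,
					  bool running_in_vm)
{
	if (still_idle < pages / 10)
		return IDLE_OK;

	return running_in_vm ? IDLE_WARN : IDLE_FAIL;
}
```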
      
      Cc: David Matlack <dmatlack@google.com>
      Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221129175300.4052283-3-seanjc@google.com
      KVM: selftests: Fix inverted "warning" in access tracking perf test · a33004e8
      Sean Christopherson authored
      Warn if the number of idle pages is greater than or equal to 10% of the
      total number of pages, not if the percentage of idle pages is less than
      10%.  The original code asserted that less than 10% of pages were still
      idle, but the check got inverted when the assert was converted to a
      warning.
      
      Opportunistically clean up the warning; selftests are 64-bit only, there
      is no need to use "%PRIu64" instead of "%lu".
      
      Fixes: 6336a810 ("KVM: selftests: replace assertion with warning in access_tracking_perf_test")
      Reviewed-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221129175300.4052283-2-seanjc@google.com
      KVM: x86: Use current rather than snapshotted TSC frequency if it is constant · 3ebcbd22
      Anton Romanov authored
      Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is
      constant.  In that case the actual TSC frequency will never change, so
      capturing it during initialization is unnecessary; KVM can simply use
      tsc_khz.  The snapshot is taken via
      kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL).
      
      On CPUs with constant TSC, but not a hardware-specified TSC frequency,
      snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency
      can lead a VM to think its TSC frequency is not what it actually is if
      refining the TSC completes after KVM snapshots tsc_khz.  The actual
      frequency never changes, only the kernel's calculation of what that
      frequency is changes.
      
      Ideally, KVM would not be able to race with TSC refinement, or would have
      a hook into tsc_refine_calibration_work() to get an alert when refinement
      is complete.  Avoiding the race altogether isn't practical as refinement
      takes a relative eternity; it's deliberately put on a work queue outside of
      the normal boot sequence to avoid unnecessarily delaying boot.
      
      Adding a hook is doable, but somewhat gross due to KVM's ability to be
      built as a module.  And if the TSC is constant, which is likely the case
      for every VMX/SVM-capable CPU produced in the last decade, the race can be
      hit if and only if userspace is able to create a VM before TSC refinement
      completes; refinement is slow, but not that slow.
      
      For now, punt on a proper fix, as not taking a snapshot can help some use
      cases and not taking a snapshot is arguably correct irrespective of the
      race with refinement.
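
A userspace sketch of the behavior described above, under the assumption that a single CPU is modeled. `struct tsc_model`, `take_snapshot()` and `effective_tsc_khz()` are hypothetical names, not the kernel's; the point is only that a constant TSC always reads the kernel's current estimate, so a refinement that lands after the snapshot is still picked up.

```c
#include <assert.h>
#include <stdbool.h>

struct tsc_model {
	bool constant_tsc;		/* host TSC frequency never changes */
	unsigned int tsc_khz;		/* kernel's current frequency estimate */
	unsigned int snapshot_khz;	/* per-CPU snapshot (one CPU modeled) */
};

/* Snapshot only when the frequency can actually change. */
static void take_snapshot(struct tsc_model *m)
{
	if (!m->constant_tsc)
		m->snapshot_khz = m->tsc_khz;
}

/* Frequency used when setting a VM's target TSC frequency. */
static unsigned int effective_tsc_khz(const struct tsc_model *m)
{
	return m->constant_tsc ? m->tsc_khz : m->snapshot_khz;
}
```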
      Signed-off-by: Anton Romanov <romanton@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220608183525.1143682-1-romanton@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      KVM: selftests: Verify userspace can stuff IA32_FEATURE_CONTROL at will · b80732fd
      Sean Christopherson authored
      Verify that KVM allows userspace to set all supported bits in the
      IA32_FEATURE_CONTROL MSR irrespective of the current guest CPUID, and
      that all unsupported bits are rejected.
      
      Throw the testcase into vmx_msrs_test even though it's not technically a
      VMX MSR; it's close enough, and the feature most frequently controlled by
      the MSR is VMX.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-4-seanjc@google.com
      KVM: VMX: Move MSR_IA32_FEAT_CTL.LOCKED check into "is valid" helper · 2d6cd686
      Sean Christopherson authored
      Move the check on IA32_FEATURE_CONTROL being locked, i.e. read-only from
      the guest, into the helper to check the overall validity of the incoming
      value.  Opportunistically rename the helper to make it clear that it
      returns a bool.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-3-seanjc@google.com
      KVM: VMX: Allow userspace to set all supported FEATURE_CONTROL bits · d2a00af2
      Sean Christopherson authored
      Allow userspace to set all supported bits in MSR IA32_FEATURE_CONTROL
      irrespective of the guest CPUID model, e.g. via KVM_SET_MSRS.  KVM's ABI
      is that userspace is allowed to set MSRs before CPUID, i.e. can set MSRs
      to values that would fault according to the guest CPUID model.
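
The rule described above can be sketched as a single validity check. This is a simplified model, not KVM's helper: the function name and all parameters are hypothetical. It assumes userspace (host-initiated) writes are checked against every bit KVM supports, while guest writes are checked against the bits allowed by the guest CPUID model and are rejected once the MSR is locked. Bit 0 is the architectural lock bit of IA32_FEATURE_CONTROL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FEAT_CTL_LOCKED	(1ULL << 0)	/* lock bit of IA32_FEATURE_CONTROL */

/*
 * cur:              current MSR value
 * data:             value being written
 * supported_bits:   every bit KVM supports (used for userspace writes)
 * guest_valid_bits: bits permitted by the guest CPUID model (guest writes)
 */
static bool feature_control_is_valid(uint64_t cur, uint64_t data,
				     uint64_t supported_bits,
				     uint64_t guest_valid_bits,
				     bool host_initiated)
{
	uint64_t valid_bits;

	/* A locked MSR is read-only from the guest, but not from userspace. */
	if (!host_initiated && (cur & FEAT_CTL_LOCKED))
		return false;

	valid_bits = host_initiated ? supported_bits : guest_valid_bits;

	return !(data & ~valid_bits);
}
```

Folding the lock check into the same helper keeps every reason for rejecting a write in one place.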
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-2-seanjc@google.com
      KVM: VMX: Make vmread_error_trampoline() uncallable from C code · 0b5e7a16
      Sean Christopherson authored
      Declare vmread_error_trampoline() as an opaque symbol so that it cannot
      be called from C code, at least not without some serious fudging.  The
      trampoline always passes parameters on the stack so that the inline
      VMREAD sequence doesn't need to clobber registers.  regparm(0) was
      originally added to document the stack behavior, but it ended up being
      confusing because regparm(0) is a nop for 64-bit targets.
      
      Opportunistically wrap the trampoline and its declaration in #ifdeffery
      to make it even harder to invoke incorrectly, to document why it exists,
      and so that it's not left behind if/when CONFIG_CC_HAS_ASM_GOTO_OUTPUT
      is true for all supported toolchains.
      
      No functional change intended.
      
      Cc: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220928232015.745948-1-seanjc@google.com
      KVM: nVMX: Reword comments about generating nested CR0/4 read shadows · 4a8fd4a7
      Sean Christopherson authored
      Reword the comments that (attempt to) document nVMX's overrides of the
      CR0/4 read shadows for L2 after calling vmx_set_cr0/4().  The important
      behavior that needs to be documented is that KVM needs to override the
      shadows to account for L1's masks even though the shadows are set by the
      common helpers (and that setting the shadows first would result in the
      correct shadows being clobbered).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Link: https://lore.kernel.org/r/20220831000721.4066617-1-seanjc@google.com
      KVM: x86: Clean up KVM_CAP_X86_USER_SPACE_MSR documentation · 1f158147
      Sean Christopherson authored
      Clean up the KVM_CAP_X86_USER_SPACE_MSR documentation to eliminate
      misleading and/or inconsistent verbiage, and to actually document what
      accesses are intercepted by which flags.
      
        - s/will/may since not all #GPs are guaranteed to be intercepted
        - s/deflect/intercept to align with common KVM terminology
        - s/user space/userspace to align with the majority of KVM docs
        - Avoid using "trap" terminology, as KVM exits to userspace _before_
          stepping, i.e. doesn't exhibit trap-like behavior
        - Actually document the flags
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-4-seanjc@google.com
      KVM: x86: Reword MSR filtering docs to more precisely define behavior · b93d2ec3
      Sean Christopherson authored
      Reword the MSR filtering documentation to more precisely define the
      behavior of filtering using common virtualization terminology.
      
        - Explicitly document KVM's behavior when an MSR is denied
        - s/handled/allowed as there is no guarantee KVM will "handle" the
          MSR access
        - Drop the "fall back" terminology, which incorrectly suggests that
          there is existing KVM behavior to fall back to
        - Fix an off-by-one error in the range (the end is exclusive)
        - Call out the interaction between MSR filtering and
          KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER
        - Delete the redundant paragraph on what '0' and '1' in the bitmap
          means, it's covered by the sections on KVM_MSR_FILTER_{READ,WRITE}
        - Delete the clause on x2APIC MSR behavior depending on APIC base, this
          is covered by stating that KVM follows architectural behavior when
          emulating/virtualizing MSR accesses
      Reported-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-3-seanjc@google.com
      KVM: x86: Delete documentation for READ|WRITE in KVM_X86_SET_MSR_FILTER · 5c8c0b32
      Sean Christopherson authored
      Delete the paragraph that describes the behavior when both
      KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE are set for a range.  There is
      nothing special about KVM's handling of this combination, whereas
      explicitly documenting the combination suggests that there is some magic
      behavior the user needs to be aware of.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-2-seanjc@google.com
      KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS · 2e7eab81
      Jim Mattson authored
      According to Intel's document on Indirect Branch Restricted
      Speculation, "Enabling IBRS does not prevent software from controlling
      the predicted targets of indirect branches of unrelated software
      executed later at the same predictor mode (for example, between two
      different user applications, or two different virtual machines). Such
      isolation can be ensured through use of the Indirect Branch Predictor
      Barrier (IBPB) command." This applies to both basic and enhanced IBRS.
      
      Since L1 and L2 VMs share hardware predictor modes (guest-user and
      guest-kernel), hardware IBRS is not sufficient to virtualize
      IBRS. (The way that basic IBRS is implemented on pre-eIBRS parts,
      hardware IBRS is actually sufficient in practice, even though it isn't
      sufficient architecturally.)
      
      For virtual CPUs that support IBRS, add an indirect branch prediction
      barrier on emulated VM-exit, to ensure that the predicted targets of
      indirect branches executed in L1 cannot be controlled by software that
      was executed in L2.
      
      Since we typically don't intercept guest writes to IA32_SPEC_CTRL,
      perform the IBPB at emulated VM-exit regardless of the current
      IA32_SPEC_CTRL.IBRS value, even though the IBPB could technically be
      deferred until L1 sets IA32_SPEC_CTRL.IBRS, if IA32_SPEC_CTRL.IBRS is
      clear at emulated VM-exit.
      
      This is CVE-2022-2196.
      
      Fixes: 5c911bef ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02")
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221019213620.1953281-3-jmattson@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      KVM: VMX: Guest usage of IA32_SPEC_CTRL is likely · 4f209989
      Jim Mattson authored
      At this point in time, most guests (in the default, out-of-the-box
      configuration) are likely to use IA32_SPEC_CTRL.  Therefore, drop the
      compiler hint that it is unlikely for KVM to be intercepting WRMSR of
      IA32_SPEC_CTRL.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221019213620.1953281-2-jmattson@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      KVM: nVMX: Inject #GP, not #UD, if "generic" VMXON CR0/CR4 check fails · 9cc40932
      Sean Christopherson authored
      Inject #GP if VMXON is attempted with a CR0/CR4 that fails the
      generic "is CRx valid" check, but passes the CR4.VMXE check, and do the
      generic checks _after_ handling the post-VMXON VM-Fail.
      
      The CR4.VMXE check, and all other #UD cases, are special pre-conditions
      that are enforced prior to pivoting on the current VMX mode, i.e. occur
      before interception if VMXON is attempted in VMX non-root mode.
      
      All other CR0/CR4 checks generate #GP and effectively have lower priority
      than the post-VMXON check.
      
      Per the SDM:
      
          IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
              THEN #UD;
          ELSIF not in VMX operation
              THEN
                  IF (CPL > 0) or (in A20M mode) or
                  (the values of CR0 and CR4 are not supported in VMX operation)
                      THEN #GP(0);
          ELSIF in VMX non-root operation
              THEN VMexit;
          ELSIF CPL > 0
              THEN #GP(0);
          ELSE VMfail("VMXON executed in VMX root operation");
          FI;
      
      which, if re-written without ELSIF, yields:
      
          IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
              THEN #UD
      
          IF in VMX non-root operation
              THEN VMexit;
      
          IF CPL > 0
              THEN #GP(0)
      
          IF in VMX operation
              THEN VMfail("VMXON executed in VMX root operation");
      
          IF (in A20M mode) or
             (the values of CR0 and CR4 are not supported in VMX operation)
                      THEN #GP(0);
      
      Note, KVM unconditionally forwards VMXON VM-Exits that occur in L2 to L1,
      i.e. there is no need to check the vCPU is not in VMX non-root mode.  Add
      a comment to explain why unconditionally forwarding such exits is
      functionally correct.
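
The flattened priority order above can be sketched as a plain C function. The struct, enum and function names are hypothetical; the checks simply follow the rewritten pseudocode in order, with the "..." in the #UD condition folded into one boolean.

```c
#include <assert.h>
#include <stdbool.h>

enum vmxon_result {
	VMXON_UD,
	VMXON_VMEXIT,
	VMXON_GP,
	VMXON_VMFAIL,
	VMXON_OK,
};

struct vmxon_state {
	bool ud_precondition;	/* register operand, CR0.PE = 0, CR4.VMXE = 0, ... */
	bool in_vmx_non_root;
	unsigned int cpl;
	bool in_vmx_operation;
	bool a20m;
	bool cr0_cr4_supported;	/* CR0/CR4 values are valid for VMX operation */
};

static enum vmxon_result vmxon_check(const struct vmxon_state *s)
{
	if (s->ud_precondition)
		return VMXON_UD;
	if (s->in_vmx_non_root)
		return VMXON_VMEXIT;
	if (s->cpl > 0)
		return VMXON_GP;
	if (s->in_vmx_operation)
		return VMXON_VMFAIL;
	/* Generic CR0/CR4 checks come last and raise #GP, not #UD. */
	if (s->a20m || !s->cr0_cr4_supported)
		return VMXON_GP;

	return VMXON_OK;
}
```

Note how an unsupported CR0/CR4 value is only reached after the VM-Fail check, which is the ordering the commit establishes.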
      Reported-by: Eric Li <ercli@ucdavis.edu>
      Fixes: c7d855c2 ("KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221006001956.329314-1-seanjc@google.com
      KVM: SVM: Replace kmap_atomic() with kmap_local_page() · a8a12c00
      Zhao Liu authored
      The use of kmap_atomic() is being deprecated in favor of
      kmap_local_page()[1].
      
      The main difference between atomic and local mappings is that local
      mappings don't disable page faults or preemption.
      
      There are two reasons kmap_local_page() can be used here:
      1. SEV is 64-bit only, and kmap_local_page() doesn't disable migration in
      this case.  However, clflush_cache_range() flushes with the CLFLUSHOPT
      instruction, and on x86 CLFLUSHOPT is not CPU-local: it flushes the page
      out of the entire cache hierarchy on all CPUs (APM volume 3, chapter 3,
      CLFLUSHOPT).  So there's no need to disable preemption to keep the flush
      CPU-local.
      2. clflush_cache_range() doesn't need page faults disabled, and the
      mapping remains valid even if the task sleeps.  The same holds when the
      task is scheduled out and back in after being preempted.
      
      In addition, though kmap_local_page() is a thin wrapper around
      page_address() on 64-bit, kmap_local_page() should still be used here in
      preference to page_address(), since page_address() isn't suitable for use
      in a generic function (like sev_clflush_pages()) where the allocation
      source of the passed-in page is not easy to determine. Keeping the kmap* API in
      place means it can be used for things other than highmem mappings[2].
      
      Therefore, sev_clflush_pages() is a function that should use
      kmap_local_page() in place of kmap_atomic().
      
      Convert the calls of kmap_atomic() / kunmap_atomic() to kmap_local_page() /
      kunmap_local().
      
      [1]: https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com
      [2]: https://lore.kernel.org/lkml/5d667258-b58b-3d28-3609-e7914c99b31b@intel.com/
      
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Suggested-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
      Signed-off-by: Zhao Liu <zhao1.liu@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220928092748.463631-1-zhao1.liu@linux.intel.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid · 5c30e810
      Sean Christopherson authored
      Skip the WRMSR fastpath in SVM's VM-Exit handler if the next RIP isn't
      valid, e.g. because KVM is running with nrips=false.  SVM must decode and
      emulate to skip the WRMSR if the CPU doesn't provide the next RIP.
      Getting the instruction bytes to decode the WRMSR requires reading guest
      memory, which in turn means dereferencing memslots, and that isn't safe
      because KVM doesn't hold SRCU when the fastpath runs.
      
      Don't bother trying to enable the fastpath for this case, e.g. by doing
      only the WRMSR and leaving the "skip" until later.  NRIPS is supported on
      all modern CPUs (KVM has considered making it mandatory), and the next
      RIP will be valid the vast, vast majority of the time.
      
        =============================
        WARNING: suspicious RCU usage
        6.0.0-smp--4e557fcd3d80-skip #13 Tainted: G           O
        -----------------------------
        include/linux/kvm_host.h:954 suspicious rcu_dereference_check() usage!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by stable/206475:
         #0: ffff9d9dfebcc0f0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8b/0x620 [kvm]
      
        stack backtrace:
        CPU: 152 PID: 206475 Comm: stable Tainted: G           O       6.0.0-smp--4e557fcd3d80-skip #13
        Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022
        Call Trace:
         <TASK>
         dump_stack_lvl+0x69/0xaa
         dump_stack+0x10/0x12
         lockdep_rcu_suspicious+0x11e/0x130
         kvm_vcpu_gfn_to_memslot+0x155/0x190 [kvm]
         kvm_vcpu_gfn_to_hva_prot+0x18/0x80 [kvm]
         paging64_walk_addr_generic+0x183/0x450 [kvm]
         paging64_gva_to_gpa+0x63/0xd0 [kvm]
         kvm_fetch_guest_virt+0x53/0xc0 [kvm]
         __do_insn_fetch_bytes+0x18b/0x1c0 [kvm]
         x86_decode_insn+0xf0/0xef0 [kvm]
         x86_emulate_instruction+0xba/0x790 [kvm]
         kvm_emulate_instruction+0x17/0x20 [kvm]
         __svm_skip_emulated_instruction+0x85/0x100 [kvm_amd]
         svm_skip_emulated_instruction+0x13/0x20 [kvm_amd]
         handle_fastpath_set_msr_irqoff+0xae/0x180 [kvm]
         svm_vcpu_run+0x4b8/0x5a0 [kvm_amd]
         vcpu_enter_guest+0x16ca/0x22f0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x39d/0x900 [kvm]
         kvm_vcpu_ioctl+0x538/0x620 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
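
A rough userspace sketch of the guard described in this commit, assuming a much simplified exit model. `struct exit_info`, `enum fastpath_result` and `exit_fastpath()` are hypothetical names; only `nrips` and the idea of a CPU-provided next RIP come from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum fastpath_result {
	FASTPATH_NONE,		/* fall back to the normal exit handling */
	FASTPATH_HANDLED,	/* WRMSR handled and skipped in the fastpath */
};

struct exit_info {
	bool is_wrmsr;		/* exit was caused by a WRMSR */
	uint64_t next_rip;	/* 0 if the CPU did not provide the next RIP */
};

/*
 * Without a valid next RIP, skipping the WRMSR would require decoding
 * guest memory, which is not safe in the fastpath.  Bail out early.
 */
static enum fastpath_result exit_fastpath(const struct exit_info *e, bool nrips)
{
	if (!nrips || !e->next_rip)
		return FASTPATH_NONE;

	return e->is_wrmsr ? FASTPATH_HANDLED : FASTPATH_NONE;
}
```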
      
      Fixes: 404d5d7b ("KVM: X86: Introduce more exit_fastpath_completion enum values")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220930234031.1732249-1-seanjc@google.com
      KVM: x86: Fail emulation during EMULTYPE_SKIP on any exception · 17122c06
      Sean Christopherson authored
      Treat any exception during instruction decode for EMULTYPE_SKIP as a
      "full" emulation failure, i.e. signal failure instead of queuing the
      exception.  When decoding purely to skip an instruction, KVM and/or the
      CPU has already done some amount of emulation that cannot be unwound,
      e.g. on an EPT misconfig VM-Exit KVM has already processed the emulated
      MMIO.  KVM already does this if a #UD is encountered, but not for other
      exceptions, e.g. if a #PF is encountered during fetch.
      
      In SVM's soft-injection use case, queueing the exception is particularly
      problematic as queueing exceptions while injecting events can put KVM
      into an infinite loop due to bailing from VM-Enter to service the newly
      pending exception.  E.g. multiple warnings to detect such behavior fire:
      
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
        Modules linked in: kvm_amd ccp kvm irqbypass
        CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
        Call Trace:
         kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
         __x64_sys_ioctl+0x85/0xc0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        ---[ end trace 0000000000000000 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
        Modules linked in: kvm_amd ccp kvm irqbypass
        CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G        W          6.0.0-rc1+ #220
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
        Call Trace:
         kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
         __x64_sys_ioctl+0x85/0xc0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com
      17122c06
    • Peng Hao's avatar
      KVM: x86: Keep the lock order consistent between SRCU and gpc spinlock · 4265df66
      Peng Hao authored
      Acquire SRCU before taking the gpc spinlock in wait_pending_event() so as
      to be consistent with all other functions that acquire both locks.  It's
      not illegal to acquire SRCU inside a spinlock, nor is there deadlock
      potential, but in general it's preferable to order locks from least
      restrictive to most restrictive, e.g. if wait_pending_event() needed to
      sleep for whatever reason, it could do so while holding SRCU, but would
      need to drop the spinlock.
      Signed-off-by: Peng Hao <flyingpeng@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/CAPm50a++Cb=QfnjMZ2EnCj-Sb9Y4UM-=uOEtHAcjnNLCAAf-dQ@mail.gmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      4265df66
  3. 30 Nov, 2022 7 commits
    • Sean Christopherson's avatar
      KVM: VMX: Resume guest immediately when injecting #GP on ECREATE · eb3992e8
      Sean Christopherson authored
      Resume the guest immediately when injecting a #GP on ECREATE due to an
      invalid enclave size, i.e. don't attempt ECREATE in the host.  The #GP is
      a terminal fault, e.g. skipping the instruction if ECREATE is successful
      would result in KVM injecting #GP on the instruction following ECREATE.
      
      Fixes: 70210c04 ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
      Cc: stable@vger.kernel.org
      Cc: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kai Huang <kai.huang@intel.com>
      Link: https://lore.kernel.org/r/20220930233132.1723330-1-seanjc@google.com
      eb3992e8
    • Paolo Bonzini's avatar
      KVM: x86: fix uninitialized variable use on KVM_REQ_TRIPLE_FAULT · df0bb47b
      Paolo Bonzini authored
      If a triple fault was fixed by kvm_x86_ops.nested_ops->triple_fault (by
      turning it into a vmexit), there is no need to leave vcpu_enter_guest().
      Any vcpu->requests will be caught later before the actual vmentry,
      and in fact vcpu_enter_guest() was not initializing the "r" variable.
      Depending on the compiler's whims, this could cause the
      x86_64/triple_fault_event_test test to fail.
      
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: 92e7d5c8 ("KVM: x86: allow L1 to not intercept triple fault")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      df0bb47b
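      The control flow described in this entry can be sketched roughly as
      follows.  This is a standalone illustration, not the kernel code: the
      struct, the request bit, and every function name are hypothetical
      stand-ins, and the return values (0 to exit, 1 to continue) are an
      assumption made for the sketch.  The point is that "r" is assigned on
      every path that reaches the return, and that a triple fault handled by
      the nested hook does not leave the function early.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request bit; the real request is KVM_REQ_TRIPLE_FAULT. */
#define REQ_TRIPLE_FAULT (1u << 0)

/* Hypothetical, heavily reduced stand-in for per-vCPU state. */
struct vcpu_sketch {
	unsigned int requests;
	bool in_guest_mode;   /* a nested guest is running */
	bool hook_handles;    /* nested hook turns the fault into a vmexit */
	bool shutdown;        /* set when the fault is reported as shutdown */
};

static bool test_request(const struct vcpu_sketch *v, unsigned int req)
{
	return (v->requests & req) != 0;
}

/* Stand-in for the nested triple_fault hook: clears the request if handled. */
static void nested_triple_fault(struct vcpu_sketch *v)
{
	if (v->hook_handles)
		v->requests &= ~REQ_TRIPLE_FAULT;
}

/* Returns 0 when the caller must exit, 1 when guest entry may proceed. */
static int enter_guest_sketch(struct vcpu_sketch *v)
{
	int r;

	if (test_request(v, REQ_TRIPLE_FAULT)) {
		if (v->in_guest_mode)
			nested_triple_fault(v);

		/* Only bail out if the request is still pending. */
		if (test_request(v, REQ_TRIPLE_FAULT)) {
			v->requests &= ~REQ_TRIPLE_FAULT;
			v->shutdown = true;
			r = 0;
			goto out;
		}
		/* Handled by the hook: fall through and keep going. */
	}

	r = 1;
out:
	return r;
}
```

      With the hook handling the fault, the sketch continues towards guest
      entry; without it, the fault is reported as a shutdown and the function
      returns 0.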
    • Michal Luczaj's avatar
      KVM: x86: Remove unused argument in gpc_unmap_khva() · c1a81f3b
      Michal Luczaj authored
      Remove the unused @kvm argument from gpc_unmap_khva().
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c1a81f3b
    • Michal Luczaj's avatar
      KVM: Shorten gfn_to_pfn_cache function names · aba3caef
      Michal Luczaj authored
      Formalize "gpc" as the acronym and use it in function names.
      
      No functional change intended.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Michal Luczaj <mhal@rbox.co>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aba3caef
    • David Woodhouse's avatar
      KVM: x86/xen: Add runstate tests for 32-bit mode and crossing page boundary · 8acc3518
      David Woodhouse authored
      Torture test the cases where the runstate crosses a page boundary, and
      especially the case where it's configured in 32-bit mode and doesn't,
      but then switching to 64-bit mode makes it go onto the second page.
      
      To simplify this, make the KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST ioctl
      also update the guest runstate area. It already did so if the actual
      runstate changed, as a side-effect of kvm_xen_update_runstate(). So
      doing it in the plain adjustment case is making it more consistent, as
      well as giving us a nice way to trigger the update without actually
      running the vCPU again and changing the values.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8acc3518
    • David Woodhouse's avatar
      KVM: x86/xen: Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured · d8ba8ba4
      David Woodhouse authored
      Closer inspection of the Xen code shows that we aren't supposed to be
      using the XEN_RUNSTATE_UPDATE flag unconditionally. It should be
      explicitly enabled by guests through the HYPERVISOR_vm_assist hypercall.
      If we randomly set the top bit of ->state_entry_time for a guest that
      hasn't asked for it and doesn't expect it, that could make the runtimes
      fail to add up and confuse the guest. Without the flag it's perfectly
      safe for a vCPU to read its own vcpu_runstate_info; just not for one
      vCPU to read *another's*.
      
      I briefly pondered adding a word for the whole set of VMASST_TYPE_*
      flags but the only one we care about for HVM guests is this, so it
      seemed a bit pointless.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20221127122210.248427-3-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d8ba8ba4
    • David Woodhouse's avatar
      KVM: x86/xen: Compatibility fixes for shared runstate area · 5ec3289b
      David Woodhouse authored
      The guest runstate area can be arbitrarily byte-aligned. In fact, even
      when a sane 32-bit guest aligns the overall structure nicely, the 64-bit
      fields in the structure end up being unaligned due to the fact that the
      32-bit ABI only aligns them to 32 bits.
      
      So setting the ->state_entry_time field to something|XEN_RUNSTATE_UPDATE
      is buggy, because if it's unaligned then we can't update the whole field
      atomically; the low bytes might be observable before the _UPDATE bit is.
      Xen actually updates the *byte* containing that top bit, on its own. KVM
      should do the same.
      
      In addition, we cannot assume that the runstate area fits within a single
      page. One option might be to make the gfn_to_pfn cache cope with regions
      that cross a page — but getting a contiguous virtual kernel mapping of a
      discontiguous set of IOMEM pages is a distinctly non-trivial exercise,
      and it seems this is the *only* current use case for the GPC which would
      benefit from it.
      
      An earlier version of the runstate code did use a gfn_to_hva cache for
      this purpose, but it still had the single-page restriction because it
      used the uhva directly — because it needs to be able to do so atomically
      when the vCPU is being scheduled out, so it used pagefault_disable()
      around the accesses and didn't just use kvm_write_guest_cached() which
      has a fallback path.
      
      So... use a pair of GPCs for the first and potential second page covering
      the runstate area. We can get away with locking both at once because
      nothing else takes more than one GPC lock at a time so we can invent
      a trivial ordering rule.
      
      The common case where it's all in the same page is kept as a fast path,
      but in both cases, the actual guest structure (compat or not) is built
      up from the fields in @vx, following preset pointers to the state and
      times fields. The only difference is whether those pointers point to
      the kernel stack (in the split case) or to guest memory directly via
      the GPC.  The fast path is also fixed to use a byte access for the
      XEN_RUNSTATE_UPDATE bit, then the only real difference is the dual
      memcpy.
      
      Finally, Xen also does write the runstate area immediately when it's
      configured. Flip the kvm_xen_update_runstate() and …_guest() functions
      and call the latter directly when the runstate area is set. This means
      that other ioctls which modify the runstate also write it immediately
      to the guest when they do so, which is also intended.
      
      Update the xen_shinfo_test to exercise the pathological case where the
      XEN_RUNSTATE_UPDATE flag in the top byte of the state_entry_time is
      actually in a different page to the rest of the 64-bit word.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5ec3289b
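      The byte-wise handling of the XEN_RUNSTATE_UPDATE bit described in this
      entry can be illustrated with a small userspace sketch.  It assumes a
      little-endian layout (as on x86), so that bit 63 of state_entry_time is
      bit 7 of the field's last byte.  The helper names are hypothetical, and
      the sketch leaves out the memory barriers and guest-mapping logic that
      real code needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Top bit of state_entry_time, per the description above. */
#define XEN_RUNSTATE_UPDATE (1ULL << 63)

/*
 * Locate the single byte that holds bit 63 of a 64-bit little-endian
 * field.  The field may sit at any byte alignment.
 */
static uint8_t *runstate_update_byte(void *state_entry_time)
{
	assert(state_entry_time != NULL);
	return (uint8_t *)state_entry_time + sizeof(uint64_t) - 1;
}

/* Set the flag with a one-byte store, so it cannot be torn. */
static void runstate_set_update_flag(void *state_entry_time)
{
	*runstate_update_byte(state_entry_time) |= 0x80;
}

/* Clear the flag, again touching only the top byte. */
static void runstate_clear_update_flag(void *state_entry_time)
{
	*runstate_update_byte(state_entry_time) &= 0x7f;
}

/* Read the possibly unaligned field without an unaligned load. */
static uint64_t runstate_read_entry_time(const void *state_entry_time)
{
	uint64_t v;

	memcpy(&v, state_entry_time, sizeof(v));
	return v;
}
```

      Because only one byte is written, the flag update stays a single store
      even when the 64-bit field is unaligned or when its top byte lands on a
      different page from the rest of the word.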
  4. 28 Nov, 2022 3 commits
    • Paolo Bonzini's avatar
      Merge tag 'kvm-s390-next-6.2-1' of... · 1e79a9e3
      Paolo Bonzini authored
      Merge tag 'kvm-s390-next-6.2-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
      
      - Second batch of the lazy destroy patches
      - First batch of KVM changes for kernel virtual != physical address support
      - Removal of an unused function
      1e79a9e3
    • Jiaxi Chen's avatar
      KVM: x86: Advertise PREFETCHIT0/1 CPUID to user space · 29c46979
      Jiaxi Chen authored
      The latest Intel platform, Granite Rapids, introduces a new pair of
      instructions, PREFETCHIT0/1, which move code to memory (cache) closer to
      the processor depending on specific hints.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EDX[bit 14]
      
      PREFETCHIT0/1 is on a KVM-only subleaf, so add an X86_FEATURE definition
      for this feature bit to direct it to the KVM entry.
      
      Advertise PREFETCHIT0/1 to KVM userspace. This is safe because there are
      no new VMX controls or additional host enabling required for guests to
      use this feature.
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Message-Id: <20221125125845.1182922-9-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      29c46979
    • Jiaxi Chen's avatar
      KVM: x86: Advertise AVX-NE-CONVERT CPUID to user space · 9977f087
      Jiaxi Chen authored
      AVX-NE-CONVERT is a new set of instructions which can convert low
      precision floating point like BF16/FP16 to high precision floating point
      FP32, and can also convert FP32 elements to BF16. These instructions
      allow the platform to have improved AI capabilities and better
      compatibility.
      
      The bit definition:
      CPUID.(EAX=7,ECX=1):EDX[bit 5]
      
      AVX-NE-CONVERT is on a KVM-only subleaf, so add an X86_FEATURE definition
      for this feature bit to direct it to the KVM entry.
      
      Advertise AVX-NE-CONVERT to KVM userspace. This is safe because there
      are no new VMX controls or additional host enabling required for guests
      to use this feature.
      Signed-off-by: Jiaxi Chen <jiaxi.chen@linux.intel.com>
      Message-Id: <20221125125845.1182922-8-jiaxi.chen@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9977f087
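      As a rough illustration of the two bit definitions quoted in the entries
      above, CPUID.(EAX=7,ECX=1):EDX[bit 14] and [bit 5], a userspace program
      could test them as sketched below.  The macro and function names are
      hypothetical and are not the kernel's X86_FEATURE definitions; the
      helper takes the EDX value as a parameter so that it does not depend on
      running on any particular CPU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CPUID.(EAX=7,ECX=1):EDX, as quoted in the commits. */
#define SKETCH_EDX_AVX_NE_CONVERT_BIT  5   /* hypothetical macro name */
#define SKETCH_EDX_PREFETCHIT_BIT      14  /* hypothetical macro name */

/* Return true if the given bit is set in an EDX value from leaf 7.1. */
static bool cpuid_7_1_edx_has(uint32_t edx, unsigned int bit)
{
	return ((edx >> bit) & 1u) != 0;
}

/* Convenience wrappers for the two features discussed above. */
static bool has_avx_ne_convert(uint32_t edx)
{
	return cpuid_7_1_edx_has(edx, SKETCH_EDX_AVX_NE_CONVERT_BIT);
}

static bool has_prefetchit(uint32_t edx)
{
	return cpuid_7_1_edx_has(edx, SKETCH_EDX_PREFETCHIT_BIT);
}
```

      On x86 with GCC or Clang, the EDX value could be obtained with
      __get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx) from <cpuid.h>.  A VMM
      would more likely read the bits from the CPUID data KVM reports as
      supported.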