1. 02 Dec, 2022 15 commits
    • Sean Christopherson's avatar
      tools: KVM: selftests: Convert clear/set_bit() to actual atomics · bb056c0f
      Sean Christopherson authored
      Convert {clear,set}_bit() to atomics as KVM's ucall implementation relies
      on clear_bit() being atomic, they are defined in atomic.h, and the same
      helpers in the kernel proper are atomic.
      
      KVM's ucall infrastructure is the only user of clear_bit() in tools/, and
      there are no true set_bit() users.  tools/testing/nvdimm/ does make heavy
      use of set_bit(), but that code builds into a kernel module of sorts, i.e.
      pulls in all of the kernel's header and so is already getting the kernel's
      atomic set_bit().
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221119013450.2643007-10-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      bb056c0f
    • Sean Christopherson's avatar
      tools: Drop "atomic_" prefix from atomic test_and_set_bit() · 36293352
      Sean Christopherson authored
      Drop the "atomic_" prefix from tools' atomic_test_and_set_bit() to
      match the kernel nomenclature where test_and_set_bit() is atomic,
      and __test_and_set_bit() provides the non-atomic variant.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221119013450.2643007-9-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      36293352
    • Sean Christopherson's avatar
      tools: Drop conflicting non-atomic test_and_{clear,set}_bit() helpers · 7f32a6cf
      Sean Christopherson authored
      Drop tools' non-atomic test_and_set_bit() and test_and_clear_bit() helpers
      now that all users are gone.  The names will be claimed in the future for
      atomic versions.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221119013450.2643007-8-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7f32a6cf
    • Sean Christopherson's avatar
      KVM: selftests: Use non-atomic clear/set bit helpers in KVM tests · 03a0c819
      Sean Christopherson authored
      Use the dedicated non-atomic helpers for {clear,set}_bit() and their
      test variants, i.e. the double-underscore versions.  Depsite being
      defined in atomic.h, and despite the kernel versions being atomic in the
      kernel, tools' {clear,set}_bit() helpers aren't actually atomic.  Move
      to the double-underscore versions so that the versions that are expected
      to be atomic (for kernel developers) can be made atomic without affecting
      users that don't want atomic operations.
      
      Leave the usage in ucall_free() as-is, it's the one place in tools/ that
      actually wants/needs atomic behavior.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221119013450.2643007-7-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      03a0c819
    • Sean Christopherson's avatar
      perf tools: Use dedicated non-atomic clear/set bit helpers · 75d7ba32
      Sean Christopherson authored
      Use the dedicated non-atomic helpers for {clear,set}_bit() and their
      test variants, i.e. the double-underscore versions.  Depsite being
      defined in atomic.h, and despite the kernel versions being atomic in the
      kernel, tools' {clear,set}_bit() helpers aren't actually atomic.  Move
      to the double-underscore versions so that the versions that are expected
      to be atomic (for kernel developers) can be made atomic without affecting
      users that don't want atomic operations.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Acked-by: default avatarNamhyung Kim <namhyung@kernel.org>
      Message-Id: <20221119013450.2643007-6-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      75d7ba32
    • Sean Christopherson's avatar
      tools: Take @bit as an "unsigned long" in {clear,set}_bit() helpers · 7f2b47f2
      Sean Christopherson authored
      Take @bit as an unsigned long instead of a signed int in clear_bit() and
      set_bit() so that they match the double-underscore versions, __clear_bit()
      and __set_bit().  This will allow converting users that really don't want
      atomic operations to the double-underscores without introducing a
      functional change, which will in turn allow making {clear,set}_bit()
      atomic (as advertised).
      
      Practically speaking, this _should_ have no functional impact.  KVM's
      selftests usage is either hardcoded (Hyper-V tests) or is artificially
      limited (arch_timer test and dirty_log test).  In KVM, dirty_log test is
      the only mildly interesting case as it's use indirectly restricted to
      unsigned 32-bit values, but in theory it could generate a negative value
      when cast to a signed int.  But in that case, taking an "unsigned long"
      is actually a bug fix.
      
      Perf's usage is more difficult to audit, but any code that is affected
      by the switch is likely already broken.  perf_header__{set,clear}_feat()
      and perf_file_header__read() effectively use only hardcoded enums with
      small, positive values, atom_new() passes an unsigned long, but its value
      is capped at 128 via NR_ATOM_PER_PAGE, etc...
      
      The only real potential for breakage is in the perf flows that take a
      "cpu", but it's unlikely perf is subtly relying on a negative index into
      bitmaps, e.g. "cpu" can be "-1", but only as "not valid" placeholder.
      
      Note, tools/testing/nvdimm/ makes heavy use of set_bit(), but that code
      builds into a kernel module of sorts, i.e. pulls in all of the kernel's
      header and so is getting the kernel's atomic set_bit().  The NVDIMM test
      usage of atomics is likely unnecessary, e.g. ndtest_dimm_register() sets
      bits in a local variable, but that's neither here nor there as far as
      this change is concerned.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221119013450.2643007-5-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      7f2b47f2
    • Sean Christopherson's avatar
      KVM: arm64: selftests: Enable single-step without a "full" ucall() · ef16b2df
      Sean Christopherson authored
      Add a new ucall hook, GUEST_UCALL_NONE(), to allow tests to make ucalls
      without allocating a ucall struct, and use it to enable single-step
      in ARM's debug-exceptions test.  Like the disable single-step path, the
      enabling path also needs to ensure that no exclusive access sequences are
      attempted after enabling single-step, as the exclusive monitor is cleared
      on ERET from the debug exception taken to EL2.
      
      The test currently "works" because clear_bit() isn't actually an atomic
      operation... yet.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221119013450.2643007-4-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ef16b2df
    • Yuan ZhaoXiong's avatar
      KVM: x86: fix APICv/x2AVIC disabled when vm reboot by itself · ef407577
      Yuan ZhaoXiong authored
      When a VM reboots itself, the reset process will result in
      an ioctl(KVM_SET_LAPIC, ...) to disable x2APIC mode and set
      the xAPIC id of the vCPU to its default value, which is the
      vCPU id.
      
      That will be handled in KVM as follows:
      
           kvm_vcpu_ioctl_set_lapic
             kvm_apic_set_state
      	  kvm_lapic_set_base  =>  disable X2APIC mode
      	    kvm_apic_state_fixup
      	      kvm_lapic_xapic_id_updated
      	        kvm_xapic_id(apic) != apic->vcpu->vcpu_id
      		kvm_set_apicv_inhibit(APICV_INHIBIT_REASON_APIC_ID_MODIFIED)
      	   memcpy(vcpu->arch.apic->regs, s->regs, sizeof(*s))  => update APIC_ID
      
      When kvm_apic_set_state invokes kvm_lapic_set_base to disable
      x2APIC mode, the old 32-bit x2APIC id is still present rather
      than the 8-bit xAPIC id.  kvm_lapic_xapic_id_updated will set the
      APICV_INHIBIT_REASON_APIC_ID_MODIFIED bit and disable APICv/x2AVIC.
      
      Instead, kvm_lapic_xapic_id_updated must be called after APIC_ID is
      changed.
      
      In fact, this fixes another small issue in the code in that
      potential changes to a vCPU's xAPIC ID need not be tracked for
      KVM_GET_LAPIC.
      
      Fixes: 3743c2f0 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base")
      Signed-off-by: default avatarYuan ZhaoXiong <yuanzhaoxiong@baidu.com>
      Message-Id: <1669984574-32692-1-git-send-email-yuanzhaoxiong@baidu.com>
      Cc: stable@vger.kernel.org
      Reported-by: default avatarAlejandro Jimenez <alejandro.j.jimenez@oracle.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      ef407577
    • Sean Christopherson's avatar
      KVM: Remove stale comment about KVM_REQ_UNHALT · dd03cc90
      Sean Christopherson authored
      Remove a comment about KVM_REQ_UNHALT being set by kvm_vcpu_check_block()
      that was missed when KVM_REQ_UNHALT was dropped.
      
      Fixes: c59fb127 ("KVM: remove KVM_REQ_UNHALT")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Message-Id: <20221201220433.31366-1-seanjc@google.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      dd03cc90
    • Paolo Bonzini's avatar
      Merge tag 'kvm-x86-fixes-6.2-1' of https://github.com/kvm-x86/linux into HEAD · b3761445
      Paolo Bonzini authored
      Misc KVM x86 fixes and cleanups for 6.2:
      
       - One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
      
       - Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
         years back when eliminating unnecessary barriers when switching between
         vmcs01 and vmcs02.
      
       - Clean up the MSR filter docs.
      
       - Clean up vmread_error_trampoline() to make it more obvious that params
         must be passed on the stack, even for x86-64.
      
       - Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
         of the current guest CPUID.
      
       - Fudge around a race with TSC refinement that results in KVM incorrectly
         thinking a guest needs TSC scaling when running on a CPU with a
         constant TSC, but no hardware-enumerated TSC frequency.
      b3761445
    • Paolo Bonzini's avatar
      Merge tag 'kvm-selftests-6.2-2' of https://github.com/kvm-x86/linux into HEAD · 44bc6115
      Paolo Bonzini authored
      KVM selftests fixes for 6.2
      
       - Fix an inverted check in the access tracking perf test, and restore
         support for asserting that there aren't too many idle pages when
         running on bare metal.
      
       - Fix an ordering issue in the AMX test introduced by recent conversions
         to use kvm_cpu_has(), and harden the code to guard against similar bugs
         in the future.  Anything that tiggers caching of KVM's supported CPUID,
         kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
         the caching occurs before the test opts in via prctl().
      
       - Fix build errors that occur in certain setups (unsure exactly what is
         unique about the problematic setup) due to glibc overriding
         static_assert() to a variant that requires a custom message.
      44bc6115
    • Javier Martinez Canillas's avatar
      KVM: Add missing arch for KVM_CREATE_DEVICE and KVM_{SET,GET}_DEVICE_ATTR · 10c5e80b
      Javier Martinez Canillas authored
      The ioctls are missing an architecture property that is present in others.
      Suggested-by: default avatarSergio Lopez Pascual <slp@redhat.com>
      Signed-off-by: default avatarJavier Martinez Canillas <javierm@redhat.com>
      Message-Id: <20221202105011.185147-5-javierm@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      10c5e80b
    • Javier Martinez Canillas's avatar
      KVM: Reference to kvm_userspace_memory_region in doc and comments · 30ee198c
      Javier Martinez Canillas authored
      There are still references to the removed kvm_memory_region data structure
      but the doc and comments should mention struct kvm_userspace_memory_region
      instead, since that is what's used by the ioctl that replaced the old one
      and this data structure support the same set of flags.
      Signed-off-by: default avatarJavier Martinez Canillas <javierm@redhat.com>
      Message-Id: <20221202105011.185147-4-javierm@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      30ee198c
    • Javier Martinez Canillas's avatar
      KVM: Delete all references to removed KVM_SET_MEMORY_ALIAS ioctl · 66a9221d
      Javier Martinez Canillas authored
      The documentation says that the ioctl has been deprecated, but it has been
      actually removed and the remaining references are just left overs.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarJavier Martinez Canillas <javierm@redhat.com>
      Message-Id: <20221202105011.185147-3-javierm@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      66a9221d
    • Javier Martinez Canillas's avatar
      KVM: Delete all references to removed KVM_SET_MEMORY_REGION ioctl · 61e15f87
      Javier Martinez Canillas authored
      The documentation says that the ioctl has been deprecated, but it has been
      actually removed and the remaining references are just left overs.
      Suggested-by: default avatarSean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarJavier Martinez Canillas <javierm@redhat.com>
      Message-Id: <20221202105011.185147-2-javierm@redhat.com>
      Signed-off-by: default avatarPaolo Bonzini <pbonzini@redhat.com>
      61e15f87
  2. 01 Dec, 2022 23 commits
    • Sean Christopherson's avatar
      KVM: selftests: Define and use a custom static assert in lib headers · 0c326523
      Sean Christopherson authored
      Define and use kvm_static_assert() in the common KVM selftests headers to
      provide deterministic behavior, and to allow creating static asserts
      without dummy messages.
      
      The kernel's static_assert() makes the message param optional, and on the
      surface, tools/include/linux/build_bug.h appears to follow suit.  However,
      glibc may override static_assert() and redefine it as a direct alias of
      _Static_assert(), which makes the message parameter mandatory.  This leads
      to non-deterministic behavior as KVM selftests code that utilizes
      static_assert() without a custom message may or not compile depending on
      the order of includes.  E.g. recently added asserts in
      x86_64/processor.h fail on some systems with errors like
      
        In file included from lib/memstress.c:11:0:
        include/x86_64/processor.h: In function ‘this_cpu_has_p’:
        include/x86_64/processor.h:193:34: error: expected ‘,’ before ‘)’ token
          static_assert(low_bit < high_bit);     \
                                          ^
      due to _Static_assert() expecting a comma before a message.  The "message
      optional" version of static_assert() uses macro magic to strip away the
      comma when presented with empty an __VA_ARGS__
      
        #ifndef static_assert
        #define static_assert(expr, ...) __static_assert(expr, ##__VA_ARGS__, #expr)
        #define __static_assert(expr, msg, ...) _Static_assert(expr, msg)
        #endif // static_assert
      
      and effectively generates "_Static_assert(expr, #expr)".
      
      The incompatible version of static_assert() gets defined by this snippet
      in /usr/include/assert.h:
      
        #if defined __USE_ISOC11 && !defined __cplusplus
        # undef static_assert
        # define static_assert _Static_assert
        #endif
      
      which yields "_Static_assert(expr)" and thus fails as above.
      
      KVM selftests don't actually care about using C11, but __USE_ISOC11 gets
      defined because of _GNU_SOURCE, which many tests do #define.  _GNU_SOURCE
      triggers a massive pile of defines in /usr/include/features.h, including
      _ISOC11_SOURCE:
      
        /* If _GNU_SOURCE was defined by the user, turn on all the other features.  */
        #ifdef _GNU_SOURCE
        # undef  _ISOC95_SOURCE
        # define _ISOC95_SOURCE 1
        # undef  _ISOC99_SOURCE
        # define _ISOC99_SOURCE 1
        # undef  _ISOC11_SOURCE
        # define _ISOC11_SOURCE 1
        # undef  _POSIX_SOURCE
        # define _POSIX_SOURCE  1
        # undef  _POSIX_C_SOURCE
        # define _POSIX_C_SOURCE        200809L
        # undef  _XOPEN_SOURCE
        # define _XOPEN_SOURCE  700
        # undef  _XOPEN_SOURCE_EXTENDED
        # define _XOPEN_SOURCE_EXTENDED 1
        # undef  _LARGEFILE64_SOURCE
        # define _LARGEFILE64_SOURCE    1
        # undef  _DEFAULT_SOURCE
        # define _DEFAULT_SOURCE        1
        # undef  _ATFILE_SOURCE
        # define _ATFILE_SOURCE 1
        #endif
      
      which further down in /usr/include/features.h leads to:
      
        /* This is to enable the ISO C11 extension.  */
        #if (defined _ISOC11_SOURCE \
             || (defined __STDC_VERSION__ && __STDC_VERSION__ >= 201112L))
        # define __USE_ISOC11   1
        #endif
      
      To make matters worse, /usr/include/assert.h doesn't guard against
      multiple inclusion by turning itself into a nop, but instead #undefs a
      few macros and continues on.  As a result, it's all but impossible to
      ensure the "message optional" version of static_assert() will actually be
      used, e.g. explicitly including assert.h and #undef'ing static_assert()
      doesn't work as a later inclusion of assert.h will again redefine its
      version.
      
        #ifdef  _ASSERT_H
      
        # undef _ASSERT_H
        # undef assert
        # undef __ASSERT_VOID_CAST
      
        # ifdef __USE_GNU
        #  undef assert_perror
        # endif
      
        #endif /* assert.h      */
      
        #define _ASSERT_H       1
        #include <features.h>
      
      Fixes: fcba483e ("KVM: selftests: Sanity check input to ioctls() at build time")
      Fixes: ee379553 ("KVM: selftests: Refactor X86_FEATURE_* framework to prep for X86_PROPERTY_*")
      Fixes: 53a7dc0f ("KVM: selftests: Add X86_PROPERTY_* framework to retrieve CPUID values")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221122013309.1872347-1-seanjc@google.com
      0c326523
    • Sean Christopherson's avatar
      KVM: selftests: Do kvm_cpu_has() checks before creating VM+vCPU · 553d1652
      Sean Christopherson authored
      Move the AMX test's kvm_cpu_has() checks before creating the VM+vCPU,
      there are no dependencies between the two operations.  Opportunistically
      add a comment to call out that enabling off-by-default XSAVE-managed
      features must be done before KVM_GET_SUPPORTED_CPUID is cached.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-5-seanjc@google.com
      553d1652
    • Sean Christopherson's avatar
      KVM: selftests: Disallow "get supported CPUID" before REQ_XCOMP_GUEST_PERM · cd5f3d21
      Sean Christopherson authored
      Disallow using kvm_get_supported_cpuid() and thus caching KVM's supported
      CPUID info before enabling XSAVE-managed features that are off-by-default
      and must be enabled by ARCH_REQ_XCOMP_GUEST_PERM.  Caching the supported
      CPUID before all XSAVE features are enabled can result in false negatives
      due to testing features that were cached before they were enabled.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-4-seanjc@google.com
      cd5f3d21
    • Sean Christopherson's avatar
      KVM: selftests: Move __vm_xsave_require_permission() below CPUID helpers · 2ceade1d
      Sean Christopherson authored
      Move __vm_xsave_require_permission() below the CPUID helpers so that a
      future change can reference the cached result of KVM_GET_SUPPORTED_CPUID
      while keeping the definition of the variable close to its intended user,
      kvm_get_supported_cpuid().
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-3-seanjc@google.com
      2ceade1d
    • Lei Wang's avatar
      KVM: selftests: Move XFD CPUID checking out of __vm_xsave_require_permission() · 18eee7bf
      Lei Wang authored
      Move the kvm_cpu_has() check on X86_FEATURE_XFD out of the helper to
      enable off-by-default XSAVE-managed features and into the one test that
      currenty requires XFD (XFeature Disable) support.   kvm_cpu_has() uses
      kvm_get_supported_cpuid() and thus caches KVM_GET_SUPPORTED_CPUID, and so
      using kvm_cpu_has() before ARCH_REQ_XCOMP_GUEST_PERM effectively results
      in the test caching stale values, e.g. subsequent checks on AMX_TILE will
      get false negatives.
      
      Although off-by-default features are nonsensical without XFD, checking
      for XFD virtualization prior to enabling such features isn't strictly
      required.
      Signed-off-by: default avatarLei Wang <lei4.wang@intel.com>
      Fixes: 7fbb653e ("KVM: selftests: Check KVM's supported CPUID, not host CPUID, for XFD")
      Link: https://lore.kernel.org/r/20221125023839.315207-1-lei4.wang@intel.com
      [sean: add Fixes, reword changelog]
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221128225735.3291648-2-seanjc@google.com
      18eee7bf
    • Sean Christopherson's avatar
      KVM: selftests: Restore assert for non-nested VMs in access tracking test · 8fcee042
      Sean Christopherson authored
      Restore the assert (on x86-64) that <10% of pages are still idle when NOT
      running as a nested VM in the access tracking test.  The original assert
      was converted to a "warning" to avoid false failures when running the
      test in a VM, but the non-nested case does not suffer from the same
      "infinite TLB size" issue.
      
      Using the HYPERVISOR flag isn't infallible as VMMs aren't strictly
      required to enumerate the "feature" in CPUID, but practically speaking
      anyone that is running KVM selftests in VMs is going to be using a VMM
      and hypervisor that sets the HYPERVISOR flag.
      
      Cc: David Matlack <dmatlack@google.com>
      Reviewed-by: default avatarEmanuele Giuseppe Esposito <eesposit@redhat.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221129175300.4052283-3-seanjc@google.com
      8fcee042
    • Sean Christopherson's avatar
      KVM: selftests: Fix inverted "warning" in access tracking perf test · a33004e8
      Sean Christopherson authored
      Warn if the number of idle pages is greater than or equal to 10% of the
      total number of pages, not if the percentage of idle pages is less than
      10%.  The original code asserted that less than 10% of pages were still
      idle, but the check got inverted when the assert was converted to a
      warning.
      
      Opportunistically clean up the warning; selftests are 64-bit only, there
      is no need to use "%PRIu64" instead of "%lu".
      
      Fixes: 6336a810 ("KVM: selftests: replace assertion with warning in access_tracking_perf_test")
      Reviewed-by: default avatarEmanuele Giuseppe Esposito <eesposit@redhat.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221129175300.4052283-2-seanjc@google.com
      a33004e8
    • Anton Romanov's avatar
      KVM: x86: Use current rather than snapshotted TSC frequency if it is constant · 3ebcbd22
      Anton Romanov authored
      Don't snapshot tsc_khz into per-cpu cpu_tsc_khz if the host TSC is
      constant, in which case the actual TSC frequency will never change and thus
      capturing TSC during initialization is unnecessary, KVM can simply use
      tsc_khz.  This value is snapshotted from
      kvm_timer_init->kvmclock_cpu_online->tsc_khz_changed(NULL)
      
      On CPUs with constant TSC, but not a hardware-specified TSC frequency,
      snapshotting cpu_tsc_khz and using that to set a VM's target TSC frequency
      can lead to VM to think its TSC frequency is not what it actually is if
      refining the TSC completes after KVM snapshots tsc_khz.  The actual
      frequency never changes, only the kernel's calculation of what that
      frequency is changes.
      
      Ideally, KVM would not be able to race with TSC refinement, or would have
      a hook into tsc_refine_calibration_work() to get an alert when refinement
      is complete.  Avoiding the race altogether isn't practical as refinement
      takes a relative eternity; it's deliberately put on a work queue outside of
      the normal boot sequence to avoid unnecessarily delaying boot.
      
      Adding a hook is doable, but somewhat gross due to KVM's ability to be
      built as a module.  And if the TSC is constant, which is likely the case
      for every VMX/SVM-capable CPU produced in the last decade, the race can be
      hit if and only if userspace is able to create a VM before TSC refinement
      completes; refinement is slow, but not that slow.
      
      For now, punt on a proper fix, as not taking a snapshot can help some uses
      cases and not taking a snapshot is arguably correct irrespective of the
      race with refinement.
      Signed-off-by: default avatarAnton Romanov <romanton@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220608183525.1143682-1-romanton@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      3ebcbd22
    • Sean Christopherson's avatar
      KVM: selftests: Verify userspace can stuff IA32_FEATURE_CONTROL at will · b80732fd
      Sean Christopherson authored
      Verify the KVM allows userspace to set all supported bits in the
      IA32_FEATURE_CONTROL MSR irrespective of the current guest CPUID, and
      that all unsupported bits are rejected.
      
      Throw the testcase into vmx_msrs_test even though it's not technically a
      VMX MSR; it's close enough, and the most frequently feature controlled by
      the MSR is VMX.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-4-seanjc@google.com
      b80732fd
    • Sean Christopherson's avatar
      KVM: VMX: Move MSR_IA32_FEAT_CTL.LOCKED check into "is valid" helper · 2d6cd686
      Sean Christopherson authored
      Move the check on IA32_FEATURE_CONTROL being locked, i.e. read-only from
      the guest, into the helper to check the overall validity of the incoming
      value.  Opportunistically rename the helper to make it clear that it
      returns a bool.
      
      No functional change intended.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-3-seanjc@google.com
      2d6cd686
    • Sean Christopherson's avatar
      KVM: VMX: Allow userspace to set all supported FEATURE_CONTROL bits · d2a00af2
      Sean Christopherson authored
      Allow userspace to set all supported bits in MSR IA32_FEATURE_CONTROL
      irrespective of the guest CPUID model, e.g. via KVM_SET_MSRS.  KVM's ABI
      is that userspace is allowed to set MSRs before CPUID, i.e. can set MSRs
      to values that would fault according to the guest CPUID model.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220607232353.3375324-2-seanjc@google.com
      d2a00af2
    • Sean Christopherson's avatar
      KVM: VMX: Make vmread_error_trampoline() uncallable from C code · 0b5e7a16
      Sean Christopherson authored
      Declare vmread_error_trampoline() as an opaque symbol so that it cannot
      be called from C code, at least not without some serious fudging.  The
      trampoline always passes parameters on the stack so that the inline
      VMREAD sequence doesn't need to clobber registers.  regparm(0) was
      originally added to document the stack behavior, but it ended up being
      confusing because regparm(0) is a nop for 64-bit targets.
      
      Opportunustically wrap the trampoline and its declaration in #ifdeffery
      to make it even harder to invoke incorrectly, to document why it exists,
      and so that it's not left behind if/when CONFIG_CC_HAS_ASM_GOTO_OUTPUT
      is true for all supported toolchains.
      
      No functional change intended.
      
      Cc: Uros Bizjak <ubizjak@gmail.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220928232015.745948-1-seanjc@google.com
      0b5e7a16
    • Sean Christopherson's avatar
      KVM: nVMX: Reword comments about generating nested CR0/4 read shadows · 4a8fd4a7
      Sean Christopherson authored
      Reword the comments that (attempt to) document nVMX's overrides of the
      CR0/4 read shadows for L2 after calling vmx_set_cr0/4().  The important
      behavior that needs to be documented is that KVM needs to override the
      shadows to account for L1's masks even though the shadows are set by the
      common helpers (and that setting the shadows first would result in the
      correct shadows being clobbered).
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Reviewed-by: default avatarJim Mattson <jmattson@google.com>
      Link: https://lore.kernel.org/r/20220831000721.4066617-1-seanjc@google.com
      4a8fd4a7
    • Sean Christopherson's avatar
      KVM: x86: Clean up KVM_CAP_X86_USER_SPACE_MSR documentation · 1f158147
      Sean Christopherson authored
      Clean up the KVM_CAP_X86_USER_SPACE_MSR documentation to eliminate
      misleading and/or inconsistent verbiage, and to actually document what
      accesses are intercepted by which flags.
      
        - s/will/may since not all #GPs are guaranteed to be intercepted
        - s/deflect/intercept to align with common KVM terminology
        - s/user space/userspace to align with the majority of KVM docs
        - Avoid using "trap" terminology, as KVM exits to userspace _before_
          stepping, i.e. doesn't exhibit trap-like behavior
        - Actually document the flags
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-4-seanjc@google.com
      1f158147
    • Sean Christopherson's avatar
      KVM: x86: Reword MSR filtering docs to more precisely define behavior · b93d2ec3
      Sean Christopherson authored
      Reword the MSR filtering documentatiion to more precisely define the
      behavior of filtering using common virtualization terminology.
      
        - Explicitly document KVM's behavior when an MSR is denied
        - s/handled/allowed as there is no guarantee KVM will "handle" the
          MSR access
        - Drop the "fall back" terminology, which incorrectly suggests that
          there is existing KVM behavior to fall back to
        - Fix an off-by-one error in the range (the end is exclusive)
        - Call out the interaction between MSR filtering and
          KVM_CAP_X86_USER_SPACE_MSR's KVM_MSR_EXIT_REASON_FILTER
        - Delete the redundant paragraph on what '0' and '1' in the bitmap
          means, it's covered by the sections on KVM_MSR_FILTER_{READ,WRITE}
        - Delete the clause on x2APIC MSR behavior depending on APIC base, this
          is covered by stating that KVM follows architectural behavior when
          emulating/virtualizing MSR accesses
      Reported-by: default avatarAaron Lewis <aaronlewis@google.com>
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-3-seanjc@google.com
      b93d2ec3
    • Sean Christopherson's avatar
      KVM: x86: Delete documentation for READ|WRITE in KVM_X86_SET_MSR_FILTER · 5c8c0b32
      Sean Christopherson authored
      Delete the paragraph that describes the behavior when both
      KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE are set for a range.  There is
      nothing special about KVM's handling of this combination, whereas
      explicitly documenting the combination suggests that there is some magic
      behavior the user needs to be aware of.
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220831001706.4075399-2-seanjc@google.com
      5c8c0b32
    • Jim Mattson's avatar
      KVM: VMX: Execute IBPB on emulated VM-exit when guest has IBRS · 2e7eab81
      Jim Mattson authored
      According to Intel's document on Indirect Branch Restricted
      Speculation, "Enabling IBRS does not prevent software from controlling
      the predicted targets of indirect branches of unrelated software
      executed later at the same predictor mode (for example, between two
      different user applications, or two different virtual machines). Such
      isolation can be ensured through use of the Indirect Branch Predictor
      Barrier (IBPB) command." This applies to both basic and enhanced IBRS.
      
      Since L1 and L2 VMs share hardware predictor modes (guest-user and
      guest-kernel), hardware IBRS is not sufficient to virtualize
      IBRS. (The way that basic IBRS is implemented on pre-eIBRS parts,
      hardware IBRS is actually sufficient in practice, even though it isn't
      sufficient architecturally.)
      
      For virtual CPUs that support IBRS, add an indirect branch prediction
      barrier on emulated VM-exit, to ensure that the predicted targets of
      indirect branches executed in L1 cannot be controlled by software that
      was executed in L2.
      
      Since we typically don't intercept guest writes to IA32_SPEC_CTRL,
      perform the IBPB at emulated VM-exit regardless of the current
      IA32_SPEC_CTRL.IBRS value, even though the IBPB could technically be
      deferred until L1 sets IA32_SPEC_CTRL.IBRS, if IA32_SPEC_CTRL.IBRS is
      clear at emulated VM-exit.
      
      This is CVE-2022-2196.
      
      Fixes: 5c911bef ("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02")
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221019213620.1953281-3-jmattson@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      2e7eab81
    • Jim Mattson's avatar
      KVM: VMX: Guest usage of IA32_SPEC_CTRL is likely · 4f209989
      Jim Mattson authored
      At this point in time, most guests (in the default, out-of-the-box
      configuration) are likely to use IA32_SPEC_CTRL.  Therefore, drop the
      compiler hint that it is unlikely for KVM to be intercepting WRMSR of
      IA32_SPEC_CTRL.
      Signed-off-by: default avatarJim Mattson <jmattson@google.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221019213620.1953281-2-jmattson@google.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      4f209989
    • Sean Christopherson's avatar
      KVM: nVMX: Inject #GP, not #UD, if "generic" VMXON CR0/CR4 check fails · 9cc40932
      Sean Christopherson authored
      Inject #GP for if VMXON is attempting with a CR0/CR4 that fails the
      generic "is CRx valid" check, but passes the CR4.VMXE check, and do the
      generic checks _after_ handling the post-VMXON VM-Fail.
      
      The CR4.VMXE check, and all other #UD cases, are special pre-conditions
      that are enforced prior to pivoting on the current VMX mode, i.e. occur
      before interception if VMXON is attempted in VMX non-root mode.
      
      All other CR0/CR4 checks generate #GP and effectively have lower priority
      than the post-VMXON check.
      
      Per the SDM:
      
          IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
              THEN #UD;
          ELSIF not in VMX operation
              THEN
                  IF (CPL > 0) or (in A20M mode) or
                  (the values of CR0 and CR4 are not supported in VMX operation)
                      THEN #GP(0);
          ELSIF in VMX non-root operation
              THEN VMexit;
          ELSIF CPL > 0
              THEN #GP(0);
          ELSE VMfail("VMXON executed in VMX root operation");
          FI;
      
      which, if re-written without ELSIF, yields:
      
          IF (register operand) or (CR0.PE = 0) or (CR4.VMXE = 0) or ...
              THEN #UD
      
          IF in VMX non-root operation
              THEN VMexit;
      
          IF CPL > 0
              THEN #GP(0)
      
          IF in VMX operation
              THEN VMfail("VMXON executed in VMX root operation");
      
          IF (in A20M mode) or
             (the values of CR0 and CR4 are not supported in VMX operation)
                      THEN #GP(0);
      
      Note, KVM unconditionally forwards VMXON VM-Exits that occur in L2 to L1,
      i.e. there is no need to check the vCPU is not in VMX non-root mode.  Add
      a comment to explain why unconditionally forwarding such exits is
      functionally correct.
      Reported-by: default avatarEric Li <ercli@ucdavis.edu>
      Fixes: c7d855c2 ("KVM: nVMX: Inject #UD if VMXON is attempted with incompatible CR0/CR4")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20221006001956.329314-1-seanjc@google.com
      9cc40932
    • Zhao Liu's avatar
      KVM: SVM: Replace kmap_atomic() with kmap_local_page() · a8a12c00
      Zhao Liu authored
      The use of kmap_atomic() is being deprecated in favor of
      kmap_local_page()[1].
      
      The main difference between atomic and local mappings is that local
      mappings don't disable page faults or preemption.
      
      There're 2 reasons we can use kmap_local_page() here:
      1. SEV is 64-bit only and kmap_local_page() doesn't disable migration in
      this case, but here the function clflush_cache_range() uses CLFLUSHOPT
      instruction to flush, and on x86 CLFLUSHOPT is not CPU-local and flushes
      the page out of the entire cache hierarchy on all CPUs (APM volume 3,
      chapter 3, CLFLUSHOPT). So there's no need to disable preemption to ensure
      CPU-local.
      2. clflush_cache_range() doesn't need to disable pagefault and the mapping
      is still valid even if sleeps. This is also true for sched out/in when
      preempted.
      
      In addition, though kmap_local_page() is a thin wrapper around
      page_address() on 64-bit, kmap_local_page() should still be used here in
      preference to page_address() since page_address() isn't suitable to be used
      in a generic function (like sev_clflush_pages()) where the page passed in
      is not easy to determine the source of allocation. Keeping the kmap* API in
      place means it can be used for things other than highmem mappings[2].
      
      Therefore, sev_clflush_pages() is a function that should use
      kmap_local_page() in place of kmap_atomic().
      
      Convert the calls of kmap_atomic() / kunmap_atomic() to kmap_local_page() /
      kunmap_local().
      
      [1]: https://lore.kernel.org/all/20220813220034.806698-1-ira.weiny@intel.com
      [2]: https://lore.kernel.org/lkml/5d667258-b58b-3d28-3609-e7914c99b31b@intel.com/Suggested-by: default avatarDave Hansen <dave.hansen@intel.com>
      Suggested-by: default avatarIra Weiny <ira.weiny@intel.com>
      Suggested-by: default avatarFabio M. De Francesco <fmdefrancesco@gmail.com>
      Signed-off-by: default avatarZhao Liu <zhao1.liu@intel.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220928092748.463631-1-zhao1.liu@linux.intel.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      a8a12c00
    • Sean Christopherson's avatar
      KVM: SVM: Skip WRMSR fastpath on VM-Exit if next RIP isn't valid · 5c30e810
      Sean Christopherson authored
      Skip the WRMSR fastpath in SVM's VM-Exit handler if the next RIP isn't
      valid, e.g. because KVM is running with nrips=false.  SVM must decode and
      emulate to skip the WRMSR if the CPU doesn't provide the next RIP.
      Getting the instruction bytes to decode the WRMSR requires reading guest
      memory, which in turn means dereferencing memslots, and that isn't safe
      because KVM doesn't hold SRCU when the fastpath runs.
      
      Don't bother trying to enable the fastpath for this case, e.g. by doing
      only the WRMSR and leaving the "skip" until later.  NRIPS is supported on
      all modern CPUs (KVM has considered making it mandatory), and the next
      RIP will be valid the vast, vast majority of the time.
      
        =============================
        WARNING: suspicious RCU usage
        6.0.0-smp--4e557fcd3d80-skip #13 Tainted: G           O
        -----------------------------
        include/linux/kvm_host.h:954 suspicious rcu_dereference_check() usage!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by stable/206475:
         #0: ffff9d9dfebcc0f0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x8b/0x620 [kvm]
      
        stack backtrace:
        CPU: 152 PID: 206475 Comm: stable Tainted: G           O       6.0.0-smp--4e557fcd3d80-skip #13
        Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 10.48.0 01/27/2022
        Call Trace:
         <TASK>
         dump_stack_lvl+0x69/0xaa
         dump_stack+0x10/0x12
         lockdep_rcu_suspicious+0x11e/0x130
         kvm_vcpu_gfn_to_memslot+0x155/0x190 [kvm]
         kvm_vcpu_gfn_to_hva_prot+0x18/0x80 [kvm]
         paging64_walk_addr_generic+0x183/0x450 [kvm]
         paging64_gva_to_gpa+0x63/0xd0 [kvm]
         kvm_fetch_guest_virt+0x53/0xc0 [kvm]
         __do_insn_fetch_bytes+0x18b/0x1c0 [kvm]
         x86_decode_insn+0xf0/0xef0 [kvm]
         x86_emulate_instruction+0xba/0x790 [kvm]
         kvm_emulate_instruction+0x17/0x20 [kvm]
         __svm_skip_emulated_instruction+0x85/0x100 [kvm_amd]
         svm_skip_emulated_instruction+0x13/0x20 [kvm_amd]
         handle_fastpath_set_msr_irqoff+0xae/0x180 [kvm]
         svm_vcpu_run+0x4b8/0x5a0 [kvm_amd]
         vcpu_enter_guest+0x16ca/0x22f0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x39d/0x900 [kvm]
         kvm_vcpu_ioctl+0x538/0x620 [kvm]
         __se_sys_ioctl+0x77/0xc0
         __x64_sys_ioctl+0x1d/0x20
         do_syscall_64+0x3d/0x80
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 404d5d7b ("KVM: X86: Introduce more exit_fastpath_completion enum values")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220930234031.1732249-1-seanjc@google.com
      5c30e810
    • Sean Christopherson's avatar
      KVM: x86: Fail emulation during EMULTYPE_SKIP on any exception · 17122c06
      Sean Christopherson authored
      Treat any exception during instruction decode for EMULTYPE_SKIP as a
      "full" emulation failure, i.e. signal failure instead of queuing the
      exception.  When decoding purely to skip an instruction, KVM and/or the
      CPU has already done some amount of emulation that cannot be unwound,
      e.g. on an EPT misconfig VM-Exit KVM has already processeed the emulated
      MMIO.  KVM already does this if a #UD is encountered, but not for other
      exceptions, e.g. if a #PF is encountered during fetch.
      
      In SVM's soft-injection use case, queueing the exception is particularly
      problematic as queueing exceptions while injecting events can put KVM
      into an infinite loop due to bailing from VM-Enter to service the newly
      pending exception.  E.g. multiple warnings to detect such behavior fire:
      
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9873 kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
        Modules linked in: kvm_amd ccp kvm irqbypass
        CPU: 3 PID: 1017 Comm: svm_nested_soft Not tainted 6.0.0-rc1+ #220
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x1de5/0x20a0 [kvm]
        Call Trace:
         kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
         __x64_sys_ioctl+0x85/0xc0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        ---[ end trace 0000000000000000 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 3 PID: 1017 at arch/x86/kvm/x86.c:9987 kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
        Modules linked in: kvm_amd ccp kvm irqbypass
        CPU: 3 PID: 1017 Comm: svm_nested_soft Tainted: G        W          6.0.0-rc1+ #220
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_arch_vcpu_ioctl_run+0x12a3/0x20a0 [kvm]
        Call Trace:
         kvm_vcpu_ioctl+0x223/0x6d0 [kvm]
         __x64_sys_ioctl+0x85/0xc0
         do_syscall_64+0x2b/0x50
         entry_SYSCALL_64_after_hwframe+0x46/0xb0
        ---[ end trace 0000000000000000 ]---
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Signed-off-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/20220930233632.1725475-1-seanjc@google.com
      17122c06
    • Peng Hao's avatar
      KVM: x86: Keep the lock order consistent between SRCU and gpc spinlock · 4265df66
      Peng Hao authored
      Acquire SRCU before taking the gpc spinlock in wait_pending_event() so as
      to be consistent with all other functions that acquire both locks.  It's
      not illegal to acquire SRCU inside a spinlock, nor is there deadlock
      potential, but in general it's preferable to order locks from least
      restrictive to most restrictive, e.g. if wait_pending_event() needed to
      sleep for whatever reason, it could do so while holding SRCU, but would
      need to drop the spinlock.
      Signed-off-by: default avatarPeng Hao <flyingpeng@tencent.com>
      Reviewed-by: default avatarSean Christopherson <seanjc@google.com>
      Link: https://lore.kernel.org/r/CAPm50a++Cb=QfnjMZ2EnCj-Sb9Y4UM-=uOEtHAcjnNLCAAf-dQ@mail.gmail.comSigned-off-by: default avatarSean Christopherson <seanjc@google.com>
      4265df66
  3. 30 Nov, 2022 2 commits