1. 19 Oct, 2023 2 commits
  2. 18 Oct, 2023 2 commits
  3. 17 Oct, 2023 4 commits
  4. 10 Oct, 2023 1 commit
      KVM: x86: Don't sync user-written TSC against startup values · bf328e22
      Like Xu authored
      The legacy API for setting the TSC is fundamentally broken, and only
      allows userspace to set a TSC "now", without any way to account for
      time lost between the calculation of the value, and the kernel eventually
      handling the ioctl.
      
      To work around this, KVM has a hack which, if a TSC is set with a value
      which is within a second's worth of the last TSC "written" to any vCPU in
      the VM, assumes that userspace actually intended the two TSC values to be
      in sync and adjusts the newly-written TSC value accordingly.
      
      Thus, when a VMM restores a guest after suspend or migration using the
      legacy API, the TSCs aren't necessarily *right*, but at least they're
      in sync.
      
      This trick falls down when restoring a guest which genuinely has been
      running for less time than the 1 second of imprecision KVM allows for in
      the legacy API.  On *creation*, the first vCPU starts its TSC counting
      from zero, and the subsequent vCPUs synchronize to that.  But then when
      the VMM tries to restore a vCPU's intended TSC, because the VM has been
      alive for less than 1 second and KVM's default TSC value for new vCPUs is
      '0', the intended TSC is within a second of the last "written" TSC and KVM
      incorrectly adjusts the intended TSC in an attempt to synchronize.
      
      But further hacks can be piled onto KVM's existing hackish ABI, and
      declare that the *first* value written by *userspace* (on any vCPU)
      should not be subject to this "correction", i.e. KVM can assume that the
      first write from userspace is not an attempt to sync up with TSC values
      that only come from the kernel's default vCPU creation.
      
      To that end: Add a flag, kvm->arch.user_set_tsc, protected by
      kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in
      the VM *has* been set by userspace, and make the 1-second slop hack only
      trigger if user_set_tsc is already set.
      
      Note that userspace can explicitly request a *synchronization* of the
      TSC by writing zero. For the purpose of user_set_tsc, an explicit
      synchronization counts as "setting" the TSC, i.e. if userspace then
      subsequently writes an explicit non-zero value which happens to be within
      1 second of the previous value, the new value will be "corrected".  This
      behavior is deliberate, as treating explicit synchronization as "setting"
      the TSC preserves KVM's existing behavior inasmuch as possible (KVM
      always applied the 1-second "correction" regardless of whether the write
      came from userspace vs. the kernel).
      Reported-by: Yong He <alexyonghe@tencent.com>
      Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423
      Suggested-by: Oliver Upton <oliver.upton@linux.dev>
      Original-by: Oliver Upton <oliver.upton@linux.dev>
      Original-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Tested-by: Yong He <alexyonghe@tencent.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
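The decision described in this commit message can be sketched in a few lines of plain C. This is a simplified model, not the kernel code: `struct vm_tsc_state` and `tsc_write_should_sync()` are hypothetical names, time elapsed between writes is ignored, and locking (kvm->arch.tsc_write_lock in the real code) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified per-VM state; not the kernel's struct kvm_arch. */
struct vm_tsc_state {
	uint64_t last_tsc_write;  /* last TSC value "written" to any vCPU */
	uint64_t tsc_hz;          /* guest TSC frequency, i.e. one second of slop */
	bool user_set_tsc;        /* has userspace written a TSC at least once? */
};

/*
 * Returns true if a write of 'data' is treated as an attempt to synchronize
 * with the previously written TSC, false if the value is taken as-is.
 * Assumption: elapsed time since the last write is ignored in this sketch.
 */
bool tsc_write_should_sync(struct vm_tsc_state *vm, uint64_t data,
			   bool from_user)
{
	bool synchronizing = false;

	if (data == 0) {
		/* Writing zero is an explicit synchronization request. */
		synchronizing = true;
	} else if (vm->user_set_tsc) {
		/* The 1-second slop hack applies only after a userspace write. */
		synchronizing = data < vm->last_tsc_write + vm->tsc_hz &&
				data + vm->tsc_hz > vm->last_tsc_write;
	}

	/* Any userspace write, including an explicit sync, counts as "set". */
	if (from_user)
		vm->user_set_tsc = true;

	/* Simplification: only a non-synchronizing write records a new value. */
	if (!synchronizing)
		vm->last_tsc_write = data;

	return synchronizing;
}
```

In this model, the first non-zero userspace write after vCPU creation is never "corrected", while a second write within one second of it still is.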
  5. 09 Oct, 2023 3 commits
  6. 06 Oct, 2023 1 commit
      KVM: x86: Refine calculation of guest wall clock to use a single TSC read · 5d6d6a7d
      David Woodhouse authored
      When populating the guest's PV wall clock information, KVM currently does
      a simple 'ktime_get_real_ns() - get_kvmclock_ns(kvm)'. This is an antipattern
      which should be avoided; when working with the relationship between two
      clocks, it's never correct to obtain one of them "now" and then the other
      at a slightly different "now" after an unspecified period of preemption
      (which might not even be under the control of the kernel, if this is an
      L1 hosting an L2 guest under nested virtualization).
      
      Add a kvm_get_wall_clock_epoch() function to return the guest wall clock
      epoch in nanoseconds using the same method as __get_kvmclock() — by using
      kvm_get_walltime_and_clockread() to calculate both the wall clock and KVM
      clock time from a *single* TSC reading.
      
      The condition using get_cpu_tsc_khz() is equivalent to the version in
      __get_kvmclock(), which separately checks for the CONSTANT_TSC feature or
      the per-CPU cpu_tsc_khz, which is what get_cpu_tsc_khz() does anyway.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lore.kernel.org/r/bfc6d3d7cfb88c47481eabbf5a30a264c58c7789.camel@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
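A toy model shows why the single-reading approach matters. Everything here is an assumption made for illustration: both clocks are modeled as linear functions of one counter (1 tick = 1 ns), and `snapshot_at()`, `wall_clock_epoch()` and `epoch_from_two_reads()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical paired snapshot: both clocks derived from one counter read. */
struct clock_snapshot {
	uint64_t wall_ns;      /* host wall clock, in ns */
	uint64_t kvmclock_ns;  /* guest KVM clock, in ns */
};

/* Toy model: both clocks are linear in the same counter, 1 tick = 1 ns. */
struct clock_snapshot snapshot_at(uint64_t counter, uint64_t wall_base,
				  uint64_t kvmclock_base)
{
	struct clock_snapshot snap = {
		.wall_ns = wall_base + counter,
		.kvmclock_ns = counter - kvmclock_base,
	};
	return snap;
}

/* Epoch from ONE reading: the counter term cancels, so the result is stable. */
uint64_t wall_clock_epoch(const struct clock_snapshot *snap)
{
	return snap->wall_ns - snap->kvmclock_ns;
}

/*
 * The antipattern: each clock is sampled at a different "now". Whatever time
 * passes between the two reads leaks into the result as an error.
 */
uint64_t epoch_from_two_reads(uint64_t counter_wall, uint64_t counter_kvm,
			      uint64_t wall_base, uint64_t kvmclock_base)
{
	uint64_t wall = snapshot_at(counter_wall, wall_base, kvmclock_base).wall_ns;
	uint64_t kvm = snapshot_at(counter_kvm, wall_base, kvmclock_base).kvmclock_ns;

	return wall - kvm;
}
```

With a single reading, the epoch is the same whenever it is sampled. With two reads, the epoch is off by exactly the delay between them.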
  7. 04 Oct, 2023 2 commits
  8. 28 Sep, 2023 1 commit
  9. 27 Sep, 2023 2 commits
  10. 23 Sep, 2023 6 commits
      Merge tag 'kvm-riscv-fixes-6.6-1' of https://github.com/kvm-riscv/linux into HEAD · 5804c19b
      Paolo Bonzini authored
      KVM/riscv fixes for 6.6, take #1
      
      - Fix KVM_GET_REG_LIST API for ISA_EXT registers
      - Fix reading ISA_EXT register of a missing extension
      - Fix ISA_EXT register handling in get-reg-list test
      - Fix filtering of AIA registers in get-reg-list test
      KVM: SVM: Do not use user return MSR support for virtualized TSC_AUX · 916e3e5f
      Tom Lendacky authored
      When the TSC_AUX MSR is virtualized, the TSC_AUX value is swap type "B"
      within the VMSA. This means that the guest value is loaded on VMRUN and
      the host value is restored from the host save area on #VMEXIT.
      
      Since the value is restored on #VMEXIT, the KVM user return MSR support
      for TSC_AUX can be replaced by populating the host save area with the
      current host value of TSC_AUX. And, since TSC_AUX is not changed by Linux
      post-boot, the host save area can be set once in svm_hardware_enable().
      This eliminates the two WRMSR instructions associated with the user return
      MSR support.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <d381de38eb0ab6c9c93dda8503b72b72546053d7.1694811272.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: SVM: Fix TSC_AUX virtualization setup · e0096d01
      Tom Lendacky authored
      The checks for virtualizing TSC_AUX occur during the vCPU reset processing
      path. However, at the time of initial vCPU reset processing, when the vCPU
      is first created, not all of the guest CPUID information has been set. In
      this case the RDTSCP and RDPID feature support for the guest is not in
      place and so TSC_AUX virtualization is not established.
      
      This continues for each vCPU created for the guest. On the first boot of
      an AP, vCPU reset processing is executed as a result of an APIC INIT
      event, this time with all of the guest CPUID information set, resulting
      in TSC_AUX virtualization being enabled, but only for the APs. The BSP
      always sees a TSC_AUX value of 0 which probably went unnoticed because,
      at least for Linux, the BSP TSC_AUX value is 0.
      
      Move the TSC_AUX virtualization enablement out of the init_vmcb() path and
      into the vcpu_after_set_cpuid() path to allow for proper initialization of
      the support after the guest CPUID information has been set.
      
      With the TSC_AUX virtualization support now in the vcpu_after_set_cpuid()
      path, the intercepts must be either cleared or set based on the guest
      CPUID input.
      
      Fixes: 296d5a17 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <4137fbcb9008951ab5f0befa74a0399d2cce809a.1694811272.git.thomas.lendacky@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
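The ordering problem and its fix can be modeled roughly as follows. This is a sketch under stated assumptions, not the SVM code: `struct vcpu_model` and both functions are hypothetical stand-ins, and the host's TSC_AUX virtualization capability is passed in as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified vCPU state; field names are illustrative only. */
struct vcpu_model {
	bool guest_has_rdtscp;     /* guest CPUID advertises RDTSCP */
	bool guest_has_rdpid;      /* guest CPUID advertises RDPID */
	bool tsc_aux_intercepted;  /* TSC_AUX accesses trap to the host */
};

/*
 * Reset path: guest CPUID may not be populated yet on the first reset, so
 * this sketch only applies a safe default here and makes no CPUID decision.
 */
void vcpu_reset(struct vcpu_model *vcpu)
{
	vcpu->tsc_aux_intercepted = true;
}

/*
 * After-set-CPUID path: guest CPUID is known, so the intercept is explicitly
 * cleared or set depending on what the guest was given.
 */
void vcpu_after_set_cpuid(struct vcpu_model *vcpu, bool host_has_v_tsc_aux)
{
	bool virtualize = host_has_v_tsc_aux &&
			  (vcpu->guest_has_rdtscp || vcpu->guest_has_rdpid);

	vcpu->tsc_aux_intercepted = !virtualize;
}
```

Because the second function both clears and sets the intercept, calling it again after userspace changes the guest CPUID gives the right result in either direction.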
      KVM: SVM: INTERCEPT_RDTSCP is never intercepted anyway · e8d93d5d
      Paolo Bonzini authored
      svm_recalc_instruction_intercepts() is always called at least once
      before the vCPU is started, so the setting or clearing of the RDTSCP
      intercept can be dropped from the TSC_AUX virtualization support.
      
      Extracted from a patch by Tom Lendacky.
      
      Cc: stable@vger.kernel.org
      Fixes: 296d5a17 ("KVM: SEV-ES: Use V_TSC_AUX if available instead of RDTSC/MSR_TSC_AUX intercepts")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      KVM: x86/mmu: Stop zapping invalidated TDP MMU roots asynchronously · 0df9dab8
      Sean Christopherson authored
      Stop zapping invalidated TDP MMU roots via work queue now that KVM
      preserves TDP MMU roots until they are explicitly invalidated.  Zapping
      roots asynchronously was effectively a workaround to avoid stalling a vCPU
      for an extended duration if a vCPU unloaded a root, which at the time
      happened whenever the guest toggled CR0.WP (a frequent operation for some
      guest kernels).
      
      While a clever hack, zapping roots via an unbound worker had subtle,
      unintended consequences on host scheduling, especially when zapping
      multiple roots, e.g. as part of a memslot.  Because the work of zapping a
      root is no longer bound to the task that initiated the zap, things like
      the CPU affinity and priority of the original task get lost.  Losing the
      affinity and priority can be especially problematic if unbound workqueues
      aren't affined to a small number of CPUs, as zapping multiple roots can
      cause KVM to heavily utilize the majority of CPUs in the system, *beyond*
      the CPUs KVM is already using to run vCPUs.
      
      When deleting a memslot via KVM_SET_USER_MEMORY_REGION, the async root
      zap can result in KVM occupying all logical CPUs for ~8ms, and result in
      high priority tasks not being scheduled in a timely manner.  In v5.15,
      which doesn't preserve unloaded roots, the issues were even more noticeable
      as KVM would zap roots more frequently and could occupy all CPUs for 50ms+.
      
      Consuming all CPUs for an extended duration can lead to significant jitter
      throughout the system, e.g. on ChromeOS with virtio-gpu, deleting memslots
      is a semi-frequent operation as memslots are deleted and recreated with
      different host virtual addresses to react to host GPU drivers allocating
      and freeing GPU blobs.  On ChromeOS, the jitter manifests as audio blips
      during games due to the audio server's tasks not getting scheduled in
      promptly, despite the tasks having a high realtime priority.
      
      Deleting memslots isn't exactly a fast path and should be avoided when
      possible, and ChromeOS is working towards utilizing MAP_FIXED to avoid the
      memslot shenanigans, but KVM is squarely in the wrong.  Not to mention
      that removing the async zapping eliminates a non-trivial amount of
      complexity.
      
      Note, one of the subtle behaviors hidden behind the async zapping is that
      KVM would zap invalidated roots only once (ignoring partial zaps from
      things like mmu_notifier events).  Preserve this behavior by adding a flag
      to identify roots that are scheduled to be zapped versus roots that have
      already been zapped but not yet freed.
      
      Add a comment calling out why kvm_tdp_mmu_invalidate_all_roots() can
      encounter invalid roots, as it's not at all obvious why zapping
      invalidated roots shouldn't simply zap all invalid roots.
      Reported-by: Pattara Teerapong <pteerapong@google.com>
      Cc: David Stevens <stevensd@google.com>
      Cc: Yiwei Zhang <zzyiwei@google.com>
      Cc: Paul Hsia <paulhsia@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20230916003916.2545000-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
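The "zap only once" behavior that the flag preserves can be illustrated with a small model. The structure and function names below are hypothetical and chosen for this sketch. The real code walks a list of root pages under the MMU lock, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified root; the real structure has many more fields. */
struct root_model {
	bool invalid;           /* root has been invalidated */
	bool scheduled_to_zap;  /* invalidated, but not yet zapped */
	int zap_count;          /* number of full zaps, for illustration */
};

/* Invalidate every still-valid root and mark it as needing a zap. */
void invalidate_all_roots(struct root_model *roots, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* Roots invalidated earlier may linger until they are freed. */
		if (roots[i].invalid)
			continue;
		roots[i].invalid = true;
		roots[i].scheduled_to_zap = true;
	}
}

/* Synchronously zap only the roots that are still scheduled for a zap. */
void zap_invalidated_roots(struct root_model *roots, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!roots[i].scheduled_to_zap)
			continue;
		roots[i].scheduled_to_zap = false;
		roots[i].zap_count++;
	}
}
```

A root that was invalidated and zapped in an earlier pass but has not been freed yet is skipped by both functions, so it is never zapped a second time.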
      KVM: x86/mmu: Do not filter address spaces in for_each_tdp_mmu_root_yield_safe() · 441a5dfc
      Paolo Bonzini authored
      All callers except the MMU notifier want to process all address spaces.
      Remove the address space ID argument of for_each_tdp_mmu_root_yield_safe()
      and switch the MMU notifier to use __for_each_tdp_mmu_root_yield_safe().
      
      Extracted out of a patch by Sean Christopherson <seanjc@google.com>
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
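The shape of this change, an unfiltered iterator for most callers plus a filtered variant for the one caller that needs it, might look like the following over a plain array. The macro and type names are hypothetical, and the yield-safe and locking aspects of the real iterators are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical root with an address space ID and a visit counter. */
struct as_root {
	int as_id;
	int visits;
};

/* Unfiltered walk: visits every root and takes no address space argument. */
#define for_each_root(_roots, _n, _i) \
	for ((_i) = 0; (_i) < (_n); (_i)++)

/* Filtered walk: the empty-if/else idiom skips roots in other address spaces. */
#define for_each_root_in_as(_roots, _n, _i, _as_id)  \
	for ((_i) = 0; (_i) < (_n); (_i)++)          \
		if ((_roots)[(_i)].as_id != (_as_id)) { \
		} else

/* Most callers: process all address spaces. */
size_t visit_all_roots(struct as_root *roots, size_t n)
{
	size_t i, count = 0;

	for_each_root(roots, n, i) {
		roots[i].visits++;
		count++;
	}
	return count;
}

/* The notifier-style caller: process a single address space only. */
size_t visit_roots_in_as(struct as_root *roots, size_t n, int as_id)
{
	size_t i, count = 0;

	for_each_root_in_as(roots, n, i, as_id) {
		roots[i].visits++;
		count++;
	}
	return count;
}
```

Dropping the argument from the common iterator means a caller can no longer filter by accident. Filtering has to be requested through the separate variant.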
  11. 21 Sep, 2023 5 commits
  12. 20 Sep, 2023 1 commit
      KVM: selftests: Assert that vasprintf() is successful · 7c329bbd
      Sean Christopherson authored
      Assert that vasprintf() succeeds as the "returned" string is undefined
      on failure.  Checking the result also eliminates the only warning with
      default options in KVM selftests, i.e. it is the only thing getting in the
      way of compiling with -Werror.
      
        lib/test_util.c: In function ‘strdup_printf’:
        lib/test_util.c:390:9: error: ignoring return value of ‘vasprintf’
        declared with attribute ‘warn_unused_result’ [-Werror=unused-result]
        390 |         vasprintf(&str, fmt, ap);
            |         ^~~~~~~~~~~~~~~~~~~~~~~~
      
      Don't bother capturing the return value, as vasprintf() can allegedly
      only fail due to a memory allocation failure.
      
      Fixes: dfaf20af ("KVM: arm64: selftests: Replace str_with_index with strdup_printf")
      Cc: Andrew Jones <ajones@ventanamicro.com>
      Cc: Haibo Xu <haibo1.xu@intel.com>
      Cc: Anup Patel <anup@brainfault.org>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
      Tested-by: Andrew Jones <ajones@ventanamicro.com>
      Message-Id: <20230914010636.1391735-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
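A standalone approximation of the checked helper is shown below. It assumes glibc's vasprintf() (a GNU extension, hence _GNU_SOURCE) and substitutes a plain abort() for the selftests' own assertion machinery, so it is a sketch of the idea rather than the patched function.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* vasprintf() is a GNU extension */
#endif
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Format into a freshly allocated string. On failure the contents of 'str'
 * are undefined, so the result is checked before 'str' is ever used.
 * The caller owns the returned buffer and must free() it.
 */
char *strdup_printf(const char *fmt, ...)
{
	va_list ap;
	char *str;

	va_start(ap, fmt);
	/* The return value is tested in place rather than stored. */
	if (vasprintf(&str, fmt, ap) < 0) {
		fprintf(stderr, "vasprintf() failed\n");
		abort();
	}
	va_end(ap);

	return str;
}
```

For example, `strdup_printf("vcpu%d-%s", 3, "tsc")` yields a heap-allocated "vcpu3-tsc" that the caller later frees.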
  13. 17 Sep, 2023 10 commits