1. 05 Mar, 2024 5 commits
    • KVM: x86/xen: fix recursive deadlock in timer injection · 7a36d680
      David Woodhouse authored
      The fast-path timer delivery introduced a recursive locking deadlock
      when userspace configures a timer which has already expired and is
      delivered immediately. The call to kvm_xen_inject_timer_irqs() can
      call kvm_xen_set_evtchn(), which may take kvm->arch.xen.xen_lock,
      which is already held in kvm_xen_vcpu_get_attr().
      
       ============================================
       WARNING: possible recursive locking detected
       6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G           O
       --------------------------------------------
       xen_shinfo_test/250013 is trying to acquire lock:
       ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]
      
       but task is already holding lock:
       ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]
      
      Now that the gfn_to_pfn_cache has its own self-sufficient locking, its
      callers no longer need to ensure serialization, so just stop taking
      kvm->arch.xen.xen_lock from kvm_xen_set_evtchn().
      
      Fixes: 77c9b9de ("KVM: x86/xen: Use fast path for Xen timer delivery")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
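
The recursive acquisition described in the commit above can be illustrated with a toy model. This is only a sketch: `toy_lock`, `toy_set_evtchn()` and `toy_vcpu_get_attr()` are hypothetical stand-ins, not kernel code, and the toy lock counts a second acquisition instead of deadlocking the way a real non-recursive mutex would.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock (hypothetical): it records whether it is held so
 * that a second acquisition on the same path can be counted, not hung on. */
struct toy_lock {
    bool held;
    int recursive_attempts;
};

static bool toy_lock_acquire(struct toy_lock *l)
{
    if (l->held) {
        /* A real mutex would deadlock here; the toy only counts it. */
        l->recursive_attempts++;
        return false;
    }
    l->held = true;
    return true;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Hypothetical stand-in for the event channel delivery path. */
static void toy_set_evtchn(struct toy_lock *xen_lock, bool take_lock)
{
    bool locked = take_lock && toy_lock_acquire(xen_lock);

    /* ... deliver the event here ... */

    if (locked)
        toy_lock_release(xen_lock);
}

/* Hypothetical stand-in for the attribute handler: it holds the lock and
 * delivers an already-expired timer immediately. Returns the number of
 * recursive acquisition attempts that were observed. */
static int toy_vcpu_get_attr(bool callee_takes_lock)
{
    struct toy_lock xen_lock = { false, 0 };
    bool ok = toy_lock_acquire(&xen_lock);

    assert(ok);
    toy_set_evtchn(&xen_lock, callee_takes_lock);
    toy_lock_release(&xen_lock);
    return xen_lock.recursive_attempts;
}
```

Calling `toy_vcpu_get_attr(true)` models the old behaviour, where the callee retakes the lock its caller already holds; `toy_vcpu_get_attr(false)` models the fix, where the callee no longer takes it.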
    • KVM: pfncache: simplify locking and make more self-contained · 6addfcf2
      David Woodhouse authored
      The locking on the gfn_to_pfn_cache is... interesting. And awful.
      
      There is a rwlock in ->lock which readers take to ensure protection
      against concurrent changes. But __kvm_gpc_refresh() makes assumptions
      that certain fields will not change even while it drops the write lock
      and performs MM operations to revalidate the target PFN and kernel
      mapping.
      
      Commit 93984f19 ("KVM: Fully serialize gfn=>pfn cache refresh via
      mutex") partly addressed that — not by fixing it, but by adding a new
      mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
      calls on a given gfn_to_pfn_cache, but is still only a partial solution.
      
      There is still a theoretical race where __kvm_gpc_refresh() runs in
      parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
      dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
      and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
      ->pfn and ->khva are still valid, and reinstalls those values into the
      structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
      but ->active clear. And a ->khva which looks like a reasonable kernel
      address but is actually unmapped.
      
      All it takes is a subsequent reactivation to cause that ->khva to be
      dereferenced. This would theoretically cause an oops which would look
      something like this:
      
      [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
      [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0
      
      I say "theoretically" because theoretically, that oops that was seen in
      production cannot happen. The code which uses the gfn_to_pfn_cache is
      supposed to have its *own* locking, to further paper over the fact that
      the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
      rwlock abuse is not sufficient.
      
      For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
      shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
      the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
      but the cases where the vcpu or kvm object is being *destroyed*, in
      which case the subsequent reactivation should never happen.
      
      Theoretically.
      
      Nevertheless, this locking abuse is awful and should be fixed, even if
      no clear explanation can be found for how the oops happened. So expand
      the use of the ->refresh_lock mutex to ensure serialization of
      activate/deactivate vs. refresh and make the pfncache locking entirely
      self-sufficient.
      
      This means that a future commit can simplify the locking in the callers,
      such as the Xen emulation code which has an outstanding problem with
      recursive locking of kvm->arch.xen.xen_lock, which will no longer be
      necessary.
      
      The rwlock abuse described above is still not best practice, although
      it's harmless now that the ->refresh_lock is held for the entire duration
      while the offending code drops the write lock, does some other stuff,
      then takes the write lock again and assumes nothing changed. That can
      also be fixed^W cleaned up in a subsequent commit, but this commit is
      a simpler basis for the Xen deadlock fix mentioned above.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org
      [sean: use guard(mutex) to fix a missed unlock]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
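
The refresh vs. deactivate race described in the commit above can be replayed as a single-threaded interleaving. This is a simplified sketch, not the pfncache code: `toy_gpc`, `toy_deactivate()` and `toy_refresh_races()` are hypothetical names, and the `refresh_locked` flag only stands in for the ->refresh_lock mutex (a real mutex would block the caller rather than return false).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced model of a gfn_to_pfn_cache. */
struct toy_gpc {
    bool active;
    bool valid;
    bool refresh_locked;   /* stands in for ->refresh_lock */
    const char *khva;      /* NULL means "unmapped" */
};

/* Deactivate the cache. When serialized, refuse to run while a refresh is
 * in flight; a real mutex would block here until the refresh finishes. */
static bool toy_deactivate(struct toy_gpc *gpc, bool serialized)
{
    if (serialized && gpc->refresh_locked)
        return false;
    gpc->active = false;
    gpc->valid = false;
    gpc->khva = NULL;
    return true;
}

/* Run a refresh with a deactivate attempted in the window where the write
 * lock is dropped. Returns true if the cache ends up valid but inactive,
 * which is the inconsistent state the commit message describes. */
static bool toy_refresh_races(bool serialized)
{
    static const char mapping[] = "page";
    struct toy_gpc gpc = { true, true, false, mapping };
    const char *old_khva;
    bool deactivated;

    gpc.refresh_locked = true;
    old_khva = gpc.khva;          /* snapshot, then "drop the write lock" */

    deactivated = toy_deactivate(&gpc, serialized);

    gpc.khva = old_khva;          /* "retake the lock", assume no change */
    gpc.valid = true;
    gpc.refresh_locked = false;

    if (!deactivated) {
        /* The deferred deactivate runs once the refresh has finished. */
        deactivated = toy_deactivate(&gpc, serialized);
        assert(deactivated);
    }
    return gpc.valid && !gpc.active;
}
```

With `serialized` set to false the model ends with ->valid set and ->active clear; with it set to true the deactivate is ordered after the refresh and the cache ends up cleanly torn down.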
    • KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery · 66e3cf72
      David Woodhouse authored
      The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast
      version will always work for physical unicast", justifying its use of
      kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails.
      
      In fact that assumption isn't true if X2APIC isn't in use by the guest
      and there is (8-bit x)APIC ID aliasing. A single "unicast" destination
      APIC ID *may* then be delivered to multiple vCPUs. Remove the warning,
      and in fact it might as well just call kvm_irq_delivery_to_apic().
      Reported-by: Michal Luczaj <mhal@rbox.co>
      Fixes: fde0451b ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
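
The aliasing mentioned in the commit above can be sketched with a small model. It assumes, as a simplification, that a guest not using X2APIC only compares the low 8 bits of an APIC ID; `toy_count_targets()` is a hypothetical helper, not a KVM function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count how many vCPUs a physical-mode "unicast" destination matches.
 * Simplifying assumption: without X2APIC, only 8 bits of the ID are
 * compared, so IDs that differ only above bit 7 alias to each other. */
static size_t toy_count_targets(const uint32_t *apic_ids, size_t n,
                                uint32_t dest, int x2apic)
{
    size_t hits = 0;

    assert(apic_ids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t id = x2apic ? apic_ids[i] : (apic_ids[i] & 0xff);
        uint32_t d = x2apic ? dest : (dest & 0xff);

        if (id == d)
            hits++;
    }
    return hits;
}
```

For the made-up ID set {0, 1, 256, 257} and destination 1, the model matches one vCPU with X2APIC and two without it, which is the case where a "unicast" delivery reaches more than one vCPU.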
    • KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled · 8e62bf2b
      David Woodhouse authored
      Linux guests since commit b1c3497e ("x86/xen: Add support for
      HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU
      upcall vector when it's advertised in the Xen CPUID leaves.
      
      This upcall is injected through the guest's local APIC as an MSI, unlike
      the older system vector which was merely injected by the hypervisor any
      time the CPU was able to receive an interrupt and the upcall_pending
      flag is set in its vcpu_info.
      
      Effectively, that makes the per-CPU upcall edge triggered instead of
      level triggered, which results in the upcall being lost if the MSI is
      delivered when the local APIC is *disabled*.
      
      Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC
      for a vCPU is software enabled (in fact, on any write to the SPIV
      register which doesn't disable the APIC). Do the same in KVM since KVM
      doesn't provide a way for userspace to intervene and trap accesses to
      the SPIV register of a local APIC emulated by KVM.
      
      Fixes: fde0451b ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
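
The lost-upcall scenario from the commit above can be modelled in a few lines. This is a sketch under stated assumptions: `toy_vcpu` and the `toy_*` functions are hypothetical, and the only hardware detail relied on is that bit 8 of the SPIV register is the APIC software-enable bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SPIV_APIC_ENABLED (1u << 8)   /* software-enable bit of SPIV */

/* Hypothetical, reduced per-vCPU state. */
struct toy_vcpu {
    bool apic_sw_enabled;
    bool evtchn_upcall_pending;   /* mirrors the vcpu_info flag */
    unsigned int upcalls_delivered;
};

/* Edge-style delivery: the pending flag is set, but the interrupt itself
 * is dropped if the local APIC is software-disabled at that moment. */
static void toy_send_upcall(struct toy_vcpu *v)
{
    v->evtchn_upcall_pending = true;
    if (v->apic_sw_enabled)
        v->upcalls_delivered++;
}

/* SPIV write. With recheck_pending set, a write that leaves the APIC
 * enabled re-injects the upcall if the pending flag is still set. */
static void toy_write_spiv(struct toy_vcpu *v, uint32_t val,
                           bool recheck_pending)
{
    v->apic_sw_enabled = (val & TOY_SPIV_APIC_ENABLED) != 0;
    if (recheck_pending && v->apic_sw_enabled && v->evtchn_upcall_pending)
        v->upcalls_delivered++;
}

/* An upcall arrives while the APIC is disabled, then the guest enables
 * the APIC. Returns how many upcalls the guest actually received. */
static unsigned int toy_upcall_scenario(bool recheck_pending)
{
    struct toy_vcpu v = { false, false, 0 };

    toy_send_upcall(&v);
    toy_write_spiv(&v, TOY_SPIV_APIC_ENABLED | 0xff, recheck_pending);
    assert(v.apic_sw_enabled);
    return v.upcalls_delivered;
}
```

Without the recheck the guest never sees the upcall; with it, the pending flag is honoured as soon as the APIC is enabled.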
    • KVM: x86/xen: improve accuracy of Xen timers · 451a7078
      David Woodhouse authored
      A test program such as http://david.woodhou.se/timerlat.c confirms user
      reports that timers are increasingly inaccurate as the lifetime of a
      guest increases. Reporting the actual delay observed when asking for
      100µs of sleep, it starts off OK on a newly-launched guest but gets
      worse over time, giving incorrect sleep times:
      
      root@ip-10-0-193-21:~# ./timerlat -c -n 5
      00000000 latency 103243/100000 (3.2430%)
      00000001 latency 103243/100000 (3.2430%)
      00000002 latency 103242/100000 (3.2420%)
      00000003 latency 103245/100000 (3.2450%)
      00000004 latency 103245/100000 (3.2450%)
      
      The biggest problem is that get_kvmclock_ns() returns inaccurate values
      when the guest TSC is scaled. The guest sees a TSC value scaled from the
      host TSC by a mul/shift conversion (hopefully done in hardware). The
      guest then converts that guest TSC value into nanoseconds using the
      mul/shift conversion given to it by the KVM pvclock information.
      
      But get_kvmclock_ns() performs only a single conversion directly from
      host TSC to nanoseconds, giving a different result. A test program at
      http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
      over a day.
      
      It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
      that. The actual guest hv_clock is per-CPU, and *theoretically* each
      vCPU could be running at a *different* frequency. But this patch is
      needed anyway because...
      
      The other issue with Xen timers was that the code would snapshot the
      host CLOCK_MONOTONIC at some point in time, and then... after a few
      interrupts may have occurred, some preemption perhaps... would also read
      the guest's kvmclock. Then it would proceed under the false assumption
      that those two happened at the *same* time. Any time which *actually*
      elapsed between reading the two clocks was introduced as inaccuracies
      in the time at which the timer fired.
      
      Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
      host TSC just *once*, then use the returned TSC value to calculate the
      kvmclock (making sure to do that the way the guest would instead of
      making the same mistake get_kvmclock_ns() does).
      
      Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
      timers still have to use CLOCK_MONOTONIC. In practice the difference
      between the two won't matter over the timescales involved, as the
      *absolute* values don't matter; just the delta.
      
      This does mean a new variant of kvm_get_time_and_clockread() is needed;
      called kvm_get_monotonic_and_clockread() because that's what it does.
      
      Fixes: 53639526 ("KVM: x86/xen: handle PV timers oneshot mode")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Reviewed-by: Paul Durrant <paul@xen.org>
      Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org
      [sean: massage moved comment, tweak if statement formatting]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
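
The difference between the guest's two-step conversion and a single direct conversion, as described in the commit above, can be reproduced with fixed-point arithmetic. This is a sketch under assumptions, not the kernel implementation: all `toy_*` names are hypothetical, the TSC scaling ratio is assumed to carry 48 fractional bits, the multiplier helper only handles frequencies that pair with a shift of -1, and the frequencies used to exercise it are made up. It relies on the `unsigned __int128` extension provided by GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* pvclock-style scaling: shift the delta, then multiply by a 32-bit
 * fixed-point fraction and keep the integer part. */
static uint64_t toy_scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
    assert(shift > -64 && shift < 64);
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* Multiplier for a TSC frequency just above 2 GHz up to 4 GHz, to be
 * paired with shift == -1 (simplifying assumption of this sketch). */
static uint32_t toy_mul_for_hz(uint64_t tsc_hz)
{
    assert((tsc_hz >> 1) > 1000000000ull && tsc_hz <= 4000000000ull);
    return (uint32_t)((1000000000ull << 32) / (tsc_hz >> 1));
}

/* Host TSC to guest TSC via a ratio with 48 fractional bits (assumed). */
static uint64_t toy_host_to_guest_tsc(uint64_t host_tsc, uint64_t ratio)
{
    return (uint64_t)(((unsigned __int128)host_tsc * ratio) >> 48);
}

/* Single conversion straight from host TSC to nanoseconds. */
static uint64_t toy_ns_direct(uint64_t host_tsc, uint64_t host_hz)
{
    return toy_scale_delta(host_tsc, toy_mul_for_hz(host_hz), -1);
}

/* Two conversions, the way the guest computes it: host TSC to guest TSC,
 * then guest TSC to nanoseconds with the guest's own mul/shift. */
static uint64_t toy_ns_as_guest(uint64_t host_tsc, uint64_t host_hz,
                                uint64_t guest_hz)
{
    uint64_t ratio = (uint64_t)(((unsigned __int128)guest_hz << 48) / host_hz);
    uint64_t guest_tsc = toy_host_to_guest_tsc(host_tsc, ratio);

    return toy_scale_delta(guest_tsc, toy_mul_for_hz(guest_hz), -1);
}
```

Feeding both functions the same host TSC count for one day (for example a 3 GHz host and a 2.4 GHz guest) gives two slightly different nanosecond values, because each fixed-point multiplier truncates differently. A timer deadline computed one way and observed the other way inherits that discrepancy.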
  2. 22 Feb, 2024 7 commits
  3. 20 Feb, 2024 11 commits
  4. 08 Feb, 2024 13 commits
  5. 07 Feb, 2024 4 commits
    • Merge tag 'loongarch-fixes-6.8-2' of... · 547ab8fc
      Linus Torvalds authored
      Merge tag 'loongarch-fixes-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
      
      Pull LoongArch fixes from Huacai Chen:
       "Fix acpi_core_pic[] array overflow, fix earlycon parameter if KASAN
        enabled, disable UBSAN instrumentation for vDSO build, and two Kconfig
        cleanups"
      
      * tag 'loongarch-fixes-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
        LoongArch: vDSO: Disable UBSAN instrumentation
        LoongArch: Fix earlycon parameter if KASAN enabled
        LoongArch: Change acpi_core_pic[NR_CPUS] to acpi_core_pic[MAX_CORE_PIC]
        LoongArch: Select HAVE_ARCH_SECCOMP to use the common SECCOMP menu
        LoongArch: Select ARCH_ENABLE_THP_MIGRATION instead of redefining it
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 5c24ba20
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
       "x86 guest:
      
         - Avoid false positive for check that only matters on AMD processors
      
        x86:
      
         - Give a hint when Win2016 might fail to boot due to XSAVES &&
           !XSAVEC configuration
      
         - Do not allow creating an in-kernel PIT unless an IOAPIC already
           exists
      
        RISC-V:
      
         - Allow ISA extensions that were enabled for bare metal in 6.8 (Zbc,
           scalar and vector crypto, Zfh[min], Zihintntl, Zvfh[min], Zfa)
      
        S390:
      
         - fix CC for successful PQAP instruction
      
         - fix a race when creating a shadow page"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        x86/coco: Define cc_vendor without CONFIG_ARCH_HAS_CC_PLATFORM
        x86/kvm: Fix SEV check in sev_map_percpu_data()
        KVM: x86: Give a hint when Win2016 might fail to boot due to XSAVES erratum
        KVM: x86: Check irqchip mode before create PIT
        KVM: riscv: selftests: Add Zfa extension to get-reg-list test
        RISC-V: KVM: Allow Zfa extension for Guest/VM
        KVM: riscv: selftests: Add Zvfh[min] extensions to get-reg-list test
        RISC-V: KVM: Allow Zvfh[min] extensions for Guest/VM
        KVM: riscv: selftests: Add Zihintntl extension to get-reg-list test
        RISC-V: KVM: Allow Zihintntl extension for Guest/VM
        KVM: riscv: selftests: Add Zfh[min] extensions to get-reg-list test
        RISC-V: KVM: Allow Zfh[min] extensions for Guest/VM
        KVM: riscv: selftests: Add vector crypto extensions to get-reg-list test
        RISC-V: KVM: Allow vector crypto extensions for Guest/VM
        KVM: riscv: selftests: Add scaler crypto extensions to get-reg-list test
        RISC-V: KVM: Allow scalar crypto extensions for Guest/VM
        KVM: riscv: selftests: Add Zbc extension to get-reg-list test
        RISC-V: KVM: Allow Zbc extension for Guest/VM
        KVM: s390: fix cc for successful PQAP
        KVM: s390: vsie: fix race during shadow creation
    • Merge tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux · c8d80f83
      Linus Torvalds authored
      Pull nfsd fix from Chuck Lever:
      
       - Address a deadlock regression in RELEASE_LOCKOWNER
      
      * tag 'nfsd-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
        nfsd: don't take fi_lock in nfsd_break_deleg_cb()
    • Merge tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 6d280f4d
      Linus Torvalds authored
      Pull btrfs fixes from David Sterba:
      
       - two fixes preventing deletion and manual creation of subvolume qgroup
      
       - unify error code returned for unknown send flags
      
       - fix assertion during subvolume creation when anonymous device could
         be allocated by other thread (e.g. due to backref walk)
      
      * tag 'for-6.8-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        btrfs: do not ASSERT() if the newly created subvolume already got read
        btrfs: forbid deleting live subvol qgroup
        btrfs: forbid creating subvol qgroups
        btrfs: send: return EOPNOTSUPP on unknown flags