1. 13 Dec, 2023 2 commits
    • KVM: SEV: Do not intercept accesses to MSR_IA32_XSS for SEV-ES guests · a26b7cd2
      Michael Roth authored
      When intercepts are enabled for MSR_IA32_XSS, the host will swap in/out
      the guest-defined values while context-switching to/from guest mode.
      However, in the case of SEV-ES, vcpu->arch.guest_state_protected is set,
      so the guest-defined value is effectively ignored when switching to
      guest mode with the understanding that the VMSA will handle swapping
      in/out this register state.
      
      However, SVM is still configured to intercept these accesses for SEV-ES
      guests, so the values in the initial MSR_IA32_XSS are effectively
      read-only, and a guest will experience undefined behavior if it actually
      tries to write to this MSR. Fortunately, only CET/shadowstack makes use
      of this register on SEV-ES-capable systems currently, which isn't yet
      widely used, but this may become more of an issue in the future.
      
      Additionally, enabling intercepts of MSR_IA32_XSS results in #VC
      exceptions in the guest in certain paths that can lead to unexpected #VC
      nesting levels. One example is SEV-SNP guests when handling #VC
      exceptions for CPUID instructions involving leaf 0xD, subleaf 0x1, since
      they will access MSR_IA32_XSS as part of servicing the CPUID #VC, then
      generate another #VC when accessing MSR_IA32_XSS, which can lead to
      guest crashes if an NMI occurs at that point in time. Running perf on a
      guest while it is issuing such a sequence is one example where these can
      be problematic.
      
      Address this by disabling intercepts of MSR_IA32_XSS for SEV-ES guests
      if the host/guest configuration allows it. If the host/guest
      configuration doesn't allow for MSR_IA32_XSS, leave it intercepted so
      that it can be caught by the existing checks in
      kvm_{set,get}_msr_common() if the guest still attempts to access it.
      
      Fixes: 376c6d28 ("KVM: SVM: Provide support for SEV-ES vCPU creation/loading")
      Cc: Alexey Kardashevskiy <aik@amd.com>
      Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20231016132819.1002933-4-michael.roth@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Fix dynamic generation of configuration names · e39120ab
      Paolo Bonzini authored
      When we dynamically generate a name for a configuration in get-reg-list
      we use strcat() to append to a buffer allocated using malloc() but we
      never initialise that buffer. Since malloc() offers no guarantees
      regarding the contents of the memory it returns this can lead to us
      corrupting, and likely overflowing, the buffer:
      
        vregs: PASS
        vregs+pmu: PASS
        sve: PASS
        sve+pmu: PASS
        vregs+pauth_address+pauth_generic: PASS
        X?vr+gspauth_addre+spauth_generi+pmu: PASS
      
      The bug is that strcat() should have been strcpy(), and that replacement
      would be enough to fix it, but there are other things in the function
      that leave something to be desired.  In particular, an (incorrectly)
      empty config would cause an out of bounds access to c->name[-1].
      Since the strcpy() call relies on c->name[0..len-1] being initialized,
      enforce that invariant throughout the function.
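
The invariant described above can be sketched as follows. This is a minimal illustration, not the selftest's actual code: `join_names()` and its parameters are hypothetical names. The buffer returned by malloc() is made a valid empty string before anything is appended, and an empty list never touches index -1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from get-reg-list): join `n` names with
 * '+' into a freshly allocated string. The caller frees the result.
 */
char *join_names(const char *const *names, size_t n)
{
	size_t len = 1;		/* room for the terminating NUL */
	size_t used = 0;
	size_t i;
	char *buf;

	for (i = 0; i < n; i++)
		len += strlen(names[i]) + 1;	/* name plus one separator */

	buf = malloc(len);
	if (!buf)
		return NULL;

	/* malloc() memory is uninitialized: make it a valid string first. */
	buf[0] = '\0';

	for (i = 0; i < n; i++) {
		size_t l = strlen(names[i]);

		if (i > 0)
			buf[used++] = '+';
		memcpy(buf + used, names[i], l);
		used += l;
		buf[used] = '\0';	/* buf[0..used] stays initialized */
	}

	assert(used < len);
	return buf;
}
```

With an empty list the function returns an empty string instead of reading or writing before the start of the buffer.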
      
      Fixes: 2f9ace5d ("KVM: arm64: selftests: get-reg-list: Introduce vcpu configs")
      Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
      Co-developed-by: Mark Brown <broonie@kernel.org>
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Message-Id: <20231211-kvm-get-reg-list-str-init-v3-1-6554c71c77b1@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 08 Dec, 2023 6 commits
    • KVM: SVM: Update EFER software model on CR0 trap for SEV-ES · 4cdf351d
      Sean Christopherson authored
      In general, activating long mode involves setting the EFER_LME bit in
      the EFER register and then enabling the X86_CR0_PG bit in the CR0
      register. At this point, the EFER_LMA bit will be set automatically by
      hardware.
      
      In the case of SVM/SEV guests where writes to CR0 are intercepted, it's
      necessary for the host to set EFER_LMA on behalf of the guest since
      hardware does not see the actual CR0 write.
      
      In the case of SEV-ES guests where writes to CR0 are trapped instead of
      intercepted, the hardware *does* see/record the write to CR0 before
      exiting and passing the value on to the host, so as part of enabling
      SEV-ES support commit f1c6366e ("KVM: SVM: Add required changes to
      support intercepts under SEV-ES") dropped special handling of the
      EFER_LMA bit with the understanding that it would be set automatically.
      
      However, since the guest never explicitly sets the EFER_LMA bit, the
      host never becomes aware that it has been set. This becomes problematic
      when userspace tries to get/set the EFER values via
      KVM_GET_SREGS/KVM_SET_SREGS, since the EFER contents tracked by the host
      will be missing the EFER_LMA bit, and when userspace attempts to pass
      the EFER value back via KVM_SET_SREGS it will fail a sanity check that
      asserts that EFER_LMA should always be set when X86_CR0_PG and EFER_LME
      are set.
      
      Fix this by always inferring the value of EFER_LMA based on X86_CR0_PG
      and EFER_LME, regardless of whether or not SEV-ES is enabled.
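
A simplified model of that inference, assuming only the architectural bit positions (CR0.PG is bit 31, EFER.LME is bit 8, EFER.LMA is bit 10). The helper names `efer_infer_lma()` and `efer_passes_sanity_check()` are hypothetical and are not KVM functions; the sanity check models only the condition described in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions. */
#define X86_CR0_PG	(1ULL << 31)	/* CR0.PG: paging enabled */
#define EFER_LME	(1ULL << 8)	/* EFER.LME: long mode enable */
#define EFER_LMA	(1ULL << 10)	/* EFER.LMA: long mode active */

/* Hypothetical model of the check: PG and LME together require LMA. */
bool efer_passes_sanity_check(uint64_t efer, uint64_t cr0)
{
	if ((cr0 & X86_CR0_PG) && (efer & EFER_LME))
		return (efer & EFER_LMA) != 0;
	return true;
}

/* Hypothetical helper: derive LMA purely from CR0.PG and EFER.LME. */
uint64_t efer_infer_lma(uint64_t efer, uint64_t cr0)
{
	if ((cr0 & X86_CR0_PG) && (efer & EFER_LME))
		efer |= EFER_LMA;
	else
		efer &= ~EFER_LMA;

	assert(efer_passes_sanity_check(efer, cr0));
	return efer;
}
```

A software EFER value that has LME set but lacks LMA while paging is on fails the modeled check; passing it through `efer_infer_lma()` first makes it pass.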
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Reported-by: Peter Gonda <pgonda@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210507165947.2502412-2-seanjc@google.com>
      [A two year old patch that was revived after we noticed the failure in
       KVM_SET_SREGS and a similar patch was posted by Michael Roth.  This is
       Sean's patch, but with Michael's more complete commit message. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: add -MP to CFLAGS · 96f12401
      David Woodhouse authored
      Using -MD without -MP causes build failures when a header file is deleted
      or moved. With -MP, the compiler will emit phony targets for the header
      files it lists as dependencies, and the Makefiles won't refuse to attempt
      to rebuild a C unit which no longer includes the deleted header.

      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Link: https://lore.kernel.org/r/9fc8b5395321abbfcaf5d78477a9a7cd350b08e4.camel@infradead.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: selftests: Actually print out magic token in NX hugepages skip message · 4a073e81
      angquan yu authored
      Pass MAGIC_TOKEN to __TEST_REQUIRE() when printing the help message about
      needing to pass a magic value to manually run the NX hugepages test,
      otherwise the help message will contain garbage.
      
        In file included from x86_64/nx_huge_pages_test.c:15:
        x86_64/nx_huge_pages_test.c: In function ‘main’:
        include/test_util.h:40:32: error: format ‘%d’ expects a matching ‘int’ argument [-Werror=format=]
           40 |                 ksft_exit_skip("- " fmt "\n", ##__VA_ARGS__);   \
              |                                ^~~~
        x86_64/nx_huge_pages_test.c:259:9: note: in expansion of macro ‘__TEST_REQUIRE’
          259 |         __TEST_REQUIRE(token == MAGIC_TOKEN,
              |         ^~~~~~~~~~~~~~
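
The class of bug in the warning above can be sketched with a small stand-in macro. `FORMAT_SKIP()`, `format_token_hint()` and `EXAMPLE_TOKEN` are hypothetical names chosen for illustration and are not the selftest's definitions; `##__VA_ARGS__` is a GNU extension accepted by GCC and Clang.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical token value, used only for this illustration. */
#define EXAMPLE_TOKEN 1234

/* Hypothetical stand-in for a skip-message macro that forwards its args. */
#define FORMAT_SKIP(buf, size, fmt, ...) \
	snprintf(buf, size, "- " fmt "\n", ##__VA_ARGS__)

int format_token_hint(char *buf, size_t size)
{
	/*
	 * The "%d" conversion needs a matching int argument. Leaving out
	 * EXAMPLE_TOKEN here would trigger the -Wformat diagnostic and
	 * print garbage at run time.
	 */
	int n = FORMAT_SKIP(buf, size, "run with magic token %d", EXAMPLE_TOKEN);

	assert(n >= 0);
	return n;
}
```
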
      Signed-off-by: angquan yu <angquan21@gmail.com>
      Link: https://lore.kernel.org/r/20231128221105.63093-1-angquan21@gmail.com
      [sean: rewrite shortlog+changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Merge tag 'kvm-x86-fixes-6.7-rcN' of https://github.com/kvm-x86/linux into kvm-master · 6254eeba
      Paolo Bonzini authored
      KVM fixes for 6.7-rcN:
      
       - When checking if a _running_ vCPU is "in-kernel", i.e. running at CPL0,
         get the CPL directly instead of relying on preempted_in_kernel, which
         is valid if and only if the vCPU was preempted, i.e. NOT running.
      
       - Set .owner for various KVM file_operations so that files refcount the
         KVM module until KVM is done executing _all_ code, including the last
         few instructions of kvm_put_kvm().  And then revert the misguided
         attempt to rely on "struct kvm" refcounts to pin KVM-the-module.
      
       - Fix a benign "return void" that was recently introduced.
    • Merge tag 'kvm-s390-master-6.7-1' of https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master · aa0ae3df
      Paolo Bonzini authored
      
      Two small but important bugfixes.
    • Merge tag 'kvmarm-fixes-6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master · c8a11a93
      Paolo Bonzini authored
      
      KVM/arm64 fixes for 6.7, take #1
      
       - Avoid mapping vLPIs that have already been mapped
  3. 03 Dec, 2023 3 commits
  4. 02 Dec, 2023 5 commits
    • Merge tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 1b8af655
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix corruption of f0/vs0 during FP/Vector save, seen as userspace
         crashes when using io-uring workers (in particular with MariaDB)
      
       - Fix KVM_RUN potentially clobbering all host userspace FP/Vector
         registers
      
      Thanks to Timothy Pearson, Jens Axboe, and Nicholas Piggin.
      
      * tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        KVM: PPC: Book3S HV: Fix KVM_RUN clobbering FP/VEC user registers
        powerpc: Don't clobber f0/vs0 during fp|altivec register save
    • Merge tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio · 17b17be2
      Linus Torvalds authored
      Pull vfio fixes from Alex Williamson:
      
       - Fix the lifecycle of a mutex in the pds variant driver such that a
         reset prior to opening the device won't find it uninitialized.
         Implement the release path to symmetrically destroy the mutex. Also
         switch a different lock from spinlock to mutex as the code path has
         the potential to sleep and doesn't need the spinlock context
         otherwise (Brett Creeley)
      
       - Fix an issue detected via randconfig where KVM tries to symbol_get an
         undeclared function. The symbol is temporarily declared
         unconditionally here, which resolves the problem and avoids churn
         relative to a series pending for the next merge window which resolves
         some of this symbol ugliness, but also fixes Kconfig dependencies
         (Sean Christopherson)
      
      * tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio:
        vfio: Drop vfio_file_iommu_group() stub to fudge around a KVM wart
        vfio/pds: Fix possible sleep while in atomic context
        vfio/pds: Fix mutex lock->magic != lock warning
    • Merge tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · deb4b9dd
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - A fix for the Xen event driver setting the correct return value when
         experiencing an allocation failure
      
       - A fix for allocating space for a struct in the percpu area to not
         cross page boundaries (this one is for x86, a similar one for Arm was
         already in the pull request for rc3)
      
      * tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/events: fix error code in xen_bind_pirq_msi_to_irq()
        x86/xen: fix percpu vcpu_info allocation
    • Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 669fc834
      Linus Torvalds authored
      
      Pull probes fixes from Masami Hiramatsu:
      
       - objpool: Fix an objpool overrun caused by memory/cache access
         delays, seen especially on big.LITTLE SoCs. The objpool loop works
         on a local copy of the object slot index, but the slot index can
         be changed by another processor in parallel. In that case, the
         difference between the local copy of 'head' and the 'slot->last'
         index can exceed the local slot size, so slot::head must be
         re-read to update the local copy.

       - kretprobe: Use the appropriate RCU API for the kretprobe holder.
         Since kretprobe_holder::rp is RCU managed, it should be accessed
         with rcu_assign_pointer() and rcu_dereference_check(). Also add
         the __rcu tag so that sparse can flag incorrect usage.

       - rethook: Use the appropriate RCU API for rethook::handler. Like
         kretprobe_holder::rp, rethook::handler is RCU managed and should
         be accessed with rcu_assign_pointer() and rcu_dereference_check().
         This also adds the __rcu tag so that sparse can flag incorrect
         usage.
      
      * tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        rethook: Use __rcu pointer for rethook::handler
        kprobes: consistent rcu api usage for kretprobe holder
        lib: objpool: fix head overrun on RK3588 SBC
    • Merge tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 815fb87b
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These fix issues in two cpufreq drivers, in the AMD P-state driver and
        in the power-capping DTPM framework.
      
        Specifics:
      
         - Fix the AMD P-state driver's EPP sysfs interface in the cases when
           the performance governor is in use (Ayush Jain)
      
         - Make the ->fast_switch() callback in the AMD P-state driver return
           the target frequency as expected (Gautham R. Shenoy)
      
         - Allow user space to control the range of frequencies to use via
           scaling_min_freq and scaling_max_freq when AMD P-state driver is in
           use (Wyes Karny)
      
         - Prevent power domains needed for wakeup signaling from being turned
           off during system suspend on Qualcomm systems and prevent
           performance states votes from runtime-suspended devices from being
           lost across a system suspend-resume cycle in qcom-cpufreq-nvmem
           (Stephan Gerhold)
      
         - Fix disabling the 792 Mhz OPP in the imx6q cpufreq driver for the
           i.MX6ULL types that can run at that frequency (Christoph
           Niedermaier)
      
         - Eliminate unnecessary and harmful conversions to uW from the DTPM
           (dynamic thermal and power management) framework (Lukasz Luba)"
      
      * tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq/amd-pstate: Only print supported EPP values for performance governor
        cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update
        powercap: DTPM: Fix unneeded conversions to micro-Watts
        cpufreq/amd-pstate: Fix the return value of amd_pstate_fast_switch()
        pmdomain: qcom: rpmpd: Set GENPD_FLAG_ACTIVE_WAKEUP
        cpufreq: qcom-nvmem: Preserve PM domain votes in system suspend
        cpufreq: qcom-nvmem: Enable virtual power domain devices
        cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
  5. 01 Dec, 2023 24 commits
    • Merge tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · ce474ae7
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "This fixes a recently introduced build issue on ARM32 and a NULL
        pointer dereference in the ACPI backlight driver due to a design issue
        exposed by a recent change in the ACPI bus type code.
      
        Specifics:
      
         - Fix a recently introduced build issue on ARM32 platforms caused by
           an inadvertent header file breakage (Dave Jiang)
      
         - Eliminate questionable usage of acpi_driver_data() in the ACPI
           backlight cooling device code that leads to NULL pointer
           dereferences after recent ACPI core changes (Hans de Goede)"
      
      * tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: video: Use acpi_video_device for cooling-dev driver data
        ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
    • Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 35f84584
      Linus Torvalds authored
      Pull arm64 fix from Catalin Marinas:
       "Fix a regression where the arm64 KPTI ends up enabled even on systems
        that don't need it"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Avoid enabling KPTI unnecessarily
    • Merge tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 1a2b4185
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Fix race conditions in device probe path
      
       - Handle ERR_PTR() returns in __iommu_domain_alloc() path
      
       - Update MAINTAINERS entry for Qualcomm IOMMUs
      
       - Printk argument fix in device tree specific code
      
       - Several Intel VT-d fixes from Lu Baolu:
           - Do not support enforcing cache coherency for non-empty domains
           - Avoid devTLB invalidation if iommu is off
           - Disable PCI ATS in legacy passthrough mode
           - Support non-PCI devices when clearing context
           - Fix incorrect cache invalidation for mm notification
           - Add MTL to quirk list to skip TE disabling
           - Set variable intel_dirty_ops to static
      
      * tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu: Fix printk arg in of_iommu_get_resv_regions()
        iommu/vt-d: Set variable intel_dirty_ops to static
        iommu/vt-d: Fix incorrect cache invalidation for mm notification
        iommu/vt-d: Add MTL to quirk list to skip TE disabling
        iommu/vt-d: Make context clearing consistent with context mapping
        iommu/vt-d: Disable PCI ATS in legacy passthrough mode
        iommu/vt-d: Omit devTLB invalidation requests when TES=0
        iommu/vt-d: Support enforce_cache_coherency only for empty domains
        iommu: Avoid more races around device probe
        MAINTAINERS: list all Qualcomm IOMMU drivers in the QUALCOMM IOMMU entry
        iommu: Flow ERR_PTR out from __iommu_domain_alloc()
    • Merge tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 06a3c59f
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "No surprise here, including only a collection of HD-audio
        device-specific small fixes"
      
      * tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda: Disable power-save on KONTRON SinglePC
        ALSA: hda/realtek: Add supported ALC257 for ChromeOS
        ALSA: hda/realtek: Headset Mic VREF to 100%
        ALSA: hda: intel-nhlt: Ignore vbps when looking for DMIC 32 bps format
        ALSA: hda: cs35l56: Enable low-power hibernation mode on SPI
        ALSA: cs35l41: Fix for old systems which do not support command
        ALSA: hda: cs35l41: Remove unnecessary boolean state variable firmware_running
        ALSA: hda - Fix speaker and headset mic pin config for CHUWI CoreBook XPro
    • Merge tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm · b1e51588
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Weekly fixes, mostly amdgpu fixes with a scattering of nouveau, i915,
        and a couple of reverts. Hopefully it will quieten down in coming
        weeks.
      
        drm:
         - Revert unexport of prime helpers for fd/handle conversion
      
        dma_resv:
         - Do not double add fences in dma_resv_add_fence.
      
        gpuvm:
         - Fix GPUVM license identifier.
      
        i915:
         - Mark internal GSC engine with reserved uabi class
         - Take VGA converters into account in eDP probe
         - Fix intel_pre_plane_updates() call to ensure workarounds get applied
      
        panel:
         - Revert panel fixes as they require exporting device_is_dependent.
      
        nouveau:
         - fix oversized allocations in new vm path
         - fix zero-length array
         - remove a stray lock
      
        nt36523:
         - Fix error check for nt36523.
      
        amdgpu:
         - DMUB fix
         - DCN 3.5 fixes
         - XGMI fix
         - DCN 3.2 fixes
         - Vangogh suspend fix
         - NBIO 7.9 fix
         - GFX11 golden register fix
         - Backlight fix
         - NBIO 7.11 fix
         - IB test overflow fix
         - DCN 3.1.4 fixes
         - fix a runtime pm ref count
         - Retimer fix
         - ABM fix
         - DCN 3.1.5 fix
         - Fix AGP addressing
         - Fix possible memory leak in SMU error path
         - Make sure PME is enabled in D3
         - Fix possible NULL pointer dereference in debugfs
         - EEPROM fix
         - GC 9.4.3 fix
      
        amdkfd:
         - IP version check fix
         - Fix memory leak in pqm_uninit()"
      
      * tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm: (53 commits)
        Revert "drm/prime: Unexport helpers for fd/handle conversion"
        drm/amdgpu: Use another offset for GC 9.4.3 remap
        drm/amd/display: Fix some HostVM parameters in DML
        drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
        drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
        drm/amd/display: Allow DTBCLK disable for DCN35
        drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer
        drm/amd: Enable PCIe PME from D3
        drm/amd/pm: fix a memleak in aldebaran_tables_init
        drm/amdgpu: fix AGP addressing when GART is not at 0
        drm/amd/display: update dcn315 lpddr pstate latency
        drm/amd/display: fix ABM disablement
        drm/amd/display: Fix black screen on video playback with embedded panel
        drm/amd/display: Fix conversions between bytes and KB
        drm/amdkfd: Use common function for IP version check
        drm/amd/display: Remove config update
        drm/amd/display: Update DCN35 clock table policy
        drm/amd/display: force toggle rate wa for first link training for a retimer
        drm/amdgpu: correct the amdgpu runtime dereference usage count
        drm/amd/display: Update min Z8 residency time to 2100 for DCN314
        ...
    • Merge tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux · c9a925b7
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - Fix an issue with discontig page checking for IORING_SETUP_NO_MMAP
      
       - Fix an issue where IORING_SETUP_NO_MMAP also disallowed mmap'ed
         buffer rings
      
       - Fix an issue with deferred release of memory mapped pages
      
       - Fix a lockdep issue with IORING_SETUP_NO_MMAP
      
       - Use fget/fput consistently, even from our sync system calls. No real
         issue here, but if we were ever to allow closing io_uring descriptors
         it would be required. Let's play it safe and just use the full ref
         counted versions upfront. Most uses of io_uring are threaded anyway,
         and hence already doing the full version underneath.
      
      * tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux:
        io_uring: use fget/fput consistently
        io_uring: free io_buffer_list entries via RCU
        io_uring/kbuf: prune deferred locked cache when tearing down
        io_uring/kbuf: recycle freed mapped buffer ring entries
        io_uring/kbuf: defer release of mapped buffer rings
        io_uring: enable io_mem_alloc/free to be used in other parts
        io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP
        io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP
    • Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux · ee0c8a9b
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - NVMe pull request via Keith:
           - Invalid namespace identification error handling (Marizio Ewan,
             Keith)
           - Fabrics keep-alive tuning (Mark)
      
       - Fix for a bad error check regression in bcache (Markus)
      
       - Fix for a performance regression with O_DIRECT (Ming)
      
       - Fix for a flush related deadlock (Ming)
      
       - Make the read-only warn on per-partition (Yu)
      
      * tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux:
        nvme-core: check for too small lba shift
        blk-mq: don't count completed flush data request as inflight in case of quiesce
        block: Document the role of the two attribute groups
        block: warn once for each partition in bio_check_ro()
        block: move .bd_inode into 1st cacheline of block_device
        nvme: check for valid nvme_identify_ns() before using it
        nvme-core: fix a memory leak in nvme_ns_info_from_identify()
        nvme: fine-tune sending of first keep-alive
        bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
    • Merge tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm · abd792f3
      Linus Torvalds authored
      
      Pull device mapper fixes from Mike Snitzer:
      
       - Fix DM verity target's FEC support to always initialize IO before it
         frees it. Also fix alignment of struct dm_verity_fec_io within the
         per-bio-data
      
       - Fix DM verity target to not FEC failed readahead IO
      
       - Update DM flakey target to use MAX_ORDER rather than MAX_ORDER - 1
      
      * tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm-flakey: start allocating with MAX_ORDER
        dm-verity: align struct dm_verity_fec_io properly
        dm verity: don't perform FEC for failed readahead IO
        dm verity: initialize fec io before freeing it
    • Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · ff4a9f49
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Three small fixes, one in drivers.
      
        The core changes are to the internal representation of flags in
        scsi_devices which removes space wasting bools in favour of single bit
        flags and to add a flag to force a runtime resume which is used by ATA
        devices"
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: sd: Fix system start for ATA devices
        scsi: Change SCSI device boolean fields to single bit flags
        scsi: ufs: core: Clear cmd if abort succeeds in MCQ mode
    • Merge tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · c1c09da0
      Linus Torvalds authored
      Pull ext2 fix from Jan Kara:
       "Fix an ext2 bug introduced by changes in ext2 & iomap stepping on each
        other's toes (apparently the ext2 driver does not get much testing in
        linux-next)"
      
      * tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        ext2: Fix ki_pos update for DIO buffered-io fallback case
    • Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs · e6861be4
      Linus Torvalds authored
      Pull more bcachefs bugfixes from Kent Overstreet:
      
       - bcache & bcachefs were broken with CFI enabled; patch for closures to
         fix type punning
      
       - mark erasure coding as extra-experimental; there are incompatible
         disk space accounting changes coming for erasure coding, and I'm
         still seeing checksum errors in some tests
      
       - several fixes for durability-related issues (durability is a device
         specific setting where we can tell bcachefs that data on a given
         device should be counted as replicated x times)
      
       - a fix for a rare livelock when a btree node merge then updates a
         parent node that is almost full
      
       - fix a race in the device removal path, where dropping a pointer in a
         btree node to a device would be clobbered by an in flight btree write
         updating the btree node key on completion
      
       - fix one SRCU lock hold time warning in the btree gc code - there's
         still a bunch more of these to fix
      
       - fix a rare race where we'd start copygc before initializing the "are
         we rw" percpu refcount; copygc would think we were already ro and die
         immediately
      
      * tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits)
        bcachefs: Extra kthread_should_stop() calls for copygc
        bcachefs: Convert gc_alloc_start() to for_each_btree_key2()
        bcachefs: Fix race between btree writes and metadata drop
        bcachefs: move journal seq assertion
        bcachefs: -EROFS doesn't count as move_extent_start_fail
        bcachefs: trace_move_extent_start_fail() now includes errcode
        bcachefs: Fix split_race livelock
        bcachefs: Fix bucket data type for stripe buckets
        bcachefs: Add missing validation for jset_entry_data_usage
        bcachefs: Fix zstd compress workspace size
        bcachefs: bpos is misaligned on big endian
        bcachefs: Fix ec + durability calculation
        bcachefs: Data update path won't accidentaly grow replicas
        bcachefs: deallocate_extra_replicas()
        bcachefs: Proper refcounting for journal_keys
        bcachefs: preserve device path as device name
        bcachefs: Fix an endianness conversion
        bcachefs: Start gc, copygc, rebalance threads after initing writes ref
        bcachefs: Don't stop copygc thread on device resize
        bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops
        ...
      e6861be4
    • Rafael J. Wysocki's avatar
      Merge branch 'acpi-tables' · 7d4c44a5
      Rafael J. Wysocki authored
      Merge a fix for a recently introduced build issue on ARM32 platforms
      caused by an inadvertent header file breakage (Dave Jiang).
      
      * acpi-tables:
        ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
      7d4c44a5
    • Rafael J. Wysocki's avatar
      Merge branch 'powercap' · a6b31256
      Rafael J. Wysocki authored
      Merge a power capping fix for 6.7-rc4 which eliminates unnecessary
      and harmful conversions to uW from the DTPM (dynamic thermal and power
      management) framework (Lukasz Luba).
      
      * powercap:
        powercap: DTPM: Fix unneeded conversions to micro-Watts
      a6b31256
    • Like Xu's avatar
      KVM: x86: Remove 'return void' expression for 'void function' · ef8d8903
      Like Xu authored
      The requested info will be stored in 'guest_xsave->region' referenced by
      the incoming pointer "struct kvm_xsave *guest_xsave", thus there is no need
      to explicitly use return void expression for a void function "static void
      kvm_vcpu_ioctl_x86_get_xsave(...)". The issue is caught with [-Wpedantic].
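
The pattern described above can be illustrated outside the kernel. This is a minimal sketch, not the KVM code: fill_region(), get_region_pedantic() and get_region() are hypothetical names, and the flagged variant is kept under #if 0 so the file builds cleanly with -Wpedantic.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a helper that fills a caller-supplied buffer. */
static void fill_region(unsigned char *region, size_t size)
{
    memset(region, 0xAB, size);
}

#if 0
/*
 * Flagged under -Wpedantic: a 'return' with an expression in a function
 * whose return type is void, even though the expression itself is void.
 */
static void get_region_pedantic(unsigned char *region, size_t size)
{
    return fill_region(region, size);
}
#endif

/* Same behavior without the return expression: the result lives in 'region'. */
static void get_region(unsigned char *region, size_t size)
{
    fill_region(region, size);
}
```

Since the caller only reads the buffer it passed in, dropping the return expression changes nothing at the call site.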
      
      Fixes: 2d287ec65e79 ("x86/fpu: Allow caller to constrain xfeatures when copying to uabi buffer")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Link: https://lore.kernel.org/r/20231007064019.17472-1-likexu@tencent.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      ef8d8903
    • Sean Christopherson's avatar
      Revert "KVM: Prevent module exit until all VMs are freed" · ea61294b
      Sean Christopherson authored
      Revert KVM's misguided attempt to "fix" a use-after-module-unload bug that
      was actually due to failure to flush a workqueue, not a lack of module
      refcounting.  Pinning the KVM module until kvm_vm_destroy() doesn't
      prevent use-after-free due to the module being unloaded, as userspace can
      invoke delete_module() the instant the last reference to KVM is put, i.e.
      can cause all KVM code to be unmapped while KVM is actively executing said
      code.
      
      Generally speaking, the many instances of module_put(THIS_MODULE)
      notwithstanding, outside of a few special paths, a module can never safely
      put the last reference to itself without creating deadlock, i.e. something
      external to the module *must* put the last reference.  In other words,
      having VMs grab a reference to the KVM module is futile, pointless, and as
      evidenced by the now-reverted commit 70375c2d ("Revert "KVM: set owner
      of cpu and vm file operations""), actively dangerous.
      
      This reverts commit 405294f2 and commit
      5f6de5cb.
      
      Fixes: 405294f2 ("KVM: Unconditionally get a ref to /dev/kvm module when creating a VM")
      Fixes: 5f6de5cb ("KVM: Prevent module exit until all VMs are freed")
      Link: https://lore.kernel.org/r/20231018204624.1905300-4-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      ea61294b
    • Sean Christopherson's avatar
      KVM: Set file_operations.owner appropriately for all such structures · 087e1520
      Sean Christopherson authored
      Set .owner for all KVM-owned file types so that the KVM module is pinned
      until any files with callbacks back into KVM are completely freed.  Using
      "struct kvm" as a proxy for the module, i.e. keeping KVM-the-module alive
      while there are active VMs, doesn't provide full protection.
      
      Userspace can invoke delete_module() the instant the last reference to KVM
      is put.  If KVM itself puts the last reference, e.g. via kvm_destroy_vm(),
      then it's possible for KVM to be preempted and deleted/unloaded before KVM
      fully exits, e.g. when the task running kvm_destroy_vm() is scheduled back
      in, it will jump to a code page that is no longer mapped.
      
      Note, file types that can call into sub-module code, e.g. kvm-intel.ko or
      kvm-amd.ko on x86, must use the module pointer passed to kvm_init(), not
      THIS_MODULE (which points at kvm.ko).  KVM assumes that if /dev/kvm is
      reachable, e.g. VMs are active, then the vendor module is loaded.
      
      To reduce the probability of forgetting to set .owner entirely, use
      THIS_MODULE for stats files where KVM does not call back into vendor code.
      
      This reverts commit 70375c2d, and fixes
      several other file types that have been buggy since their introduction.
      
      Fixes: 70375c2d ("Revert "KVM: set owner of cpu and vm file operations"")
      Fixes: 3bcd0662 ("KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file")
      Reported-by: Al Viro <viro@zeniv.linux.org.uk>
      Link: https://lore.kernel.org/all/20231010003746.GN800259@ZenIV
      Link: https://lore.kernel.org/r/20231018204624.1905300-2-seanjc@google.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      087e1520
    • Jens Axboe's avatar
      Merge tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme into block-6.7 · 8ad3ac92
      Jens Axboe authored
      Pull NVMe fixes from Keith:
      
      "nvme fixes for Linux 6.7
      
       - Invalid namespace identification error handling (Marizio Ewan, Keith)
       - Fabrics keep-alive tuning (Mark)"
      
      * tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme:
        nvme-core: check for too small lba shift
        nvme: check for valid nvme_identify_ns() before using it
        nvme-core: fix a memory leak in nvme_ns_info_from_identify()
        nvme: fine-tune sending of first keep-alive
      8ad3ac92
    • Keith Busch's avatar
      nvme-core: check for too small lba shift · 74fbc88e
      Keith Busch authored
      The block layer doesn't support logical block sizes smaller than 512
      bytes. The nvme spec doesn't support that small either, but the driver
      isn't checking to make sure the device responded with usable data.
      Failing to catch this will result in a kernel bug, either from a
      division by zero when stacking, or a zero length bio.
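
A minimal sketch of the kind of check described above, assuming a device-reported shift value is validated before it is used. validate_lba_shift() and SKETCH_SECTOR_SHIFT are hypothetical names, not the driver's identifiers, and the upper bound of 31 is only an assumption of this sketch to keep the shift well defined.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9 /* log2(512): smallest supported block size */

/*
 * Returns 0 and stores the logical block size in bytes when the shift is
 * usable, or -1 when the device reported a shift that is too small (or,
 * as an assumption of this sketch, too large for a 32-bit size).
 */
static int validate_lba_shift(uint8_t lba_shift, uint32_t *block_size)
{
    if (lba_shift < SKETCH_SECTOR_SHIFT || lba_shift > 31)
        return -1;

    *block_size = UINT32_C(1) << lba_shift;
    return 0;
}
```

Rejecting the value up front means later code never sees a block size below 512 bytes.
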
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      74fbc88e
    • Ming Lei's avatar
      blk-mq: don't count completed flush data request as inflight in case of quiesce · 0e4237ae
      Ming Lei authored
      Request queue quiesce may interrupt a flush sequence: the original request
      may already have been marked as COMPLETE, but it can't be finished because
      the queue is quiesced.
      
      This is fine from the driver's viewpoint, because the flush sequence is a
      block layer concept and is unrelated to the driver.
      
      However, a driver (such as dm-rq) can call blk_mq_queue_inflight() to count
      and drain inflight requests. The wait and drain then never finishes, because
      the completed but not yet finished flush request is counted as inflight.
      
      Fix this issue by not counting a completed flush data request as inflight
      when the queue is quiesced.
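
The counting rule described above can be sketched in plain C. This is an illustration under stated assumptions, not the block layer code: struct req_sketch, counts_as_inflight() and count_inflight() are hypothetical, and the three booleans stand in for the request state the block layer actually tracks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a request's state. */
struct req_sketch {
    bool started;    /* request has been started */
    bool completed;  /* marked COMPLETE, possibly not yet finished */
    bool flush_data; /* data request that is part of a flush sequence */
};

/* A started request is inflight unless it is a completed flush data
 * request stuck behind a quiesced queue. */
static bool counts_as_inflight(const struct req_sketch *rq, bool queue_quiesced)
{
    if (!rq->started)
        return false;
    if (queue_quiesced && rq->flush_data && rq->completed)
        return false;
    return true;
}

static size_t count_inflight(const struct req_sketch *rqs, size_t n,
                             bool queue_quiesced)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++)
        if (counts_as_inflight(&rqs[i], queue_quiesced))
            count++;
    return count;
}
```

With this rule, a drain loop that waits for the count to reach zero is no longer blocked by a request that cannot finish until the queue is unquiesced.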
      
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: John Pittman <jpittman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0e4237ae
    • Daniel Mentz's avatar
      iommu: Fix printk arg in of_iommu_get_resv_regions() · c2183b3d
      Daniel Mentz authored
      The variable phys is defined as (struct resource *) which aligns with
      the printk format specifier %pr. Taking the address of it results in a
      value of type (struct resource **) which is incompatible with the format
      specifier %pr. Therefore, remove the address of operator (&).
      
      Fixes: a5bf3cfc ("iommu: Implement of_iommu_get_resv_regions()")
      Signed-off-by: Daniel Mentz <danielmentz@google.com>
      Acked-by: Thierry Reding <treding@nvidia.com>
      Link: https://lore.kernel.org/r/20231108062226.928985-1-danielmentz@google.com
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      c2183b3d
    • Masami Hiramatsu (Google)'s avatar
      rethook: Use __rcu pointer for rethook::handler · a1461f1f
      Masami Hiramatsu (Google) authored
      Since rethook::handler is an RCU-managed pointer that tells readers
      whether the rethook has been stopped (unregistered), it should be an
      __rcu pointer and be accessed with the appropriate functions, which
      use the appropriate memory barriers. OTOH, rethook::data is never
      changed, so we don't need to check it in get_kretprobe().
      
      NOTE: To avoid sparse warning, rethook::handler is defined by a raw
      function pointer type with __rcu instead of rethook_handler_t.
      
      Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/
      
      Fixes: 54ecbe6f ("rethook: Add a generic return hook")
      Cc: stable@vger.kernel.org
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/
      Tested-by: JP Kobryn <inwardvessel@gmail.com>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      a1461f1f
    • JP Kobryn's avatar
      kprobes: consistent rcu api usage for kretprobe holder · d839a656
      JP Kobryn authored
      It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
      RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
      The thought behind this patch is to make use of the RCU API where possible
      when accessing this pointer so that the needed barriers are always in place
      and to self-document the code.
      
      The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
      done to the "rp" pointer are changed to make use of the RCU macro for
      assignment. For the single read, the implementation of get_kretprobe()
      is simplified by making use of an RCU macro which accomplishes the same,
      but note that the log warning text will be more generic.
      
      I did find that there is a difference in assembly generated between the
      usage of the RCU macros vs without. For example, on arm64, when using
      rcu_assign_pointer(), the corresponding store instruction is a
      store-release (STLR) which has an implicit barrier. When normal assignment
      is done, a regular store (STR) is found. In the macro case, this seems to
      be a result of rcu_assign_pointer() using smp_store_release() when the
      value to write is not NULL.
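
A userspace analogy of the store-release behavior described above, written with C11 atomics rather than the kernel's RCU macros. struct probe, struct holder, holder_publish() and holder_get() are hypothetical names. The runtime NULL check only approximates what the commit describes, and the acquire load is a stronger stand-in for a dependency-ordered RCU read.

```c
#include <stdatomic.h>
#include <stddef.h>

struct probe { int id; };            /* hypothetical payload */

struct holder {
    _Atomic(struct probe *) rp;      /* pointer published to readers */
};

/* Publish a pointer: a non-NULL value uses a release store so that the
 * pointee's initialization is visible before the pointer is. Storing
 * NULL publishes no data, so a relaxed store is used here. */
static void holder_publish(struct holder *h, struct probe *p)
{
    if (p)
        atomic_store_explicit(&h->rp, p, memory_order_release);
    else
        atomic_store_explicit(&h->rp, (struct probe *)NULL,
                              memory_order_relaxed);
}

/* Read side: pairs with the release store in holder_publish(). */
static struct probe *holder_get(struct holder *h)
{
    return atomic_load_explicit(&h->rp, memory_order_acquire);
}
```

On arm64 a release store like the one above is typically compiled to STLR, while a plain or relaxed store becomes a regular STR, which matches the difference in generated assembly noted in the commit message.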
      
      Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/
      
      Fixes: d741bf41 ("kprobes: Remove kretprobe hash")
      Cc: stable@vger.kernel.org
      Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      d839a656
    • wuqiang.matt's avatar
      lib: objpool: fix head overrun on RK3588 SBC · d67f39d2
      wuqiang.matt authored
      objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
      following kernel warnings:
      
          WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100
      
      This message is from objpool.c:168:
      
          WARN_ON_ONCE(tail - head > pool->nr_objs);
      
      The overrun test case validates the situation where pre-allocated
      objects are insufficient: 8 objects are pre-allocated for each node and
      the consumer thread on each node tries to grab 16 objects in a row. The
      test system is an OrangePi 5+ with RK3588, a big.LITTLE SoC with 4x A76
      and 4x A55. With either all 4 big or all 4 little cores disabled, the
      overrun tests run well; with big and little cores mixed together, the
      overrun test always causes an overrun loop. Memory timing differences
      between the big and little cores are the likely cause. Here are the
      debugging data of objpool_try_get_slot after try_cmpxchg_release:
      
          objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278
      
      The local copies of 'head' and 'last' were 278 and 276, and reloading
      'slot->head' and 'slot->last' gave 279 and 278. After try_cmpxchg_release,
      'slot->head' became 'head + 1', which is correct. What is wrong here is
      the stale value of 'last', which eventually led to the overrun of
      'head'.
      
      Updates to 'last' and 'head' are performed independently in push() and
      pop(), which could be the cause of this out-of-order visibility of
      'last' and 'head'. So for objpool_try_get_slot(), checking only the
      condition 'head != last' is not enough: the implicit condition
      'last - head <= nr_objs' must also be explicitly checked to guarantee
      that 'last' is always behind 'head' before an object is retrieved.
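
The two conditions above can be folded into one unsigned comparison. This is a sketch assuming 'head' and 'last' are free-running unsigned 32-bit counters; slot_has_objects() is a hypothetical helper name, not a function in lib/objpool.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True only when 'last != head' and 'last - head <= nr_objs'.
 * With unsigned wraparound both conditions collapse into
 * 'last - head - 1 < nr_objs': if last == head the subtraction wraps to
 * UINT32_MAX, and a stale 'last' that trails 'head' yields a huge value.
 */
static bool slot_has_objects(uint32_t head, uint32_t last, uint32_t nr_objs)
{
    return (uint32_t)(last - head - 1) < nr_objs;
}
```

With the values from the debugging data, the stale snapshot (head 278, last 276) fails the check and would trigger a reload, while the reloaded pair (head 278, last 279) passes for nr_objs = 8.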
      
      This patch checks 'head' and 'last', reloading them if needed, to ensure
      'last' is behind 'head' at the time of object retrieval. Performance
      tests show an average impact of about 0.1% for X86_64 and 1.12% for
      ARM64. Here are the results:
      
          OS: Debian 10 X86_64, Linux 6.6rc
          HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
                            1T         2T         4T         8T        16T
          native:     49543304   99277826  199017659  399070324  795185848
          objpool:    29909085   59865637  119692073  239750369  478005250
          objpool+:   29879313   59230743  119609856  239067773  478509029
                           32T        48T        64T        96T       128T
          native:   1596927073 2390099988 2929397330 3183875848 3257546602
          objpool:   957553042 1435814086 1680872925 2043126796 2165424198
          objpool+:  956476281 1434491297 1666055740 2041556569 2157415622
      
          OS: Debian 11 AARCH64, Linux 6.6rc
          HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
                            1T         2T         4T         8T        16T
          native:     30890508   60399915  123111980  242257008  494002946
          objpool:    14742531   28883047   57739948  115886644  232455421
          objpool+:   14107220   29032998   57286084  113730493  232232850
                           24T        32T        48T        64T        96T
          native:    746406039 1000174750 1493236240 1998318364 2942911180
          objpool:   349164852  467284332  702296756  934459713 1387898285
          objpool+:  348388180  462750976  696606096  927865887 1368402195
      
      Link: https://lore.kernel.org/all/20231114115148.298821-1-wuqiang.matt@bytedance.com/
      
      Fixes: b4edb8d2 ("lib: objpool added: ring-array based lockless MPMC")
      Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      d67f39d2
    • Linus Torvalds's avatar
      Merge tag 'hardening-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux · 994d5c58
      Linus Torvalds authored
      Pull hardening fixes from Kees Cook:
      
       - struct_group: propagate attributes to top-level union (Dmitry
         Antipov)
      
       - gcc-plugins: randstruct: Update code comment in relayout_struct
         (Gustavo A. R. Silva)
      
       - MAINTAINERS: refresh LLVM support (Nick Desaulniers)
      
      * tag 'hardening-v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
        gcc-plugins: randstruct: Update code comment in relayout_struct()
        uapi: propagate __struct_group() attributes to the container union
        MAINTAINERS: refresh LLVM support
      994d5c58