1. 08 Dec, 2023 12 commits
    • x86/virt/tdx: Designate reserved areas for all TDMRs · dde3b60d
      Kai Huang authored
      As the last step of constructing TDMRs, populate reserved areas for all
      TDMRs.  Cover all memory holes and PAMTs with a TDMR reserved area.
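The idea can be sketched in a few lines of Python. This is only an illustration, not the kernel implementation: the function name, the tuple layout, and the choice to report offsets relative to the TDMR base are all assumptions made for the example.

```python
def reserved_areas(tdmr, usable_regions, pamts):
    """Return sorted (offset, size) reserved areas for one TDMR.

    tdmr is (base, size); usable_regions and pamts are lists of (base, size).
    All names and the data layout are illustrative, not the kernel's.
    """
    base, size = tdmr
    end = base + size
    areas = []

    # 1) Memory holes: every gap between usable regions inside the TDMR.
    cursor = base
    for r_base, r_size in sorted(usable_regions):
        r_end = r_base + r_size
        if r_end <= base or r_base >= end:
            continue                      # region does not touch this TDMR
        if r_base > cursor:
            areas.append((cursor, r_base - cursor))
        cursor = max(cursor, min(r_end, end))
    if cursor < end:
        areas.append((cursor, end - cursor))

    # 2) PAMTs: the part of any PAMT that overlaps this TDMR.
    for p_base, p_size in pamts:
        lo, hi = max(p_base, base), min(p_base + p_size, end)
        if lo < hi:
            areas.append((lo, hi - lo))

    # Report offsets relative to the TDMR base, in address order (assumed).
    return sorted((addr - base, sz) for addr, sz in areas)
```

A real implementation would also have to respect the per-TDMR limit on reserved areas, which this sketch ignores.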
      
      [ dhansen: trim down changelog ]
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-12-dave.hansen%40intel.com
    • x86/virt/tdx: Allocate and set up PAMTs for TDMRs · ac3a2208
      Kai Huang authored
      The TDX module uses additional metadata to record things like which
      guest "owns" a given page of memory.  This metadata, referred to as the
      Physical Address Metadata Table (PAMT), essentially serves as the
      'struct page' for the TDX module.  PAMTs are not reserved by hardware
      up front.  They must be allocated by the kernel and then given to the
      TDX module during module initialization.
      
      TDX supports 3 page sizes: 4K, 2M, and 1G.  Each "TD Memory Region"
      (TDMR) has 3 PAMTs to track the 3 supported page sizes.  Each PAMT must
      be a physically contiguous area from a Convertible Memory Region (CMR).
      However, the PAMTs which track pages in one TDMR do not need to reside
      within that TDMR but can be anywhere in CMRs.  If one PAMT overlaps with
      any TDMR, the overlapping part must be reported as a reserved area in
      that particular TDMR.
      
      Use alloc_contig_pages() since PAMT must be a physically contiguous area
      and it may be potentially large (~1/256th of the size of the given TDMR).
      The downside is alloc_contig_pages() may fail at runtime.  One (bad)
      mitigation is to launch a TDX guest early during system boot to get
      those PAMTs allocated early, but the only real fix is to add a
      boot option to allocate or reserve PAMTs during kernel boot.
      
      It is imperfect but will be improved on later.
      
      TDX only supports a limited number of reserved areas per TDMR to cover
      both PAMTs and memory holes within the given TDMR.  If many PAMTs are
      allocated within a single TDMR, the reserved areas may not be sufficient
      to cover all of them.
      
      Adopt the following policies when allocating PAMTs for a given TDMR:
      
        - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize
          the total number of reserved areas consumed for PAMTs.
        - Try to first allocate PAMT from the local node of the TDMR for better
          NUMA locality.
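The sizing behind the first policy can be sketched in Python. This is a hedged illustration only: the helper names are invented, the 4K rounding of each PAMT is an assumption, and the 16-byte entry size used in the usage note is a placeholder (the real entry sizes are reported by the TDX module).

```python
PAGE_SIZES = {"4K": 4 << 10, "2M": 2 << 20, "1G": 1 << 30}
PAGE_4K = 4 << 10

def pamt_sizes(tdmr_size, entry_size):
    """Bytes of PAMT needed per page size, each rounded up to 4K.

    entry_size maps a page-size name to bytes per PAMT entry.  The real
    values come from the TDX module; callers here pass assumed ones.
    """
    sizes = {}
    for name, page in PAGE_SIZES.items():
        raw = (tdmr_size // page) * entry_size[name]
        sizes[name] = -(-raw // PAGE_4K) * PAGE_4K   # round up to 4K
    return sizes

def pamt_chunk_layout(chunk_base, sizes):
    """Carve one contiguous allocation into the three PAMTs."""
    layout, offset = {}, chunk_base
    for name in ("4K", "2M", "1G"):
        layout[name] = (offset, sizes[name])
        offset += sizes[name]
    return layout
```

With an assumed 16-byte entry for every page size, a 1G TDMR needs 4 MiB for the 4K PAMT plus 8 KiB and 4 KiB for the other two, which is where the "~1/256th" figure comes from. Because all three tables sit in one chunk, at most one reserved area is consumed if the chunk overlaps a TDMR.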
      
      Also dump out how many pages are allocated for PAMTs when the TDX module
      is initialized successfully.  This helps answer the eternal "where did
      all my memory go?" questions.
      
      [ dhansen: merge in error handling cleanup ]
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-11-dave.hansen%40intel.com
    • x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions · f3338ac1
      Kai Huang authored
      Start working through the multiple steps needed to construct a list of
      "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions.
      
      The kernel configures TDX-usable memory regions by passing a list of
      "TD Memory Regions" (TDMRs) to the TDX module.  Each TDMR contains
      the information of the base/size of a memory region, the base/size of the
      associated Physical Address Metadata Table (PAMT) and a list of reserved
      areas in the region.
      
      Do the first step to fill out a number of TDMRs to cover all TDX memory
      regions.  To keep it simple, always try to use one TDMR for each memory
      region.  As the first step only set up the base/size for each TDMR.
      
      Each TDMR must be 1G aligned and the size must be in 1G granularity.
      This implies that one TDMR could cover multiple memory regions.  If a
      memory region spans the 1GB boundary and the former part is already
      covered by the previous TDMR, just use a new TDMR for the remaining
      part.
      
      TDX only supports a limited number of TDMRs.  Disable TDX if all TDMRs
      are consumed but there are more memory regions to cover.
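The covering rule described above can be sketched as follows. This is a simplified Python illustration with invented names, not the kernel code: it assumes non-overlapping regions given as (base, size) pairs and reports TDMR exhaustion by raising an exception.

```python
GIB = 1 << 30

def fill_out_tdmrs(regions, max_tdmrs):
    """Cover (base, size) memory regions with 1G-aligned TDMRs."""
    tdmrs = []                                  # list of [base, size]
    for base, size in sorted(regions):
        start = base // GIB * GIB               # align start down to 1G
        end = -(-(base + size) // GIB) * GIB    # align end up to 1G
        if tdmrs:
            prev_end = tdmrs[-1][0] + tdmrs[-1][1]
            if end <= prev_end:
                continue                        # fully covered already
            start = max(start, prev_end)        # only cover the remainder
        if len(tdmrs) == max_tdmrs:
            raise RuntimeError("not enough TDMRs to cover all TDX memory")
        tdmrs.append([start, end - start])
    return tdmrs
```

For example, regions at [0, 512M) and [768M, 1792M) produce two TDMRs, [0, 1G) and [1G, 2G): the second region crosses the 1G boundary, and only its remainder gets a new TDMR.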
      
      There are fancier things that could be done like trying to merge
      adjacent TDMRs.  This would allow more pathological memory layouts to be
      supported.  But, current systems are not even close to exhausting the
      existing TDMR resources in practice.  For now, keep it simple.
      
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reviewed-by: Yuan Yao <yuan.yao@intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-10-dave.hansen%40intel.com
    • x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions · 5173d3c5
      Kai Huang authored
      After the kernel selects all TDX-usable memory regions, the kernel needs
      to pass those regions to the TDX module via data structure "TD Memory
      Region" (TDMR).
      
      Add a placeholder to construct a list of TDMRs (in multiple steps) to
      cover all TDX-usable memory regions.
      
      === Long Version ===
      
      TDX provides increased levels of memory confidentiality and integrity.
      This requires special hardware support for features like memory
      encryption and storage of memory integrity checksums.  Not all memory
      satisfies these requirements.
      
      As a result, TDX introduced the concept of a "Convertible Memory Region"
      (CMR).  During boot, the firmware builds a list of all of the memory
      ranges which can provide the TDX security guarantees.  The list of these
      ranges is available to the kernel by querying the TDX module.
      
      The TDX architecture needs additional metadata to record things like
      which TD guest "owns" a given page of memory.  This metadata essentially
      serves as the 'struct page' for the TDX module.  The space for this
      metadata is not reserved by the hardware up front and must be allocated
      by the kernel and given to the TDX module.
      
      Since this metadata consumes space, the VMM can choose whether or not to
      allocate it for a given area of convertible memory.  If it chooses not
      to, the memory cannot receive TDX protections and can not be used by TDX
      guests as private memory.
      
      For every memory region that the VMM wants to use as TDX memory, it sets
      up a "TD Memory Region" (TDMR).  Each TDMR represents a physically
      contiguous convertible range and must also have its own physically
      contiguous metadata table, referred to as a Physical Address Metadata
      Table (PAMT), to track status for each page in the TDMR range.
      
      Unlike a CMR, each TDMR requires 1G granularity and alignment.  To
      support physical RAM areas that don't meet those strict requirements,
      each TDMR permits a number of internal "reserved areas" which can be
      placed over memory holes.  If PAMT metadata is placed within a TDMR it
      must be covered by one of these reserved areas.
      
      Let's summarize the concepts:
      
       CMR - Firmware-enumerated physical ranges that support TDX.  CMRs are
             4K aligned.
      TDMR - Physical address range which is chosen by the kernel to support
             TDX.  1G granularity and alignment required.  Each TDMR has
             reserved areas where TDX memory holes and overlapping PAMTs can
             be represented.
      PAMT - Physically contiguous TDX metadata.  One table for each page size
             per TDMR.  Roughly 1/256th of TDMR in size.  256G TDMR = ~1G
             PAMT.
      
      As one step of initializing the TDX module, the kernel configures
      TDX-usable memory regions by passing a list of TDMRs to the TDX module.
      
      Constructing the list of TDMRs consists of the steps below:
      
      1) Fill out TDMRs to cover all memory regions that the TDX module will
         use for TD memory.
      2) Allocate and set up PAMT for each TDMR.
      3) Designate reserved areas for each TDMR.
      
      Add a placeholder to construct TDMRs to do the above steps.  To keep
      things simple, just allocate enough space to hold the maximum number of
      TDMRs up front.
      
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-9-dave.hansen%40intel.com
    • x86/virt/tdx: Get module global metadata for module initialization · cf72bc48
      Kai Huang authored
      The TDX module global metadata provides system-wide information about
      the module.
      
      TL;DR:
      
      Use the TDH.SYS.RD SEAMCALL to tell if the module is good or not.
      
      Long Version:
      
      1) Only initialize TDX module with version 1.5 and later
      
      TDX module 1.0 has some compatibility issues with later versions of the
      module, as documented in the "Intel TDX module ABI incompatibilities
      between TDX1.0 and TDX1.5" spec.  Don't bother with module versions that
      do not have a stable ABI.
      
      2) Get the essential global metadata for module initialization
      
      TDX reports a list of "Convertible Memory Region" (CMR) to tell the
      kernel which memory is TDX compatible.  The kernel needs to build a list
      of memory regions (out of CMRs) as "TDX-usable" memory and pass them to
      the TDX module.  The kernel does this by constructing a list of "TD
      Memory Regions" (TDMRs) to cover all these memory regions and passing
      them to the TDX module.
      
      Each TDMR is a TDX architectural data structure containing the memory
      region that the TDMR covers, plus the information to track (within this
      TDMR):
        a) the "Physical Address Metadata Table" (PAMT) to track each TDX
           memory page's status (such as which TDX guest "owns" a given
           page), and
        b) the "reserved areas" to describe memory holes that cannot be used
           as TDX memory.
      
      The kernel needs to get the metadata below from the TDX module to build
      the list of TDMRs:
        a) the maximum number of supported TDMRs,
        b) the maximum number of supported reserved areas per TDMR, and
        c) the PAMT entry size for each TDX-supported page size.
      
      == Implementation ==
      
      The TDX module has two modes of fetching the metadata: one field at
      a time, or all in one blob.  Use the field-at-a-time mode for now.  It is
      slower, but there just are not enough fields now to justify the
      complexity of extra unpacking.
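A rough Python sketch of the field-at-a-time approach, with a table mapping field IDs to structure members. Everything here is hypothetical: the field IDs are made-up placeholders (the real ones are defined by the TDX module spec), the member names are invented for the example, and `read_field` stands in for a TDH.SYS.RD call.

```python
from dataclasses import dataclass

@dataclass
class TdmrSysinfo:
    max_tdmrs: int = 0
    max_reserved_per_tdmr: int = 0
    pamt_4k_entry_size: int = 0
    pamt_2m_entry_size: int = 0
    pamt_1g_entry_size: int = 0

# Hypothetical field IDs; the real values come from the TDX module spec.
FIELD_MAP = [
    (0x01, "max_tdmrs"),
    (0x02, "max_reserved_per_tdmr"),
    (0x03, "pamt_4k_entry_size"),
    (0x04, "pamt_2m_entry_size"),
    (0x05, "pamt_1g_entry_size"),
]

def read_tdmr_sysinfo(read_field):
    """Fill TdmrSysinfo by calling read_field(field_id) once per field."""
    info = TdmrSysinfo()
    for field_id, member in FIELD_MAP:
        setattr(info, member, read_field(field_id))
    return info
```

Keeping the ID-to-member mapping in a table means adding a field later is a one-line change, at the cost of one call per field.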
      
      The err_free_tdxmem=>out_put_tdxmem goto looks wonky by itself.  But
      it is the first of a bunch of error handling that will get stuck at
      its site.
      
      [ dhansen: clean up changelog and add a struct to map between
      	   the TDX module fields and 'struct tdx_tdmr_sysinfo' ]
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-8-dave.hansen%40intel.com
    • x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory · abe8dbab
      Kai Huang authored
      Start working through the multiple steps needed to initialize the TDX module.
      
      TDX provides increased levels of memory confidentiality and integrity.
      This requires special hardware support for features like memory
      encryption and storage of memory integrity checksums.  Not all memory
      satisfies these requirements.
      
      As a result, TDX introduced the concept of a "Convertible Memory Region"
      (CMR).  During boot, the firmware builds a list of all of the memory
      ranges which can provide the TDX security guarantees.  The list of these
      ranges is available to the kernel by querying the TDX module.
      
      CMRs tell the kernel which memory is TDX compatible.  The kernel needs
      to build a list of memory regions (out of CMRs) as "TDX-usable" memory
      and pass them to the TDX module.  Once this is done, those "TDX-usable"
      memory regions are fixed during the module's lifetime.
      
      To keep things simple, assume that all TDX-protected memory will come
      from the page allocator.  Make sure all pages in the page allocator
      *are* TDX-usable memory.
      
      As TDX-usable memory is a fixed configuration, take a snapshot of the
      memory configuration from memblocks at the time of module initialization
      (memblocks are modified on memory hotplug).  This snapshot is used to
      enable TDX support for *this* memory configuration only.  Use a memory
      hotplug notifier to ensure that no other RAM can be added outside of
      this configuration.
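The snapshot-plus-notifier idea can be sketched in Python. The function names and the use of (start_pfn, end_pfn) pairs are assumptions for illustration; the real code works on memblock regions and the kernel's memory hotplug notifier chain.

```python
def snapshot_memory(memblocks):
    """Freeze the current (start_pfn, end_pfn) ranges as TDX memory."""
    return tuple(sorted(memblocks))

def is_tdx_memory(tdx_memlist, start_pfn, end_pfn):
    """True if [start_pfn, end_pfn) lies entirely inside one snapshot range."""
    return any(s <= start_pfn and end_pfn <= e for s, e in tdx_memlist)

def memory_going_online(tdx_memlist, start_pfn, end_pfn):
    """Hotplug-notifier stand-in: accept or reject a range going online."""
    return "ok" if is_tdx_memory(tdx_memlist, start_pfn, end_pfn) else "reject"
```

Memory that was part of the snapshot can go offline and online again, while any range outside the snapshot is rejected.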
      
      This approach requires all memblock memory regions at the time of module
      initialization to be TDX convertible memory to work, otherwise module
      initialization will fail in a later SEAMCALL when passing those regions
      to the module.  This approach works when all boot-time "system RAM" is
      TDX convertible memory and no non-TDX-convertible memory is hot-added
      to the core-mm before module initialization.
      
      For instance, on the first generation of TDX machines, both CXL memory
      and NVDIMM are not TDX convertible memory.  Using the kmem driver to
      hot-add any CXL memory or NVDIMM to the core-mm before module
      initialization will result in failure to initialize the module.  The
      SEAMCALL error code will be available in dmesg to help the user
      understand the failure.
      
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com
    • x86/virt/tdx: Add skeleton to enable TDX on demand · 6162b310
      Kai Huang authored
      There are essentially two steps to get the TDX module ready:
      1) Get each CPU ready to run TDX
      2) Set up the shared TDX module data structures
      
      Introduce and export (to KVM) the infrastructure to do both of these
      pieces at runtime.
      
      == Per-CPU TDX Initialization ==
      
      Track the initialization status of each CPU with a per-cpu variable.
      This avoids failures in the case of KVM module reloads and handles cases
      where CPUs come online later.
      
      Generally, the per-cpu SEAMCALLs happen first.  But there's actually one
      global call that has to happen before _any_ others (TDH_SYS_INIT).  It's
      analogous to the boot CPU having to do a bit of extra work just because
      it happens to be the first one.  Track if _any_ CPU has done this call
      and then only actually do it during the first per-cpu init.
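The per-CPU tracking described above might look roughly like this in Python. The class and attribute names are invented, a set stands in for the per-cpu variable, and the name of the per-CPU call (TDH_SYS_LP_INIT) is an assumption, since only TDH_SYS_INIT is named in the text.

```python
class TdxCpuInit:
    """Track which CPUs are ready to run TDX (illustrative only)."""

    def __init__(self, seamcall):
        self.seamcall = seamcall        # callable taking a SEAMCALL name
        self.sys_init_done = False      # has any CPU done TDH_SYS_INIT?
        self.cpu_ready = set()          # stand-in for a per-cpu flag

    def cpu_enable(self, cpu):
        if cpu in self.cpu_ready:
            return                      # already done, e.g. KVM was reloaded
        if not self.sys_init_done:
            # The one global call that must precede all others.
            self.seamcall("TDH_SYS_INIT")
            self.sys_init_done = True
        # Per-CPU call; the name used here is an assumption.
        self.seamcall("TDH_SYS_LP_INIT")
        self.cpu_ready.add(cpu)
```

Calling `cpu_enable()` twice for the same CPU is harmless, and a CPU that comes online later only performs the per-CPU step.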
      
      == Shared TDX Initialization ==
      
      Create the global state function (tdx_enable()) as a simple placeholder.
      The TODO list will be pared down as functionality is added.
      
      Use a state machine protected by mutex to make sure the work in
      tdx_enable() will only be done once.  This avoids failures if the KVM
      module is reloaded.
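A minimal Python sketch of a mutex-protected state machine that runs the initialization at most once. The state names, the class, and the boolean return value are assumptions chosen for the example rather than the kernel's definitions.

```python
import enum
import threading

class ModuleStatus(enum.Enum):
    UNINITIALIZED = 0
    INITIALIZED = 1
    ERROR = 2

class TdxModule:
    def __init__(self, init_fn):
        self._lock = threading.Lock()
        self._status = ModuleStatus.UNINITIALIZED
        self._init_fn = init_fn         # the actual (one-time) init work

    def enable(self):
        """Return True if the module is usable; init runs at most once."""
        with self._lock:
            if self._status is ModuleStatus.UNINITIALIZED:
                try:
                    self._init_fn()
                    self._status = ModuleStatus.INITIALIZED
                except Exception:
                    self._status = ModuleStatus.ERROR
            return self._status is ModuleStatus.INITIALIZED
```

Later callers get the recorded result instead of repeating the work, so a failed initialization is not retried and a successful one is not redone on module reload.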
      
      A CPU must be made ready to run TDX before it can participate in
      initializing the shared parts of the module.  Any caller of tdx_enable()
      needs to ensure that it can never run on a CPU which is not ready to
      run TDX.  It needs to be wary of CPU hotplug, preemption and the
      VMX enabling state of any CPU on which it might run.
      
      == Why runtime instead of boot time? ==
      
      The TDX module can be initialized only once in its lifetime.  Instead
      of always initializing it at boot time, this implementation chooses an
      "on demand" approach and defers initializing TDX until there is a real
      need (e.g., when requested by KVM).  This approach has the following
      advantages:
      
      1) It avoids consuming the memory that must be allocated by kernel and
      given to the TDX module as metadata (~1/256th of the TDX-usable memory),
      and also saves the CPU cycles of initializing the TDX module (and the
      metadata) when TDX is not used at all.
      
      2) The TDX module design allows it to be updated while the system is
      running.  The update procedure shares quite a few steps with this "on
      demand" initialization mechanism.  The hope is that much of the "on demand"
      mechanism can be shared with a future "update" mechanism.  A boot-time
      TDX module implementation would not be able to share much code with the
      update mechanism.
      
      3) Making SEAMCALL requires VMX to be enabled.  Currently, only the KVM
      code mucks with VMX enabling.  If the TDX module were to be initialized
      separately from KVM (like at boot), the boot code would need to be
      taught how to muck with VMX enabling and KVM would need to be taught how
      to cope with that.  Making KVM itself responsible for TDX initialization
      lets the rest of the kernel stay blissfully unaware of VMX.
      
      [ dhansen: completely reorder/rewrite changelog ]
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-6-dave.hansen%40intel.com
    • x86/virt/tdx: Add SEAMCALL error printing for module initialization · df01f5ae
      Kai Huang authored
      The SEAMCALLs involved during the TDX module initialization are not
      expected to fail.  In fact, they are not expected to return any non-zero
      code (except the "running out of entropy error", which can be handled
      internally already).
      
      Add yet another set of SEAMCALL wrappers, which treat every non-zero
      return code as an error, to support printing SEAMCALL errors upon failure
      for module initialization.  Note the TDX module initialization doesn't
      use the _saved_ret() variant thus no wrapper is added for it.
      
      SEAMCALL assembly can also return kernel-defined error codes for three
      special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't
      loaded; 3) CPU isn't in VMX operation.  Whether they can legally happen
      depends on the caller, so leave it to the caller to print an error
      message when desired.
      
      Also convert the SEAMCALL error codes to the kernel error codes in the
      new wrappers so that each SEAMCALL caller doesn't have to repeat the
      conversion.
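As a loose illustration of such a wrapper, here is a Python sketch that treats any non-zero status as an error, logs it, and converts it to a kernel-style negative errno. The wrapper name, the log format, and the choice of -EIO are all assumptions for the example, and the three kernel-defined special cases are deliberately left out.

```python
import errno

def seamcall_prerr(fn_name, seamcall, args, log=print):
    """Run a SEAMCALL; log and convert any non-zero status to -errno."""
    status = seamcall(args)
    if status == 0:
        return 0
    # Zero-padded 64-bit status, so different failures line up in the log.
    log(f"SEAMCALL {fn_name} failed: 0x{status:016x}")
    return -errno.EIO                   # assumed mapping, for illustration
```

Callers then only check for a negative return value and never need to interpret raw SEAMCALL status codes themselves.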
      
      [ dhansen: Align the register dump with show_regs().  Zero-pad the
      	   contents, split on two lines and use consistent spacing. ]
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-5-dave.hansen%40intel.com
    • x86/virt/tdx: Handle SEAMCALL no entropy error in common code · 1e66a7e2
      Kai Huang authored
      Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons
      as RDRAND.  Use the kernel RDRAND retry logic for them.
      
      There are three __seamcall*() variants.  Do the SEAMCALL retry in common
      code and add a wrapper for each of them.
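The shape of this can be sketched in Python: one common retry helper, plus a thin wrapper per variant. The retry count, the status value used for "no entropy", and all names are placeholders chosen for the illustration.

```python
import functools

RETRY_LOOPS = 10                        # assumed bound, RDRAND-style
NO_ENTROPY = 0x8000020300000000         # placeholder "no entropy" status

def sc_retry(seamcall_variant, fn, args):
    """Call one __seamcall*()-style function, retrying on 'no entropy'."""
    for _ in range(RETRY_LOOPS):
        status = seamcall_variant(fn, args)
        if status != NO_ENTROPY:
            break
    return status

def make_wrapper(seamcall_variant):
    """Build the retrying wrapper for one low-level variant."""
    return functools.partial(sc_retry, seamcall_variant)
```

Each of the three variants gets its wrapper from `make_wrapper()`, so the retry policy lives in exactly one place.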
      
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-4-dave.hansen%40intel.com
    • x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC · 3115cabd
      Kai Huang authored
      TDX capable platforms are locked to x2APIC mode and cannot fall back to
      the legacy xAPIC mode when TDX is enabled by the BIOS.  TDX host support
      requires x2APIC.  Make INTEL_TDX_HOST depend on X86_X2APIC.
      
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/
      Link: https://lore.kernel.org/all/20231208170740.53979-3-dave.hansen%40intel.com
    • x86/virt/tdx: Define TDX supported page sizes as macros · d623704b
      Kai Huang authored
      TDX supports 4K, 2M and 1G page sizes.  The corresponding values are
      defined by the TDX module spec and used as TDX module ABI.  Currently,
      they are used in try_accept_one() when the TDX guest tries to accept a
      page.  However, try_accept_one() currently uses hard-coded magic values.
      
      Define TDX supported page sizes as macros and get rid of the hard-coded
      values in try_accept_one().  TDX host support will need to use them too.
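To illustrate what replacing magic values with named constants buys, here is a small Python sketch. The names mirror the commit title, but the 0/1/2 encoding and the helper function are assumptions for the example, not something stated in the text above.

```python
# Assumed encoding of the TDX page-size values (illustrative only).
TDX_PS_4K = 0
TDX_PS_2M = 1
TDX_PS_1G = 2

def tdx_page_bytes(page_size):
    """Bytes covered by one page of the given TDX page-size value."""
    if page_size not in (TDX_PS_4K, TDX_PS_2M, TDX_PS_1G):
        raise ValueError("unsupported TDX page size")
    # Each level is 512x (9 address bits) larger than the previous one.
    return 4096 << (9 * page_size)
```

Guest code and host code can then share one definition instead of each hard-coding the same numbers.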
      
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-2-dave.hansen%40intel.com
    • x86/virt/tdx: Detect TDX during kernel boot · 765a0542
      Kai Huang authored
      Intel Trust Domain Extensions (TDX) protects guest VMs from a malicious
      host and certain physical attacks.  A CPU-attested software module
      called 'the TDX module' runs inside a new isolated memory range as a
      trusted hypervisor to manage and run protected VMs.
      
      Pre-TDX Intel hardware has support for a memory encryption architecture
      called MKTME.  The memory encryption hardware underpinning MKTME is also
      used for Intel TDX.  TDX ends up "stealing" some of the physical address
      space from the MKTME architecture to provide crypto-protection for VMs.  The
      BIOS is responsible for partitioning the "KeyID" space between legacy
      MKTME and TDX.  The KeyIDs reserved for TDX are called 'TDX private
      KeyIDs' or 'TDX KeyIDs' for short.
      
      During machine boot, TDX microcode verifies that the BIOS programmed TDX
      private KeyIDs consistently and correctly across all CPU packages.  The
      MSRs are locked in this state after verification.  This
      is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration:
      it indicates not just that the hardware supports TDX, but that all the
      boot-time security checks passed.
      
      The TDX module is expected to be loaded by the BIOS when it enables TDX,
      but the kernel needs to properly initialize it before it can be used to
      create and run any TDX guests.  The TDX module will be initialized by
      the KVM subsystem when KVM wants to use TDX.
      
      Detect platform TDX support by detecting TDX private KeyIDs.
      
      The TDX module itself requires one TDX KeyID as the 'TDX global KeyID'
      to protect its metadata.  Each TDX guest also needs a TDX KeyID for its
      own protection.  Just use the first TDX KeyID as the global KeyID and
      leave the rest for TDX guests.  If no TDX KeyID is left for TDX guests,
      disable TDX as initializing the TDX module alone is useless.
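The KeyID split can be sketched in Python. The MSR layout used here (number of MKTME KeyIDs in the low 32 bits, number of TDX KeyIDs in the high 32 bits, with TDX KeyIDs starting right after the MKTME ones) is an assumption for the example, as are the function and key names.

```python
def partition_keyids(msr_value):
    """Split TDX private KeyIDs into one global KeyID and guest KeyIDs.

    Returns None when TDX should be disabled (no KeyID left for guests).
    The MSR bit layout is assumed, see the text above the block.
    """
    nr_mktme = msr_value & 0xFFFFFFFF
    nr_tdx = msr_value >> 32
    if nr_tdx < 2:
        return None                     # need 1 global + at least 1 guest
    first_tdx = nr_mktme + 1            # KeyID 0 is not a private KeyID
    return {
        "global_keyid": first_tdx,
        "guest_keyid_start": first_tdx + 1,
        "nr_guest_keyids": nr_tdx - 1,
    }
```

With 31 MKTME KeyIDs and 32 TDX KeyIDs, for example, KeyID 32 becomes the global KeyID and KeyIDs 33 onward are left for guests.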
      
      [ dhansen: add X86_FEATURE, replace helper function ]
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
      Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com
  2. 03 Dec, 2023 3 commits
  3. 02 Dec, 2023 5 commits
    • Merge tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 1b8af655
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
      
       - Fix corruption of f0/vs0 during FP/Vector save, seen as userspace
         crashes when using io-uring workers (in particular with MariaDB)
      
       - Fix KVM_RUN potentially clobbering all host userspace FP/Vector
         registers
      
      Thanks to Timothy Pearson, Jens Axboe, and Nicholas Piggin.
      
      * tag 'powerpc-6.7-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        KVM: PPC: Book3S HV: Fix KVM_RUN clobbering FP/VEC user registers
        powerpc: Don't clobber f0/vs0 during fp|altivec register save
    • Merge tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio · 17b17be2
      Linus Torvalds authored
      Pull vfio fixes from Alex Williamson:
      
       - Fix the lifecycle of a mutex in the pds variant driver such that a
         reset prior to opening the device won't find it uninitialized.
         Implement the release path to symmetrically destroy the mutex. Also
         switch a different lock from spinlock to mutex as the code path has
         the potential to sleep and doesn't need the spinlock context
         otherwise (Brett Creeley)
      
       - Fix an issue detected via randconfig where KVM tries to symbol_get an
         undeclared function. The symbol is temporarily declared
         unconditionally here, which resolves the problem and avoids churn
         relative to a series pending for the next merge window which resolves
         some of this symbol ugliness, but also fixes Kconfig dependencies
         (Sean Christopherson)
      
      * tag 'vfio-v6.7-rc4' of https://github.com/awilliam/linux-vfio:
        vfio: Drop vfio_file_iommu_group() stub to fudge around a KVM wart
        vfio/pds: Fix possible sleep while in atomic context
        vfio/pds: Fix mutex lock->magic != lock warning
    • Merge tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip · deb4b9dd
      Linus Torvalds authored
      Pull xen fixes from Juergen Gross:
      
       - A fix for the Xen event driver setting the correct return value when
         experiencing an allocation failure
      
       - A fix for allocating space for a struct in the percpu area to not
         cross page boundaries (this one is for x86, a similar one for Arm was
         already in the pull request for rc3)
      
      * tag 'for-linus-6.7a-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
        xen/events: fix error code in xen_bind_pirq_msi_to_irq()
        x86/xen: fix percpu vcpu_info allocation
    • Merge tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace · 669fc834
      Linus Torvalds authored
      Pull probes fixes from Masami Hiramatsu:
      
       - objpool: Fix an objpool overrun caused by memory/cache access delay,
         seen especially on big.LITTLE SoCs. The objpool loop works on a
         local copy of the object slot index, but that index can be changed
         by another processor in parallel. When that happens, the difference
         between the local copy of 'head' and the 'slot->last' index can
         exceed the local slot size, and slot::head must be re-read to
         update the copy.
      
       - kretprobe: Use the appropriate RCU API for the kretprobe holder.
         Since kretprobe_holder::rp is RCU managed, it should be accessed
         with rcu_assign_pointer() and rcu_dereference_check(). Also add the
         __rcu tag so that sparse can find wrong usage.
      
       - rethook: Use the appropriate RCU API for rethook::handler. Like
         kretprobe, rethook::handler is RCU managed and should be accessed
         with rcu_assign_pointer() and rcu_dereference_check(). This also
         adds the __rcu tag so that sparse can find wrong usage.
      
      * tag 'probes-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
        rethook: Use __rcu pointer for rethook::handler
        kprobes: consistent rcu api usage for kretprobe holder
        lib: objpool: fix head overrun on RK3588 SBC
    • Merge tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · 815fb87b
      Linus Torvalds authored
      Pull power management fixes from Rafael Wysocki:
       "These fix issues in two cpufreq drivers, in the AMD P-state driver and
        in the power-capping DTPM framework.
      
        Specifics:
      
         - Fix the AMD P-state driver's EPP sysfs interface in the cases when
           the performance governor is in use (Ayush Jain)
      
         - Make the ->fast_switch() callback in the AMD P-state driver return
           the target frequency as expected (Gautham R. Shenoy)
      
         - Allow user space to control the range of frequencies to use via
           scaling_min_freq and scaling_max_freq when AMD P-state driver is in
           use (Wyes Karny)
      
         - Prevent power domains needed for wakeup signaling from being turned
           off during system suspend on Qualcomm systems and prevent
           performance states votes from runtime-suspended devices from being
           lost across a system suspend-resume cycle in qcom-cpufreq-nvmem
           (Stephan Gerhold)
      
         - Fix disabling the 792 Mhz OPP in the imx6q cpufreq driver for the
           i.MX6ULL types that can run at that frequency (Christoph
           Niedermaier)
      
         - Eliminate unnecessary and harmful conversions to uW from the DTPM
           (dynamic thermal and power management) framework (Lukasz Luba)"
      
      * tag 'pm-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        cpufreq/amd-pstate: Only print supported EPP values for performance governor
        cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update
        powercap: DTPM: Fix unneeded conversions to micro-Watts
        cpufreq/amd-pstate: Fix the return value of amd_pstate_fast_switch()
        pmdomain: qcom: rpmpd: Set GENPD_FLAG_ACTIVE_WAKEUP
        cpufreq: qcom-nvmem: Preserve PM domain votes in system suspend
        cpufreq: qcom-nvmem: Enable virtual power domain devices
        cpufreq: imx6q: Don't disable 792 Mhz OPP unnecessarily
      815fb87b
  4. 01 Dec, 2023 20 commits
    • Linus Torvalds's avatar
      Merge tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm · ce474ae7
      Linus Torvalds authored
      Pull ACPI fixes from Rafael Wysocki:
       "This fixes a recently introduced build issue on ARM32 and a NULL
        pointer dereference in the ACPI backlight driver due to a design issue
        exposed by a recent change in the ACPI bus type code.
      
        Specifics:
      
         - Fix a recently introduced build issue on ARM32 platforms caused by
           an inadvertent header file breakage (Dave Jiang)
      
         - Eliminate questionable usage of acpi_driver_data() in the ACPI
           backlight cooling device code that leads to NULL pointer
           dereferences after recent ACPI core changes (Hans de Goede)"
      
      * tag 'acpi-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
        ACPI: video: Use acpi_video_device for cooling-dev driver data
        ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
      ce474ae7
    • Linus Torvalds's avatar
      Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux · 35f84584
      Linus Torvalds authored
      Pull arm64 fix from Catalin Marinas:
       "Fix a regression where the arm64 KPTI ends up enabled even on systems
        that don't need it"
      
      * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
        arm64: Avoid enabling KPTI unnecessarily
      35f84584
    • Linus Torvalds's avatar
      Merge tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu · 1a2b4185
      Linus Torvalds authored
      Pull iommu fixes from Joerg Roedel:
      
       - Fix race conditions in device probe path
      
       - Handle ERR_PTR() returns in __iommu_domain_alloc() path
      
       - Update MAINTAINERS entry for Qualcomm IOMMUs
      
       - Printk argument fix in device tree specific code
      
       - Several Intel VT-d fixes from Lu Baolu:
           - Do not support enforcing cache coherency for non-empty domains
           - Avoid devTLB invalidation if iommu is off
           - Disable PCI ATS in legacy passthrough mode
           - Support non-PCI devices when clearing context
           - Fix incorrect cache invalidation for mm notification
           - Add MTL to quirk list to skip TE disabling
           - Set variable intel_dirty_ops to static
      
      * tag 'iommu-fixes-v6.7-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu:
        iommu: Fix printk arg in of_iommu_get_resv_regions()
        iommu/vt-d: Set variable intel_dirty_ops to static
        iommu/vt-d: Fix incorrect cache invalidation for mm notification
        iommu/vt-d: Add MTL to quirk list to skip TE disabling
        iommu/vt-d: Make context clearing consistent with context mapping
        iommu/vt-d: Disable PCI ATS in legacy passthrough mode
        iommu/vt-d: Omit devTLB invalidation requests when TES=0
        iommu/vt-d: Support enforce_cache_coherency only for empty domains
        iommu: Avoid more races around device probe
        MAINTAINERS: list all Qualcomm IOMMU drivers in the QUALCOMM IOMMU entry
        iommu: Flow ERR_PTR out from __iommu_domain_alloc()
      1a2b4185
    • Linus Torvalds's avatar
      Merge tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound · 06a3c59f
      Linus Torvalds authored
      Pull sound fixes from Takashi Iwai:
       "No surprise here, including only a collection of HD-audio
        device-specific small fixes"
      
      * tag 'sound-6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
        ALSA: hda: Disable power-save on KONTRON SinglePC
        ALSA: hda/realtek: Add supported ALC257 for ChromeOS
        ALSA: hda/realtek: Headset Mic VREF to 100%
        ALSA: hda: intel-nhlt: Ignore vbps when looking for DMIC 32 bps format
        ALSA: hda: cs35l56: Enable low-power hibernation mode on SPI
        ALSA: cs35l41: Fix for old systems which do not support command
        ALSA: hda: cs35l41: Remove unnecessary boolean state variable firmware_running
        ALSA: hda - Fix speaker and headset mic pin config for CHUWI CoreBook XPro
      06a3c59f
    • Linus Torvalds's avatar
      Merge tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm · b1e51588
      Linus Torvalds authored
      Pull drm fixes from Dave Airlie:
       "Weekly fixes, mostly amdgpu fixes with a scattering of nouveau, i915,
        and a couple of reverts. Hopefully it will quieten down in coming
        weeks.
      
        drm:
         - Revert unexport of prime helpers for fd/handle conversion
      
        dma_resv:
         - Do not double add fences in dma_resv_add_fence.
      
        gpuvm:
         - Fix GPUVM license identifier.
      
        i915:
         - Mark internal GSC engine with reserved uabi class
         - Take VGA converters into account in eDP probe
         - Fix intel_pre_plane_updates() call to ensure workarounds get applied
      
        panel:
         - Revert panel fixes as they require exporting device_is_dependent.
      
        nouveau:
         - fix oversized allocations in new vm path
         - fix zero-length array
         - remove a stray lock
      
        nt36523:
         - Fix error check for nt36523.
      
        amdgpu:
         - DMUB fix
         - DCN 3.5 fixes
         - XGMI fix
         - DCN 3.2 fixes
         - Vangogh suspend fix
         - NBIO 7.9 fix
         - GFX11 golden register fix
         - Backlight fix
         - NBIO 7.11 fix
         - IB test overflow fix
         - DCN 3.1.4 fixes
         - fix a runtime pm ref count
         - Retimer fix
         - ABM fix
         - DCN 3.1.5 fix
         - Fix AGP addressing
         - Fix possible memory leak in SMU error path
         - Make sure PME is enabled in D3
         - Fix possible NULL pointer dereference in debugfs
         - EEPROM fix
         - GC 9.4.3 fix
      
        amdkfd:
         - IP version check fix
         - Fix memory leak in pqm_uninit()"
      
      * tag 'drm-fixes-2023-12-01' of git://anongit.freedesktop.org/drm/drm: (53 commits)
        Revert "drm/prime: Unexport helpers for fd/handle conversion"
        drm/amdgpu: Use another offset for GC 9.4.3 remap
        drm/amd/display: Fix some HostVM parameters in DML
        drm/amdkfd: Free gang_ctx_bo and wptr_bo in pqm_uninit
        drm/amdgpu: Update EEPROM I2C address for smu v13_0_0
        drm/amd/display: Allow DTBCLK disable for DCN35
        drm/amdgpu: Fix cat debugfs amdgpu_regs_didt causes kernel null pointer
        drm/amd: Enable PCIe PME from D3
        drm/amd/pm: fix a memleak in aldebaran_tables_init
        drm/amdgpu: fix AGP addressing when GART is not at 0
        drm/amd/display: update dcn315 lpddr pstate latency
        drm/amd/display: fix ABM disablement
        drm/amd/display: Fix black screen on video playback with embedded panel
        drm/amd/display: Fix conversions between bytes and KB
        drm/amdkfd: Use common function for IP version check
        drm/amd/display: Remove config update
        drm/amd/display: Update DCN35 clock table policy
        drm/amd/display: force toggle rate wa for first link training for a retimer
        drm/amdgpu: correct the amdgpu runtime dereference usage count
        drm/amd/display: Update min Z8 residency time to 2100 for DCN314
        ...
      b1e51588
    • Linus Torvalds's avatar
      Merge tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux · c9a925b7
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
      
       - Fix an issue with discontig page checking for IORING_SETUP_NO_MMAP
      
       - Fix an issue where disallowing mmap via IORING_SETUP_NO_MMAP also
         disallowed mmap'ed buffer rings
      
       - Fix an issue with deferred release of memory mapped pages
      
       - Fix a lockdep issue with IORING_SETUP_NO_MMAP
      
       - Use fget/fput consistently, even from our sync system calls. No real
         issue here, but if we were ever to allow closing io_uring descriptors
         it would be required. Let's play it safe and just use the full ref
         counted versions upfront. Most uses of io_uring are threaded anyway,
         and hence already doing the full version underneath.
      
      * tag 'io_uring-6.7-2023-11-30' of git://git.kernel.dk/linux:
        io_uring: use fget/fput consistently
        io_uring: free io_buffer_list entries via RCU
        io_uring/kbuf: prune deferred locked cache when tearing down
        io_uring/kbuf: recycle freed mapped buffer ring entries
        io_uring/kbuf: defer release of mapped buffer rings
        io_uring: enable io_mem_alloc/free to be used in other parts
        io_uring: don't guard IORING_OFF_PBUF_RING with SETUP_NO_MMAP
        io_uring: don't allow discontig pages for IORING_SETUP_NO_MMAP
      c9a925b7
    • Linus Torvalds's avatar
      Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux · ee0c8a9b
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
      
       - NVMe pull request via Keith:
           - Invalid namespace identification error handling (Maurizio, Ewan,
             Keith)
           - Fabrics keep-alive tuning (Mark)
      
       - Fix for a bad error check regression in bcache (Markus)
      
       - Fix for a performance regression with O_DIRECT (Ming)
      
       - Fix for a flush related deadlock (Ming)
      
       - Make the read-only warning per-partition (Yu)
      
      * tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux:
        nvme-core: check for too small lba shift
        blk-mq: don't count completed flush data request as inflight in case of quiesce
        block: Document the role of the two attribute groups
        block: warn once for each partition in bio_check_ro()
        block: move .bd_inode into 1st cacheline of block_device
        nvme: check for valid nvme_identify_ns() before using it
        nvme-core: fix a memory leak in nvme_ns_info_from_identify()
        nvme: fine-tune sending of first keep-alive
        bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
      ee0c8a9b
    • Linus Torvalds's avatar
      Merge tag 'dm-6.7/dm-fixes-2' of... · abd792f3
      Linus Torvalds authored
      Merge tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
      
      Pull device mapper fixes from Mike Snitzer:
      
       - Fix DM verity target's FEC support to always initialize IO before it
         frees it. Also fix alignment of struct dm_verity_fec_io within the
         per-bio-data
      
       - Fix DM verity target to not FEC failed readahead IO
      
       - Update DM flakey target to use MAX_ORDER rather than MAX_ORDER - 1
      
      * tag 'dm-6.7/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
        dm-flakey: start allocating with MAX_ORDER
        dm-verity: align struct dm_verity_fec_io properly
        dm verity: don't perform FEC for failed readahead IO
        dm verity: initialize fec io before freeing it
      abd792f3
    • Linus Torvalds's avatar
      Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi · ff4a9f49
      Linus Torvalds authored
      Pull SCSI fixes from James Bottomley:
       "Three small fixes, one in drivers.
      
        The core changes are to the internal representation of flags in
        scsi_devices which removes space wasting bools in favour of single bit
        flags and to add a flag to force a runtime resume which is used by ATA
        devices"
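
      As a rough illustration of why single-bit flags take less space than
      bools, here is a minimal userspace C sketch. The struct and field names
      are hypothetical and are not the actual scsi_device fields, and the
      exact sizes depend on the compiler and ABI.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags stored as one bool each: typically one byte per flag. */
struct flags_as_bools {
    bool f0, f1, f2, f3, f4, f5, f6, f7;
};

/* The same hypothetical flags as single-bit fields sharing one storage unit. */
struct flags_as_bits {
    unsigned int f0:1;
    unsigned int f1:1;
    unsigned int f2:1;
    unsigned int f3:1;
    unsigned int f4:1;
    unsigned int f5:1;
    unsigned int f6:1;
    unsigned int f7:1;
};

/* Bytes saved per structure on this particular compiler and ABI. */
static size_t bytes_saved(void)
{
    return sizeof(struct flags_as_bools) - sizeof(struct flags_as_bits);
}
```

      The trade-off is that the address of a bit-field cannot be taken, and
      adjacent bit-fields share a memory location, so concurrent writers to
      neighbouring flags need common locking.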
      
      * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
        scsi: sd: Fix system start for ATA devices
        scsi: Change SCSI device boolean fields to single bit flags
        scsi: ufs: core: Clear cmd if abort succeeds in MCQ mode
      ff4a9f49
    • Linus Torvalds's avatar
      Merge tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · c1c09da0
      Linus Torvalds authored
      Pull ext2 fix from Jan Kara:
       "Fix an ext2 bug introduced by changes in ext2 & iomap stepping on each
        other's toes (apparently the ext2 driver does not get much testing in
        linux-next)"
      
      * tag 'fs_for_v6.7-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        ext2: Fix ki_pos update for DIO buffered-io fallback case
      c1c09da0
    • Linus Torvalds's avatar
      Merge tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs · e6861be4
      Linus Torvalds authored
      Pull more bcachefs bugfixes from Kent Overstreet:
      
       - bcache & bcachefs were broken with CFI enabled; patch for closures to
         fix type punning
      
       - mark erasure coding as extra-experimental; there are incompatible
         disk space accounting changes coming for erasure coding, and I'm
         still seeing checksum errors in some tests
      
       - several fixes for durability-related issues (durability is a device
         specific setting where we can tell bcachefs that data on a given
         device should be counted as replicated x times)
      
       - a fix for a rare livelock when a btree node merge then updates a
         parent node that is almost full
      
       - fix a race in the device removal path, where dropping a pointer in a
         btree node to a device would be clobbered by an in flight btree write
         updating the btree node key on completion
      
       - fix one SRCU lock hold time warning in the btree gc code - there's
         still a bunch more of these to fix
      
       - fix a rare race where we'd start copygc before initializing the "are
         we rw" percpu refcount; copygc would think we were already ro and die
         immediately
      
      * tag 'bcachefs-2023-11-29' of https://evilpiepirate.org/git/bcachefs: (23 commits)
        bcachefs: Extra kthread_should_stop() calls for copygc
        bcachefs: Convert gc_alloc_start() to for_each_btree_key2()
        bcachefs: Fix race between btree writes and metadata drop
        bcachefs: move journal seq assertion
        bcachefs: -EROFS doesn't count as move_extent_start_fail
        bcachefs: trace_move_extent_start_fail() now includes errcode
        bcachefs: Fix split_race livelock
        bcachefs: Fix bucket data type for stripe buckets
        bcachefs: Add missing validation for jset_entry_data_usage
        bcachefs: Fix zstd compress workspace size
        bcachefs: bpos is misaligned on big endian
        bcachefs: Fix ec + durability calculation
        bcachefs: Data update path won't accidentaly grow replicas
        bcachefs: deallocate_extra_replicas()
        bcachefs: Proper refcounting for journal_keys
        bcachefs: preserve device path as device name
        bcachefs: Fix an endianness conversion
        bcachefs: Start gc, copygc, rebalance threads after initing writes ref
        bcachefs: Don't stop copygc thread on device resize
        bcachefs: Make sure bch2_move_ratelimit() also waits for move_ops
        ...
      e6861be4
    • Rafael J. Wysocki's avatar
      Merge branch 'acpi-tables' · 7d4c44a5
      Rafael J. Wysocki authored
      Merge a fix for a recently introduced build issue on ARM32 platforms
      caused by an inadvertent header file breakage (Dave Jiang).
      
      * acpi-tables:
        ACPI: Fix ARM32 platforms compile issue introduced by fw_table changes
      7d4c44a5
    • Rafael J. Wysocki's avatar
      Merge branch 'powercap' · a6b31256
      Rafael J. Wysocki authored
      Merge a power capping fix for 6.7-rc4 which eliminates unnecessary
      and harmful conversions to uW from the DTPM (dynamic thermal and power
      management) framework (Lukasz Luba).
      
      * powercap:
        powercap: DTPM: Fix unneeded conversions to micro-Watts
      a6b31256
    • Jens Axboe's avatar
      Merge tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme into block-6.7 · 8ad3ac92
      Jens Axboe authored
      Pull NVMe fixes from Keith:
      
      "nvme fixes for Linux 6.7
      
       - Invalid namespace identification error handling (Maurizio, Ewan, Keith)
       - Fabrics keep-alive tuning (Mark)"
      
      * tag 'nvme-6.7-2023-12-01' of git://git.infradead.org/nvme:
        nvme-core: check for too small lba shift
        nvme: check for valid nvme_identify_ns() before using it
        nvme-core: fix a memory leak in nvme_ns_info_from_identify()
        nvme: fine-tune sending of first keep-alive
      8ad3ac92
    • Keith Busch's avatar
      nvme-core: check for too small lba shift · 74fbc88e
      Keith Busch authored
      The block layer doesn't support logical block sizes smaller than 512
      bytes. The nvme spec doesn't support that small either, but the driver
      isn't checking to make sure the device responded with usable data.
      Failing to catch this will result in a kernel bug, either from a
      division by zero when stacking, or a zero length bio.
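
      The arithmetic behind this can be sketched in plain userspace C. This
      is only an illustration under stated assumptions, not the driver's
      actual check: the helper names and the upper bound of 31 are
      hypothetical. It shows that a shift below 9 describes a block smaller
      than one 512-byte sector, so a size expressed in whole sectors rounds
      down to zero, which is the kind of value that can later feed a
      division by zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 512-byte sectors correspond to a shift of 9 (1 << 9 == 512). */
#define MIN_LBA_SHIFT 9u
/* Hypothetical upper bound so the 32-bit shifts below stay well defined. */
#define MAX_LBA_SHIFT 31u

/* Hypothetical helper: does this shift describe a usable block size? */
static bool lba_shift_usable(unsigned int lba_shift)
{
    return lba_shift >= MIN_LBA_SHIFT && lba_shift <= MAX_LBA_SHIFT;
}

/* Logical block size expressed in whole 512-byte sectors (rounds down). */
static uint32_t block_size_in_sectors(unsigned int lba_shift)
{
    assert(lba_shift <= MAX_LBA_SHIFT);
    return (UINT32_C(1) << lba_shift) >> MIN_LBA_SHIFT;
}
```

      A caller would reject the device data when lba_shift_usable() returns
      false instead of carrying a zero-sector block size further down the
      stack.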
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      74fbc88e
    • Ming Lei's avatar
      blk-mq: don't count completed flush data request as inflight in case of quiesce · 0e4237ae
      Ming Lei authored
      A request queue quiesce may interrupt the flush sequence: the original
      request may already have been marked as COMPLETE, but it can't finish
      because of the queue quiesce.
      
      This is fine from the driver's viewpoint, because the flush sequence is
      a block layer concept and is unrelated to the driver.
      
      However, a driver (such as dm-rq) can call blk_mq_queue_inflight() to
      count & drain inflight requests, and then the wait & drain never
      completes because the completed but not-finished flush request is
      counted as inflight.
      
      Fix this issue by not counting a completed flush data request as
      inflight in case of quiesce.
      
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: John Pittman <jpittman@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0e4237ae
    • Daniel Mentz's avatar
      iommu: Fix printk arg in of_iommu_get_resv_regions() · c2183b3d
      Daniel Mentz authored
      The variable phys is defined as (struct resource *) which aligns with
      the printk format specifier %pr. Taking the address of it results in a
      value of type (struct resource **) which is incompatible with the format
      specifier %pr. Therefore, remove the address-of operator (&).
      
      Fixes: a5bf3cfc ("iommu: Implement of_iommu_get_resv_regions()")
      Signed-off-by: Daniel Mentz <danielmentz@google.com>
      Acked-by: Thierry Reding <treding@nvidia.com>
      Link: https://lore.kernel.org/r/20231108062226.928985-1-danielmentz@google.com
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      c2183b3d
    • Masami Hiramatsu (Google)'s avatar
      rethook: Use __rcu pointer for rethook::handler · a1461f1f
      Masami Hiramatsu (Google) authored
      Since rethook::handler is an RCU-managed pointer, used to tell readers
      whether the rethook has been stopped (unregistered), it should be an
      __rcu pointer and be accessed through the appropriate functions. This
      ensures the appropriate memory barriers are used when accessing it.
      OTOH, rethook::data is never changed, so we don't need to check it in
      get_kretprobe().
      
      NOTE: To avoid sparse warning, rethook::handler is defined by a raw
      function pointer type with __rcu instead of rethook_handler_t.
      
      Link: https://lore.kernel.org/all/170126066201.398836.837498688669005979.stgit@devnote2/
      
      Fixes: 54ecbe6f ("rethook: Add a generic return hook")
      Cc: stable@vger.kernel.org
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202311241808.rv9ceuAh-lkp@intel.com/
      Tested-by: JP Kobryn <inwardvessel@gmail.com>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      a1461f1f
    • JP Kobryn's avatar
      kprobes: consistent rcu api usage for kretprobe holder · d839a656
      JP Kobryn authored
      It seems that the pointer-to-kretprobe "rp" within the kretprobe_holder is
      RCU-managed, based on the (non-rethook) implementation of get_kretprobe().
      The thought behind this patch is to make use of the RCU API where possible
      when accessing this pointer so that the needed barriers are always in place
      and to self-document the code.
      
      The __rcu annotation to "rp" allows for sparse RCU checking. Plain writes
      done to the "rp" pointer are changed to make use of the RCU macro for
      assignment. For the single read, the implementation of get_kretprobe()
      is simplified by making use of an RCU macro which accomplishes the same,
      but note that the log warning text will be more generic.
      
      I did find that there is a difference in assembly generated between the
      usage of the RCU macros vs without. For example, on arm64, when using
      rcu_assign_pointer(), the corresponding store instruction is a
      store-release (STLR) which has an implicit barrier. When normal assignment
      is done, a regular store (STR) is found. In the macro case, this seems to
      be a result of rcu_assign_pointer() using smp_store_release() when the
      value to write is not NULL.
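
      The difference between a plain store and a releasing store can be
      sketched in userspace with C11 atomics. This is an analogy only, not
      kernel code: the struct and function names are hypothetical, the
      release store stands in for rcu_assign_pointer(), and an acquire load
      is used as a conservative stand-in for the dependency-ordered read
      that rcu_dereference_check() performs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the probe and its holder. */
struct probe {
    int id;
};

struct holder {
    _Atomic(struct probe *) rp;
};

/*
 * Publish a pointer with release semantics: every write made to *p before
 * this call is visible to a reader that observes the new pointer.
 */
static void holder_publish(struct holder *h, struct probe *p)
{
    atomic_store_explicit(&h->rp, p, memory_order_release);
}

/* Read the published pointer; pairs with the release store above. */
static struct probe *holder_get(struct holder *h)
{
    return atomic_load_explicit(&h->rp, memory_order_acquire);
}
```

      A reader must still handle a NULL result, which in this sketch plays
      the role of an unregistered probe.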
      
      Link: https://lore.kernel.org/all/20231122132058.3359-1-inwardvessel@gmail.com/
      
      Fixes: d741bf41 ("kprobes: Remove kretprobe hash")
      Cc: stable@vger.kernel.org
      Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      d839a656
    • wuqiang.matt's avatar
      lib: objpool: fix head overrun on RK3588 SBC · d67f39d2
      wuqiang.matt authored
      objpool overrun stress with test_objpool on OrangePi5+ SBC triggered the
      following kernel warnings:
      
          WARNING: CPU: 6 PID: 3115 at lib/objpool.c:168 objpool_push+0xc0/0x100
      
      This message is from objpool.c:168:
      
          WARN_ON_ONCE(tail - head > pool->nr_objs);
      
      The overrun test case validates the situation where pre-allocated
      objects are insufficient: 8 objects are pre-allocated for each node and
      a consumer thread per node tries to grab 16 objects in a row. The test
      system is an OrangePi 5+ with RK3588, a big.LITTLE SoC with 4x A76 and
      4x A55 cores. With either all 4 big or all 4 little cores disabled, the
      overrun tests run well, but with big and little cores mixed together,
      the overrun test always causes an overrun loop. The memory timing
      differences between big and little cores are the likely cause. Here is
      the debugging data from objpool_try_get_slot after try_cmpxchg_release:
      
          objpool_pop: cpu: 4/0 0:0 head: 278/279 tail:278 last:276/278
      
      The local copies of 'head' and 'last' were 278 and 276, and reloading
      'slot->head' and 'slot->last' gave 279 and 278. After
      try_cmpxchg_release, 'slot->head' became 'head + 1', which is correct.
      What is wrong here is the stale value of 'last', which finally led to
      the overrun of 'head'.
      
      Updates of 'last' and 'head' are performed independently in push() and
      pop(), which could be the culprit behind this out-of-order visibility
      of 'last' and 'head'. So for objpool_try_get_slot(), checking only the
      condition 'head != slot' is not enough: the implicit condition
      'last - head <= nr_objs' must also be explicitly asserted to guarantee
      'last' is always behind 'head' before the object is retrieved.
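
      The condition 'last - head <= nr_objs' can be sketched as a standalone
      C helper. This is a minimal illustration, assuming 32-bit unsigned
      indices that are allowed to wrap; the function name is hypothetical and
      the code is not the objpool implementation. With the values from the
      debugging data above, a stale 'last' of 276 against a 'head' of 278
      wraps around to a huge unsigned difference and fails the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: is a (head, last) snapshot consistent for a ring
 * holding at most nr_objs entries? The subtraction is done in unsigned
 * arithmetic, so it stays correct when the indices wrap past UINT32_MAX,
 * and a 'last' that lags 'head' shows up as a very large difference.
 */
static bool snapshot_consistent(uint32_t head, uint32_t last, uint32_t nr_objs)
{
    return (uint32_t)(last - head) <= nr_objs;
}
```

      A caller that sees an inconsistent snapshot would reload both indices
      and check again rather than use the stale pair.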
      
      This patch checks 'head' and 'last' and retries reloading them to
      ensure 'last' is behind 'head' when the object is retrieved.
      Performance tests show an average impact of about 0.1% on X86_64 and
      1.12% on ARM64. Here are the results:
      
          OS: Debian 10 X86_64, Linux 6.6rc
          HW: XEON 8336C x 2, 64 cores/128 threads, DDR4 3200MT/s
                            1T         2T         4T         8T        16T
          native:     49543304   99277826  199017659  399070324  795185848
          objpool:    29909085   59865637  119692073  239750369  478005250
          objpool+:   29879313   59230743  119609856  239067773  478509029
                           32T        48T        64T        96T       128T
          native:   1596927073 2390099988 2929397330 3183875848 3257546602
          objpool:   957553042 1435814086 1680872925 2043126796 2165424198
          objpool+:  956476281 1434491297 1666055740 2041556569 2157415622
      
          OS: Debian 11 AARCH64, Linux 6.6rc
          HW: Kunpeng-920 96 cores/2 sockets/4 NUMA nodes, DDR4 2933 MT/s
                            1T         2T         4T         8T        16T
          native:     30890508   60399915  123111980  242257008  494002946
          objpool:    14742531   28883047   57739948  115886644  232455421
          objpool+:   14107220   29032998   57286084  113730493  232232850
                           24T        32T        48T        64T        96T
          native:    746406039 1000174750 1493236240 1998318364 2942911180
          objpool:   349164852  467284332  702296756  934459713 1387898285
          objpool+:  348388180  462750976  696606096  927865887 1368402195
      
      Link: https://lore.kernel.org/all/20231114115148.298821-1-wuqiang.matt@bytedance.com/
      
      Fixes: b4edb8d2 ("lib: objpool added: ring-array based lockless MPMC")
      Signed-off-by: wuqiang.matt <wuqiang.matt@bytedance.com>
      Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      d67f39d2