    Merge tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 9d33edb2
    Linus Torvalds authored
    Pull irq updates from Thomas Gleixner:
     "Updates for the interrupt core and driver subsystem:
    
      The bulk is the rework of the MSI subsystem to support per device MSI
      interrupt domains. This solves conceptual problems of the current
      PCI/MSI design which are in the way of providing support for
      PCI/MSI[-X] and the upcoming PCI/IMS mechanism on the same device.
    
      IMS (Interrupt Message Store) is a new specification which allows
      device manufacturers to provide implementation defined storage for MSI
      messages (as opposed to PCI/MSI and PCI/MSI-X, which have a specified
      message store that is uniform across all devices). The PCI/MSI[-X]
      uniformity allowed us to get away with "global" PCI/MSI domains.
    
      IMS not only makes it possible to overcome the size limitations of the
      MSI-X table, but also gives the device manufacturer the freedom to
      store the message in arbitrary places, even in host memory which is
      shared with the device.
    
      There have been several attempts to glue this into the current MSI
      code, but after lengthy discussions it turned out that there is a
      fundamental design problem in the current PCI/MSI-X implementation.
      This needs some historical background.
    
      When PCI/MSI[-X] support was added around 2003, interrupt management
      was completely different from what we have today in the actively
      developed architectures. Interrupt management was completely
      architecture specific and while there were attempts to create common
      infrastructure, the commonalities were rudimentary and just provided
      shared data structures and interfaces so that drivers could be written
      in an architecture agnostic way.
    
      The initial PCI/MSI[-X] support obviously plugged into this model
      which resulted in some basic shared infrastructure in the PCI core
      code for setting up MSI descriptors, which are a pure software
      construct for holding data relevant for a particular MSI interrupt,
      but the actual association to Linux interrupts was completely
      architecture specific. This model is still supported today to keep
      museum architectures and notorious stragglers alive.
    
      In 2013 Intel tried to add support for hot-pluggable IO/APICs to the
      kernel, which was creating yet another architecture specific mechanism
      and resulted in an unholy mess on top of the existing horrors of x86
      interrupt handling. The x86 interrupt management code was already an
      incomprehensible maze of indirections between the CPU vector
      management, interrupt remapping and the actual IO/APIC and PCI/MSI[-X]
      implementation.
    
      At roughly the same time ARM struggled with the ever growing SoC
      specific extensions which were glued on top of the architected GIC
      interrupt controller.
    
      This resulted in a fundamental redesign of interrupt management and
      provided the concept of hierarchical interrupt domains which prevails
      today. This made it possible to disentangle the interactions between
      the x86 vector domain and interrupt remapping and also allowed ARM to
      handle the zoo of SoC specific interrupt components in a sane way.
    
      The concept of hierarchical interrupt domains aims to encapsulate the
      functionality of particular IP blocks which are involved in interrupt
      delivery so that they become extensible and pluggable. The x86
      encapsulation looks like this:
    
                                             |--- device 1
         [Vector]---[Remapping]---[PCI/MSI]--|...
                                             |--- device N
    
      where the remapping domain is an optional component. If it is not
      available, the PCI/MSI[-X] domains have the vector domain as their
      parent. This reduced the required interaction between the domains
      pretty much to the initialization phase where it is obviously
      required to establish the proper parent relationship in the
      components of the hierarchy.
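
      The parent relationship described above can be illustrated with a
      minimal userspace sketch. This is a toy model, not kernel code: the
      toy_domain structure and all toy_* helpers are hypothetical names
      invented for this illustration and do not reflect the layout of the
      kernel's interrupt domain structures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: hypothetical names, not the kernel's data structures. */
struct toy_domain {
    const char *name;
    struct toy_domain *parent;   /* NULL for the root (vector) domain */
};

/* Link a child to its parent. Done once, at initialization time. */
static void toy_domain_set_parent(struct toy_domain *child,
                                  struct toy_domain *parent)
{
    assert(child != NULL);
    child->parent = parent;
}

/* Number of domains visited when walking from d up to the root. */
static int toy_domain_depth(const struct toy_domain *d)
{
    int depth = 0;

    for (; d != NULL; d = d->parent)
        depth++;
    return depth;
}

/* Write the chain "leaf<-...<-root" into buf and return buf. */
static char *toy_domain_path(const struct toy_domain *d, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (; d != NULL; d = d->parent) {
        strncat(buf, d->name, len - strlen(buf) - 1);
        if (d->parent)
            strncat(buf, "<-", len - strlen(buf) - 1);
    }
    return buf;
}

/*
 * Build the x86-style chain from the diagram. With has_remap == 0 the
 * PCI/MSI domain gets the vector domain as its direct parent.
 */
static void toy_build_x86(struct toy_domain *vector, struct toy_domain *remap,
                          struct toy_domain *pci_msi, int has_remap)
{
    vector->name = "Vector";
    vector->parent = NULL;
    remap->name = "Remapping";
    toy_domain_set_parent(remap, vector);
    pci_msi->name = "PCI/MSI";
    toy_domain_set_parent(pci_msi, has_remap ? remap : vector);
}
```

      In this sketch the only cross-domain interaction is the parent link
      set up in toy_build_x86(); everything afterwards just follows the
      parent pointers, which is the point the paragraph above makes about
      the initialization phase.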
    
      While in most cases the model is strictly representing the chain of IP
      blocks and abstracting them so they can be plugged together to form a
      hierarchy, the design stopped short on PCI/MSI[-X]. Looking at the
      hardware it's clear that the actual PCI/MSI[-X] interrupt controller
      is not a global entity, but strictly a per PCI device entity.
    
      Here we took a shortcut on the hierarchical model and went for the
      easy solution of providing "global" PCI/MSI domains, which was possible
      because the PCI/MSI[-X] handling is uniform across the devices. This
      also allowed us to keep the existing PCI/MSI[-X] infrastructure mostly
      unchanged, which in turn made it simple to keep the existing
      architecture specific management alive.
    
      A similar problem was created in the ARM world with support for IP
      block specific message storage. Instead of going all the way to stack
      an IP block specific domain on top of the generic MSI domain, this ended
      in a construct which provides a "global" platform MSI domain which
      allows overriding the irq_write_msi_msg() callback per allocation.
    
      In the course of the lengthy discussions we identified other abuse of
      the MSI infrastructure in wireless drivers, NTB etc. where support for
      implementation specific message storage was just mindlessly glued into
      the existing infrastructure. Some of this just works by chance on
      particular platforms but will fail in hard to diagnose ways when the
      driver is used on platforms where the underlying MSI interrupt
      management code does not expect the creative abuse.
    
      Another shortcoming of today's PCI/MSI-X support is the inability to
      allocate or free individual vectors after the initial enablement of
      MSI-X. This results in a works by chance implementation of VFIO (PCI
      pass-through) where interrupts on the host side are not set up upfront
      to avoid resource exhaustion. They are expanded at run-time when the
      guest actually tries to use them. This is implemented by having the
      host disable MSI-X and then re-enable it with a larger number of
      vectors. That works by chance because most device
      drivers set up all interrupts before the device actually will utilize
      them. But that's not universally true because some drivers allocate a
      large enough number of vectors but do not utilize them until it's
      actually required, e.g. for acceleration support. But at that point
      other interrupts of the device might be in active use and the MSI-X
      disable/enable dance can just result in losing interrupts and
      therefore hard to diagnose subtle problems.
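
      The difference between the disable/enable dance and allocating a
      single vector after enablement can be sketched with a small userspace
      simulation. This is a deliberately simplified model under stated
      assumptions: toy_msix_dev and the toy_* functions are hypothetical
      names for this illustration, an interrupt raised while MSI-X is
      disabled is simply counted as lost, and none of this mirrors the
      actual kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_VECTORS 8

/* Toy device state: hypothetical, for illustration only. */
struct toy_msix_dev {
    bool msix_enabled;
    bool allocated[TOY_MAX_VECTORS];
    unsigned int delivered;   /* interrupts that reached a handler */
    unsigned int lost;        /* interrupts raised with nowhere to go */
};

/* Enable MSI-X with vectors 0..nvec-1 allocated. */
static void toy_msix_enable(struct toy_msix_dev *dev, unsigned int nvec)
{
    assert(nvec <= TOY_MAX_VECTORS);
    for (unsigned int i = 0; i < TOY_MAX_VECTORS; i++)
        dev->allocated[i] = i < nvec;
    dev->msix_enabled = true;
}

/* Disable MSI-X and drop all vectors. */
static void toy_msix_disable(struct toy_msix_dev *dev)
{
    dev->msix_enabled = false;
    for (unsigned int i = 0; i < TOY_MAX_VECTORS; i++)
        dev->allocated[i] = false;
}

/* The device raises the interrupt at the given index. */
static void toy_msix_raise(struct toy_msix_dev *dev, unsigned int index)
{
    if (dev->msix_enabled && index < TOY_MAX_VECTORS && dev->allocated[index])
        dev->delivered++;
    else
        dev->lost++;
}

/*
 * Post-enable allocation of one vector at a given index. MSI-X stays
 * enabled, so vectors already in use are not disturbed.
 * Returns 0 on success, -1 if the request cannot be satisfied.
 */
static int toy_msix_alloc_at(struct toy_msix_dev *dev, unsigned int index)
{
    if (!dev->msix_enabled || index >= TOY_MAX_VECTORS || dev->allocated[index])
        return -1;
    dev->allocated[index] = true;
    return 0;
}
```

      Calling toy_msix_disable(), then toy_msix_raise() on a vector that was
      in active use, then toy_msix_enable() with a larger count increments
      the lost counter in this model. Growing the set with
      toy_msix_alloc_at() instead leaves the existing vectors deliverable
      the whole time.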
    
      Last but not least the "global" PCI/MSI-X domain approach prevents
      using PCI/MSI[-X] and PCI/IMS on the same device because IMS no
      longer provides a uniform storage and configuration model.
    
      The solution to this is to implement the missing step and switch from
      global PCI/MSI domains to per device PCI/MSI domains. The resulting
      hierarchy then looks like this:
    
                                  |--- [PCI/MSI] device 1
         [Vector]---[Remapping]---|...
                                  |--- [PCI/MSI] device N
    
      which in turn makes it possible to support multiple domains per
      device:
    
                                  |--- [PCI/MSI] device 1
                                  |--- [PCI/IMS] device 1
         [Vector]---[Remapping]---|...
                                  |--- [PCI/MSI] device N
                                  |--- [PCI/IMS] device N
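
      Why per device domains help can be sketched in userspace as follows.
      Each domain carries its own message write callback, so a uniform
      MSI-X table and an implementation defined IMS store can coexist on
      one device under the same parent. All names here (toy_dev_domain,
      toy_ims_slot, the toy_* functions) are hypothetical, and the IMS slot
      layout is an invented example of "implementation defined" storage,
      not taken from any real device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* An MSI message: the address to write to and the data to write. */
struct toy_msg {
    uint64_t address;
    uint32_t data;
};

struct toy_dev_domain;
typedef void (*toy_write_msg_fn)(struct toy_dev_domain *dom,
                                 unsigned int index,
                                 const struct toy_msg *msg);

/* Toy per device domain: hypothetical, not the kernel's structure. */
struct toy_dev_domain {
    const char *kind;            /* "PCI/MSI-X" or "PCI/IMS" */
    const void *parent;          /* shared parent (remapping or vector) */
    toy_write_msg_fn write_msg;  /* storage specific message writer */
    void *store;                 /* where the messages live */
    unsigned int size;           /* number of entries in the store */
};

/* MSI-X table entry: address low/high, data, vector control. */
struct toy_msix_entry {
    uint32_t addr_lo, addr_hi, data, ctrl;
};

static void toy_msix_write(struct toy_dev_domain *dom, unsigned int index,
                           const struct toy_msg *msg)
{
    struct toy_msix_entry *tbl = dom->store;

    assert(index < dom->size);
    tbl[index].addr_lo = (uint32_t)msg->address;
    tbl[index].addr_hi = (uint32_t)(msg->address >> 32);
    tbl[index].data = msg->data;
}

/* Invented IMS slot layout, e.g. an array living in host memory. */
struct toy_ims_slot {
    uint32_t data;
    uint64_t address;
};

static void toy_ims_write(struct toy_dev_domain *dom, unsigned int index,
                          const struct toy_msg *msg)
{
    struct toy_ims_slot *slots = dom->store;

    assert(index < dom->size);
    slots[index].data = msg->data;
    slots[index].address = msg->address;
}
```

      A device in this model would own two toy_dev_domain instances, one
      using toy_msix_write() and one using toy_ims_write(), both pointing
      at the same parent. Code above the domains only ever calls
      write_msg() and does not need to know which store is behind it.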
    
      This work converts the MSI and PCI/MSI core and the x86 interrupt
      domains to the new model, provides new interfaces for post-enable
      allocation/free of MSI-X interrupts and the base framework for
      PCI/IMS. PCI/IMS has been verified with the work in progress IDXD
      driver.
    
      There is work in progress to convert ARM over which will replace the
      platform MSI train-wreck. The cleanup of VFIO, NTB and other creative
      "solutions" is in the works as well.
    
      Drivers:
    
       - Updates for the LoongArch interrupt chip drivers
    
       - Support for MTK CIRQv2
    
       - The usual small fixes and updates all over the place"
    
    * tag 'irq-core-2022-12-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (134 commits)
      irqchip/ti-sci-inta: Fix kernel doc
      irqchip/gic-v2m: Mark a few functions __init
      irqchip/gic-v2m: Include arm-gic-common.h
      irqchip/irq-mvebu-icu: Fix works by chance pointer assignment
      iommu/amd: Enable PCI/IMS
      iommu/vt-d: Enable PCI/IMS
      x86/apic/msi: Enable PCI/IMS
      PCI/MSI: Provide pci_ims_alloc/free_irq()
      PCI/MSI: Provide IMS (Interrupt Message Store) support
      genirq/msi: Provide constants for PCI/IMS support
      x86/apic/msi: Enable MSI_FLAG_PCI_MSIX_ALLOC_DYN
      PCI/MSI: Provide post-enable dynamic allocation interfaces for MSI-X
      PCI/MSI: Provide prepare_desc() MSI domain op
      PCI/MSI: Split MSI-X descriptor setup
      genirq/msi: Provide MSI_FLAG_MSIX_ALLOC_DYN
      genirq/msi: Provide msi_domain_alloc_irq_at()
      genirq/msi: Provide msi_domain_ops::prepare_desc()
      genirq/msi: Provide msi_desc::msi_data
      genirq/msi: Provide struct msi_map
      x86/apic/msi: Remove arch_create_remap_msi_irq_domain()
      ...