    scsi: mpt3sas: Irq poll to avoid CPU hard lockups · 320e77ac
    Suganath Prabu authored
    Issue Description:
    We have seen CPU lockups in the field on systems with a high logical CPU
    count (more than 96).  The SAS3.0 controller (Invader series) supports at
    most 96 MSI-x vectors and the SAS3.5 product (Ventura) supports at most
    128 MSI-x vectors.
    
    This may be a generic issue (for any PCI device that supports completion
    on multiple reply queues).  It is explained here with respect to mpt3sas
    supported hardware just to simplify the problem and the possible changes
    to handle such issues.  The IT HBA (mpt3sas) supports multiple reply
    queues in the completion path.  The driver creates MSI-x vectors for the
    controller as "min of (FW supported Reply queue, Logical CPUs)".  If the
    submitter is not interrupted via completion on the same CPU, there is a
    loop in the IO path.  This behavior can cause hard/soft CPU lockups, IO
    timeouts, system sluggishness, etc.
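
The vector-count rule quoted above can be sketched as follows. This is a minimal illustration, not driver code: the function name and parameters are hypothetical, and the values 96 and 128 are taken from the controller limits mentioned earlier.

```python
def msix_vector_count(fw_reply_queues: int, logical_cpus: int) -> int:
    """Hypothetical helper: vectors created = min(FW reply queues, CPUs)."""
    return min(fw_reply_queues, logical_cpus)


# A SAS3.0-class controller (at most 96 vectors) on a 128-CPU system:
# fewer vectors than CPUs, so some CPUs must share a reply queue.
vectors_big_host = msix_vector_count(96, 128)

# The same controller on a 64-CPU system: one vector per CPU is possible.
vectors_small_host = msix_vector_count(96, 64)
```

When the result is smaller than the CPU count, completions for several submitting CPUs necessarily land on a shared vector.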
    
    Example - one CPU (e.g. CPU A) is busy submitting IOs and another CPU
    (e.g. CPU B) is busy processing the corresponding reply descriptors from
    the reply descriptor queue upon receiving interrupts from the HBA.  If
    CPU A is continuously pumping IOs, then CPU B (which is executing the
    ISR) will always see valid reply descriptors in the reply descriptor
    queue and will keep processing them in a loop without leaving the ISR
    handler.
    
    The mpt3sas driver exits the ISR handler when it finds an unused reply
    descriptor in the reply descriptor queue.  Since CPU A is continuously
    sending IOs, CPU B may always see a valid reply descriptor (posted by the
    HBA firmware after processing the IO) in the reply descriptor queue.  In
    the worst case, the driver never leaves this loop in the ISR handler.
    Eventually, the watchdog detects a CPU lockup.
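
The exit condition described above can be modelled with a small user-space simulation. It is only a sketch under stated assumptions: the queue is a plain deque rather than a hardware reply descriptor ring, `refill_per_reply` is a hypothetical stand-in for the rate at which firmware posts new replies while other CPUs keep submitting, and `max_steps` exists solely so the simulation terminates.

```python
from collections import deque
from typing import Tuple


def isr_drain(queue: deque, refill_per_reply: int, max_steps: int) -> Tuple[int, bool]:
    """Model an ISR loop whose only exit is finding the queue empty.

    Returns (replies_processed, exited_because_queue_was_empty).
    """
    processed = 0
    while processed < max_steps:
        if not queue:                  # "unused" descriptor: the only exit
            return processed, True
        queue.popleft()                # consume one valid reply descriptor
        processed += 1
        for _ in range(refill_per_reply):
            queue.append("reply")      # submitters keep the queue non-empty
    return processed, False            # still looping when the cap was hit


idle_case = isr_drain(deque(["reply"] * 4), refill_per_reply=0, max_steps=1000)
busy_case = isr_drain(deque(["reply"] * 4), refill_per_reply=1, max_steps=1000)
```

With no refill the loop drains the four replies and exits. With one new reply per processed reply the queue never empties, and only the artificial cap stops the loop, which is the situation the watchdog eventually reports as a lockup.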
    
    The behavior described above is not common if "rq_affinity" is set to 2
    or affinity_hint is honored by irqbalance as "exact".  If rq_affinity is
    set to 2, the submitter is always interrupted via completion on the same
    CPU.  If irqbalance is using the "exact" policy, the interrupt is
    delivered to the submitter CPU.
    
    If the ratio of CPU count to MSI-x vector (reply descriptor queue) count
    is not 1:1, we are still exposed to the issue explained above, and for
    that case we don't have any solution.
    
    Exposure to soft/hard lockups if the CPU count is more than the number
    of MSI-x vectors supported by the device:
    
    If the ratio of CPU count to MSI-x vector count is not 1:1 (in other
    words, if it is something like X:1, where X > 1), then the 'exact'
    irqbalance policy or rq_affinity = 2 won't help to avoid CPU hard/soft
    lockups.  There is no one-to-one mapping between CPUs and MSI-x vectors;
    instead, one MSI-x interrupt (or reply descriptor queue) is shared by a
    group of CPUs, so a loop can form in the IO path within that CPU group
    and lockups may be observed.
    
    For example: consider a system with two NUMA nodes, each with four
    logical CPUs, and assume that the number of MSI-x vectors enabled on the
    HBA is two.  The ratio of CPU count to MSI-x vector count is then 4:1,
    e.g. MSI-x vector 0 has affinity to CPU 0, CPU 1, CPU 2 & CPU 3 of NUMA
    node 0 and MSI-x vector 1 has affinity to CPU 4, CPU 5, CPU 6 & CPU 7 of
    NUMA node 1.
    
    numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3                 --> MSI-x 0
    node 0 size: 65536 MB
    node 0 free: 63176 MB
    node 1 cpus: 4 5 6 7                 --> MSI-x 1
    node 1 size: 65536 MB
    node 1 free: 63176 MB
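
The 4:1 layout in this example can be reproduced with a short sketch. It assumes CPUs are spread over vectors in contiguous, equally sized groups, which matches the example but is a simplification: real affinity assignment is NUMA-aware and handles uneven counts. The function name is hypothetical.

```python
from typing import Dict, List


def vector_affinity(num_cpus: int, num_vectors: int) -> Dict[int, List[int]]:
    """Hypothetical helper: map each vector to a contiguous group of CPUs.

    Assumes num_cpus is an exact multiple of num_vectors.
    """
    group = num_cpus // num_vectors    # X in the X:1 ratio
    return {
        vec: list(range(vec * group, (vec + 1) * group))
        for vec in range(num_vectors)
    }


affinity = vector_affinity(num_cpus=8, num_vectors=2)
ratio_x = len(affinity[0])             # CPUs sharing one MSI-x vector
```

For eight CPUs and two vectors this yields vector 0 for CPUs 0 to 3 and vector 1 for CPUs 4 to 7, so four CPUs share each reply queue.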
    
    Assume that the user starts an application which uses all the CPUs of
    NUMA node 0 for issuing IOs.  Only one CPU from the affinity list (it
    can be any CPU, since this depends on irqbalance), say CPU 0, will
    receive the interrupts from MSI-x vector 0 for all the IOs.  Over time,
    the IO submission percentage on CPU 0 decreases and its ISR processing
    percentage increases, as it gets busier processing interrupts.
    Gradually, the IO submission percentage on CPU 0 drops to zero and its
    ISR processing percentage reaches 100 percent, as an IO loop has formed
    within NUMA node 0: CPU 1, CPU 2 & CPU 3 are continuously busy
    submitting heavy IO and only CPU 0 is busy in the ISR path, as it always
    finds a valid reply descriptor in the reply descriptor queue.
    Eventually, we will observe a hard lockup here.
    
    The likelihood of hard/soft lockups grows with the value of X: the
    higher X is, the more likely CPU lockups are to be observed.
    
    Solution: Use the IRQ poll interface defined in "irq_poll.c".  The
    mpt3sas driver will execute the ISR routine in softirq context and will
    always quit the loop based on the budget provided by the IRQ poll
    interface.
    
    In these scenarios (i.e. where the ratio of CPU count to MSI-x vector
    count is X:1, where X > 1), the IRQ poll interface avoids CPU hard
    lockups through a voluntary exit from reply queue processing based on
    the budget.  Note - only one MSI-x vector is busy doing processing.
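
The budget-based exit can be illustrated with a user-space model. This is a sketch of the idea only, not the kernel irq_poll API: the function name, the deque standing in for the reply descriptor queue, and the budget value of 32 are all assumptions made for the illustration.

```python
from collections import deque
from typing import Tuple


def poll_reply_queue(queue: deque, budget: int) -> Tuple[int, bool]:
    """Process at most `budget` replies, then return voluntarily.

    Returns (completed, rearm_interrupt). The interrupt is re-armed only
    when the queue was drained before the budget ran out; otherwise the
    caller is expected to schedule another polling pass later.
    """
    completed = 0
    while completed < budget and queue:
        queue.popleft()                # consume one valid reply descriptor
        completed += 1
    return completed, completed < budget


backlog = deque(range(100))
first_pass = poll_reply_queue(backlog, budget=32)   # budget exhausted
left_after_first_pass = len(backlog)
short_pass = poll_reply_queue(deque(range(5)), budget=32)  # queue drained
```

However fast new replies arrive, each pass ends after at most `budget` completions, so the CPU gets a chance to run other work between passes instead of staying in the handler indefinitely.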
    
    Irqstat output:
    
    IRQs / 1 second(s)
    IRQ#   TOTAL   NODE0  NODE1  NODE2  NODE3  NAME
      44  122871  122871      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix0
      45       0       0      0      0      0  IR-PCI-MSI-edge mpt3sas0-msix1
    
    We use this approach only if the CPU count is more than the number of
    MSI-x vectors supported by the FW.

    Signed-off-by: Suganath Prabu <suganath-prabu.subramani@broadcom.com>
    Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>