    nvme-pci: clamp max_hw_sectors based on DMA optimized limitation · 3710e2b0
    Adrian Huang authored
    When running the fio test on a 448-core AMD server with an NVMe disk,
    a soft lockup or a hard lockup call trace is shown:
    
    [soft lockup]
    watchdog: BUG: soft lockup - CPU#126 stuck for 23s! [swapper/126:0]
    RIP: 0010:_raw_spin_unlock_irqrestore+0x21/0x50
    ...
    Call Trace:
     <IRQ>
     fq_flush_timeout+0x7d/0xd0
     ? __pfx_fq_flush_timeout+0x10/0x10
     call_timer_fn+0x2e/0x150
     run_timer_softirq+0x48a/0x560
     ? __pfx_fq_flush_timeout+0x10/0x10
     ? clockevents_program_event+0xaf/0x130
     __do_softirq+0xf1/0x335
     irq_exit_rcu+0x9f/0xd0
     sysvec_apic_timer_interrupt+0xb4/0xd0
     </IRQ>
     <TASK>
     asm_sysvec_apic_timer_interrupt+0x1f/0x30
    ...
    
    Clearly, fq_flush_timeout runs for over 20 seconds. Here is the ftrace log:
    
                   |  fq_flush_timeout() {
                   |    fq_ring_free() {
                   |      put_pages_list() {
       0.170 us    |        free_unref_page_list();
       0.810 us    |      }
                   |      free_iova_fast() {
                   |        free_iova() {
     * 85622.66 us |          _raw_spin_lock_irqsave();
       2.860 us    |          remove_iova();
       0.600 us    |          _raw_spin_unlock_irqrestore();
       0.470 us    |          lock_info_report();
       2.420 us    |          free_iova_mem.part.0();
     * 85638.27 us |        }
     * 85638.84 us |      }
                   |      put_pages_list() {
       0.230 us    |        free_unref_page_list();
       0.470 us    |      }
       ...            ...
     $ 31017069 us |  }
    
    Most cores are contending for iova_rbtree_lock because of the IOVA
    flush queue mechanism.
    
    [hard lockup]
    NMI watchdog: Watchdog detected hard LOCKUP on cpu 351
    RIP: 0010:native_queued_spin_lock_slowpath+0x2d8/0x330
    
    Call Trace:
     <IRQ>
     _raw_spin_lock_irqsave+0x4f/0x60
     free_iova+0x27/0xd0
     free_iova_fast+0x4d/0x1d0
     fq_ring_free+0x9b/0x150
     iommu_dma_free_iova+0xb4/0x2e0
     __iommu_dma_unmap+0x10b/0x140
     iommu_dma_unmap_sg+0x90/0x110
     dma_unmap_sg_attrs+0x4a/0x50
     nvme_unmap_data+0x5d/0x120 [nvme]
     nvme_pci_complete_batch+0x77/0xc0 [nvme]
     nvme_irq+0x2ee/0x350 [nvme]
     ? __pfx_nvme_pci_complete_batch+0x10/0x10 [nvme]
     __handle_irq_event_percpu+0x53/0x1a0
     handle_irq_event_percpu+0x19/0x60
     handle_irq_event+0x3d/0x60
     handle_edge_irq+0xb3/0x210
     __common_interrupt+0x7f/0x150
     common_interrupt+0xc5/0xf0
     </IRQ>
     <TASK>
     asm_common_interrupt+0x2b/0x40
    ...
    
    ftrace shows that fq_ring_free runs for over 10 seconds [1]. Again, most
    cores are contending for iova_rbtree_lock because of the IOVA flush
    queue mechanism.
    
    [Root Cause]
    The root cause is that the max_hw_sectors_kb of the NVMe disk (mdts=10)
    is 4096 KB. Streaming DMA mappings longer than 128 KB cannot benefit
    from the scalable IOVA mechanism introduced by commit 9257b4a2
    ("iommu/iova: introduce per-cpu caching to iova allocation"), so
    freeing them goes through free_iova() and takes iova_rbtree_lock, as
    the traces above show.
    
    To fix the lock contention, clamp max_hw_sectors to the DMA optimized
    mapping limit so that mappings can use the scalable IOVA mechanism.
    
    Note: The issue does not happen with another NVMe disk (mdts = 5
    and max_hw_sectors_kb = 128).
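
    The arithmetic behind the two disks and the clamp can be sketched in
    userspace C. This is only an illustration, not the driver change
    itself: the helper names are hypothetical, the 4 KB controller page
    size is an assumption (MDTS is expressed as a power of two of the
    controller's minimum page size), and the 128 KB limit is taken from
    the text above rather than queried from the DMA layer.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the controller's minimum page size is 4 KB. */
#define CTRL_PAGE_KB 4u

/* From the commit text: mappings above 128 KB miss the per-CPU IOVA cache. */
#define DMA_OPT_MAPPING_BYTES (128u * 1024u)

/* Hypothetical helper: maximum transfer size in KB for a non-zero MDTS. */
static uint32_t mdts_to_kb(uint32_t mdts)
{
    return CTRL_PAGE_KB << mdts;
}

/* Hypothetical helper: convert KB to 512-byte sectors. */
static uint32_t kb_to_sectors(uint32_t kb)
{
    return kb * 2u;
}

/*
 * Hypothetical helper: clamp the hardware limit (in 512-byte sectors)
 * to the DMA optimized mapping size (in bytes).
 */
static uint32_t clamp_max_hw_sectors(uint32_t hw_sectors, uint32_t dma_opt_bytes)
{
    uint32_t opt_sectors = dma_opt_bytes >> 9; /* bytes to sectors */

    return hw_sectors < opt_sectors ? hw_sectors : opt_sectors;
}

/* Checks the figures quoted in the commit message. */
static void self_check(void)
{
    assert(mdts_to_kb(10) == 4096); /* the affected disk */
    assert(mdts_to_kb(5) == 128);   /* the unaffected disk */
}
```

    Under these assumptions, clamping the mdts=10 disk brings its limit
    from 8192 sectors (4096 KB) down to 256 sectors (128 KB), while the
    mdts=5 disk is already at 256 sectors and is left unchanged.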
    
    [1] https://gist.github.com/AdrianHuang/bf8ec7338204837631fbdaed25d19cc4

    Suggested-by: Keith Busch <kbusch@kernel.org>
    Reported-and-tested-by: Jiwei Sun <sunjw10@lenovo.com>
    Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
    Reviewed-by: Keith Busch <kbusch@kernel.org>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
pci.c 93.2 KB