    xfs, iomap: limit individual ioend chain lengths in writeback
    Trond Myklebust reported soft lockups in XFS IO completion such as
    this:
    
     watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
     CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
     Workqueue: xfs-conv/md127 xfs_end_io [xfs]
     RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
     Call Trace:
      wake_up_page_bit+0x8a/0x110
      iomap_finish_ioend+0xd7/0x1c0
      iomap_finish_ioends+0x7f/0xb0
      xfs_end_ioend+0x6b/0x100 [xfs]
      xfs_end_io+0xb9/0xe0 [xfs]
      process_one_work+0x1a7/0x360
      worker_thread+0x1fa/0x390
      kthread+0x116/0x130
      ret_from_fork+0x35/0x40
    
    Ioends are processed as an atomic completion unit when all the
    chained bios in the ioend have completed their IO. Logically
    contiguous ioends can also be merged and completed as a single,
    larger unit. Both of these things can be problematic because
    neither the bio chains per ioend nor the size of the merged ioends
    processed as a single completion is bounded.
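
    For orientation, here is a minimal sketch of the shapes involved.
    These are hypothetical, simplified definitions for illustration
    only; the real struct iomap_ioend lives in include/linux/iomap.h
    and carries more state than shown here:

      /* Hypothetical, simplified shapes for illustration only. */
      struct sketch_bio {
              struct sketch_bio *bi_parent;  /* chained bio: parent completes last */
              unsigned int nr_pages;         /* pages pinned under writeback */
      };

      struct sketch_ioend {
              struct sketch_ioend *io_next;  /* completion-time merge chain */
              long long io_offset;           /* file offset of this ioend */
              long long io_size;             /* bytes of file range covered */
              struct sketch_bio *io_bio;     /* head of the chained bios */
      };

    Completion has to walk both chains: every bio hanging off io_bio,
    and, after merging, every ioend linked through io_next.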
    
    If we have a large sequential dirty region in the page cache,
    write_cache_pages() will keep feeding us sequential pages and we
    will keep mapping them into ioends and bios until we get a dirty
    page at a non-sequential file offset. These large sequential runs
    will result in bio and ioend chaining to optimise the IO
    patterns. The pages under writeback are pinned within these chains
    until the submission chaining is broken, allowing the entire chain
    to be completed. This can result in huge chains being processed
    in IO completion context.
    
    We get deep bio chaining if we have large contiguous physical
    extents. We will keep adding pages to the current bio until it is
    full, then we'll chain a new bio to keep adding pages for writeback.
    Hence we can build bio chains that map millions of pages and tens of
    gigabytes of RAM if the page cache contains big enough contiguous
    dirty file regions. This long bio chain pins those pages until the
    final bio in the chain completes and the ioend can iterate all the
    chained bios and complete them.
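
    As a hedged sketch of that submission-side pattern (hypothetical
    helper names, reusing the simplified types above; not the actual
    fs/iomap code):

      #define SKETCH_BIO_MAX_PAGES 256       /* illustrative bio capacity */

      /* Allocate a new bio and chain it to the previous one; the parent
       * does not complete until the whole chain has. */
      struct sketch_bio *sketch_chain_bio(struct sketch_bio *prev);

      /* Nothing bounds the chain length: a long enough run of sequential
       * dirty pages keeps extending it, and every page stays pinned until
       * the final bio in the chain completes. */
      static struct sketch_bio *sketch_add_page(struct sketch_bio *bio)
      {
              if (bio->nr_pages == SKETCH_BIO_MAX_PAGES)
                      bio = sketch_chain_bio(bio);
              bio->nr_pages++;
              return bio;
      }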
    
    OTOH, if we have a physically fragmented file, we end up submitting
    one ioend per physical fragment, each with a small bio or bio chain
    attached to it. We do not chain these at IO submission time;
    instead we chain them at completion time based on file offset via
    iomap_ioend_try_merge(). Hence we can end up with unbound ioend
    chains being built via completion merging.
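
    A hedged sketch of that completion-side merging, again with the
    simplified types above (the real logic is iomap_ioend_try_merge()
    in fs/iomap/buffered-io.c):

      /* Absorb logically contiguous completed ioends into head's chain.
       * Nothing bounds how many get merged, so a badly fragmented file
       * can build an enormous chain here. Returns the first ioend that
       * could not be merged. */
      static struct sketch_ioend *
      sketch_try_merge(struct sketch_ioend *head, struct sketch_ioend *pending)
      {
              struct sketch_ioend *tail = head;

              while (pending &&
                     head->io_offset + head->io_size == pending->io_offset) {
                      head->io_size += pending->io_size;
                      tail->io_next = pending;  /* chain grows without bound */
                      tail = pending;
                      pending = pending->io_next;
              }
              tail->io_next = NULL;             /* terminate the merged chain */
              return pending;
      }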
    
    XFS can then do COW remapping or unwritten extent conversion on that
    merged chain, which involves walking an extent fragment at a time
    and running a transaction to modify the physical extent information.
    IOWs, we merge all the discontiguous ioends together into a
    contiguous file range, only to then process them individually as
    discontiguous extents.
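
    As a hedged sketch of that completion-side extent walk (the helper
    below is hypothetical, standing in for the XFS unwritten conversion
    and COW remap machinery, which runs one transaction per extent):

      int sketch_convert_one_extent(long long offset, long long count,
                                    long long *converted);

      /* Walk the merged file range one physical extent at a time. With
       * an unbound merged chain this loop can run for a very long time. */
      static int sketch_end_ioend(long long offset, long long size)
      {
              long long end = offset + size;

              while (offset < end) {
                      long long done;
                      int error;

                      error = sketch_convert_one_extent(offset, end - offset,
                                                        &done);
                      if (error)
                              return error;
                      offset += done;
              }
              return 0;
      }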
    
    This extent manipulation is computationally expensive and can run in
    a tight loop, so merging logically contiguous but physically
    discontiguous ioends gains us nothing except hiding the fact that
    we broke the ioends up into individual physical extents at
    submission and then need to loop over those individual physical
    extents at completion.
    
    Hence we need mechanisms to limit ioend sizes and to break up
    completion processing of large merged ioend chains; all three are
    sketched in code after the list below:
    
    1. bio chains per ioend need to be bound in length. Pure overwrites
    go straight to iomap_finish_ioend() in softirq context with the
    exact bio chain attached to the ioend by submission. Hence the only
    way to prevent long holdoffs here is to bound ioend submission
    sizes because we can't reschedule in softirq context.
    
    2. iomap_finish_ioends() has to handle unbound merged ioend chains
    correctly. This relies on any one call to iomap_finish_ioend() being
    bound in runtime so that cond_resched() can be issued regularly as
    the long ioend chain is processed. i.e. this relies on mechanism #1
    limiting individual ioend sizes in order to work correctly.
    
    3. filesystems have to loop over the merged ioends to process
    physical extent manipulations. This means they can loop internally,
    and so we break merging at physical extent boundaries so the
    filesystem can easily insert reschedule points between individual
    extent manipulations.
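
    Taken together, a hedged sketch of the three mechanisms, with
    hypothetical names and an illustrative batch size, building on the
    simplified types above (the actual patch differs in detail):

      #define SKETCH_IOEND_BATCH 4096    /* pages per ioend: illustrative */

      void sketch_finish_one(struct sketch_ioend *ioend, int error);
      void sketch_cond_resched(void);    /* cond_resched() in the kernel */

      /* #1: bound ioend sizes at submission so that completing a single
       * ioend in softirq context is bounded in runtime. */
      static int sketch_can_add_to_ioend(struct sketch_ioend *ioend,
                                         long long pos, unsigned int io_pages)
      {
              if (pos != ioend->io_offset + ioend->io_size)
                      return 0;          /* not logically contiguous */
              if (io_pages >= SKETCH_IOEND_BATCH)
                      return 0;          /* full: start a new ioend */
              return 1;
      }

      /* #2: process a merged chain of bounded ioends in task context,
       * rescheduling between ioends so we never hog the CPU. */
      static void sketch_finish_ioends(struct sketch_ioend *ioend, int error)
      {
              while (ioend) {
                      /* grab the link before the ioend is torn down */
                      struct sketch_ioend *next = ioend->io_next;

                      sketch_finish_one(ioend, error);  /* bounded by #1 */
                      sketch_cond_resched();
                      ioend = next;
              }
      }

      /* #3: refuse to merge across physical extent boundaries so the
       * filesystem's per-extent loop maps one-to-one onto the merged
       * chain and can reschedule between extent manipulations. */
      static int sketch_can_merge(const struct sketch_ioend *a,
                                  const struct sketch_ioend *b,
                                  unsigned long long a_end_sector,
                                  unsigned long long b_sector)
      {
              if (a->io_offset + a->io_size != b->io_offset)
                      return 0;          /* not logically contiguous */
              if (a_end_sector != b_sector)
                      return 0;          /* physically discontiguous */
              return 1;
      }
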
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>