• Darrick J. Wong's avatar
    xfs: reserve less log space when recovering log intent items · 3c919b09
    Darrick J. Wong authored
    Wengang Wang reports that a customer's system was running a number of
    truncate operations on a filesystem with a very small log.  Contention
    on the reserve heads lead to other threads stalling on smaller updates
    (e.g.  mtime updates) long enough to result in the node being rebooted
    on account of the lack of responsivenes.  The node failed to recover
    because log recovery of an EFI became stuck waiting for a grant of
    reserve space.  From Wengang's report:
    
    "For the file deletion, log bytes are reserved basing on
    xfs_mount->tr_itruncate which is:
    
        tr_logres = 175488,
        tr_logcount = 2,
        tr_logflags = XFS_TRANS_PERM_LOG_RES,
    
    "You see it's a permanent log reservation with two log operations (two
    transactions in rolling mode).  After calculation (xlog_calc_unit_res()
    adds space for various log headers), the final log space needed per
    transaction changes from  175488 to 180208 bytes.  So the total log
    space needed is 360416 bytes (180208 * 2).  [That quantity] of log space
    (360416 bytes) needs to be reserved for both run time inode removing
    (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover())."
    
    In other words, runtime pre-reserves 360K of space in anticipation of
    running a chain of two transactions in which each transaction gets a
    180K reservation.
    
    Now that we've allocated the transaction, we delete the bmap mapping,
    log an EFI to free the space, and roll the transaction as part of
    finishing the deferops chain.  Rolling creates a new xfs_trans which
    shares its ticket with the old transaction.  Next, xfs_trans_roll calls
    __xfs_trans_commit with regrant == true, which calls xlog_cil_commit
    with the same regrant parameter.
    
    xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and
    subtracts t_curr_res from the reservation and write heads.
    
    If the filesystem is fresh and the first transaction only used (say)
    20K, then t_curr_res will be 160K, and we give that much reservation
    back to the reservation head.  Or if the file is really fragmented and
    the first transaction actually uses 170K, then t_curr_res will be 10K,
    and that's what we give back to the reservation.
    
    Having done that, we're now headed into the second transaction with an
    EFI and 180K of reservation.  Other threads apparently consumed all the
    reservation for smaller transactions, such as timestamp updates.
    
    Now let's say the first transaction gets written to disk and we crash
    without ever completing the second transaction.  Now we remount the fs,
    log recovery finds the unfinished EFI, and calls xfs_efi_recover to
    finish the EFI.  However, xfs_efi_recover starts a new tr_itruncate
    tranasction, which asks for 360K log reservation.  This is a lot more
    than the 180K that we had reserved at the time of the crash.  If the
    first EFI to be recovered is also pinning the tail of the log, we will
    be unable to free any space in the log, and recovery livelocks.
    
    Wengang confirmed this:
    
    "Now we have the second transaction which has 180208 log bytes reserved
    too. The second transaction is supposed to process intents including
    extent freeing.  With my hacking patch, I blocked the extent freeing 5
    hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved.
    
    "With my test case, other transactions (update timestamps) then happen.
    As my hacking patch pins the journal tail, those timestamp-updating
    transactions finally use up (almost) all the left available log space
    (in memory in on disk).  And finally the on disk (and in memory)
    available log space goes down near to 180208 bytes.  Those 180208 bytes
    are reserved by [the] second (extent-free) transaction [in the chain]."
    
    Wengang and I noticed that EFI recovery starts a transaction, completes
    one step of the chain, and commits the transaction without completing
    any other steps of the chain.  Those subsequent steps are completed by
    xlog_finish_defer_ops, which allocates yet another transaction to
    finish the rest of the chain.  That transaction gets the same tr_logres
    as the head transaction, but with tr_logcount = 1 to force regranting
    with every roll to avoid livelocks.
    
    In other words, we already figured this out in commit 929b92f6
    ("xfs: xfs_defer_capture should absorb remaining transaction
    reservation"), but should have applied that logic to each intent item's
    recovery function.  For Wengang's case, the xfs_trans_alloc call in the
    EFI recovery function should only be asking for a single transaction's
    worth of log reservation -- 180K, not 360K.
    
    Quoting Wengang again:
    
    "With log recovery, during EFI recovery, we use tr_itruncate again to
    reserve two transactions that needs 360416 log bytes.  Reserving 360416
    bytes fails [stalls] because we now only have about 180208 available.
    
    "Actually during the EFI recover, we only need one transaction to free
    the extents just like the 2nd transaction at RUNTIME.  So it only needs
    to reserve 180208 rather than 360416 bytes.  We have (a bit) more than
    180208 available log bytes on disk, so [if we decrease the reservation
    to 180K] the reservation goes and the recovery [finishes].  That is to
    say: we can fix the log recover part to fix the issue. We can introduce
    a new xfs_trans_res xfs_mount->tr_ext_free
    
    {
      tr_logres = 175488,
      tr_logcount = 0,
      tr_logflags = 0,
    }
    
    "and use tr_ext_free instead of tr_itruncate in EFI recover."
    
    However, I don't think it quite makes sense to create an entirely new
    transaction reservation type to handle single-stepping during log
    recovery.  Instead, we should copy the transaction reservation
    information in the xfs_mount, change tr_logcount to 1, and pass that
    into xfs_trans_alloc.  We know this won't risk changing the min log size
    computation since we always ask for a fraction of the reservation for
    all known transaction types.
    
    This looks like it's been lurking in the codebase since commit
    3d3c8b52, which changed the xfs_trans_reserve call in
    xlog_recover_process_efi to use the tr_logcount in tr_itruncate.
    That changed the EFI recovery transaction from making a
    non-XFS_TRANS_PERM_LOG_RES request for one transaction's worth of log
    space to a XFS_TRANS_PERM_LOG_RES request for two transactions worth.
    
    Fixes: 3d3c8b52 ("xfs: refactor xfs_trans_reserve() interface")
    Complements: 929b92f6 ("xfs: xfs_defer_capture should absorb remaining transaction reservation")
    Suggested-by: default avatarWengang Wang <wen.gang.wang@oracle.com>
    Cc: Srikanth C S <srikanth.c.s@oracle.com>
    [djwong: apply the same transformation to all log intent recovery]
    Signed-off-by: default avatarDarrick J. Wong <djwong@kernel.org>
    Reviewed-by: default avatarDave Chinner <dchinner@redhat.com>
    3c919b09
xfs_bmap_item.c 19.1 KB