• Dave Chinner's avatar
    xfs: fix AGF vs inode cluster buffer deadlock · 82842fee
    Dave Chinner authored
    Lock order in XFS is AGI -> AGF, hence for operations involving
    inode unlinked list operations we always lock the AGI first. Inode
    unlinked list operations operate on the inode cluster buffer,
    so the lock order there is AGI -> inode cluster buffer.
    
    For O_TMPFILE operations, this now means the lock order set down in
    xfs_rename and xfs_link is AGI -> inode cluster buffer -> AGF as the
    unlinked ops are done before the directory modifications that may
    allocate space and lock the AGF.
    
    Unfortunately, we also now lock the inode cluster buffer when
    logging an inode so that we can attach the inode to the cluster
    buffer and pin it in memory. This creates a lock order of AGF ->
    inode cluster buffer in directory operations as we have to log the
    inode after we've allocated new space for it.
    
    This creates a lock inversion between the AGF and the inode cluster
    buffer. Because the inode cluster buffer is shared across multiple
    inodes, the inversion is not specific to individual inodes but can
    occur when inodes in the same cluster buffer are accessed in
    different orders.
    
    To fix this we need move all the inode log item cluster buffer
    interactions to the end of the current transaction. Unfortunately,
    xfs_trans_log_inode() calls are littered throughout the transactions
    with no thought to ordering against other items or locking. This
    makes it difficult to do anything that involves changing the call
    sites of xfs_trans_log_inode() to change locking orders.
    
    However, we do now have a mechanism that allows is to postpone dirty
    item processing to just before we commit the transaction: the
    ->iop_precommit method. This will be called after all the
    modifications are done and high level objects like AGI and AGF
    buffers have been locked and modified, thereby providing a mechanism
    that guarantees we don't lock the inode cluster buffer before those
    high level objects are locked.
    
    This change is largely moving the guts of xfs_trans_log_inode() to
    xfs_inode_item_precommit() and providing an extra flag context in
    the inode log item to track the dirty state of the inode in the
    current transaction. This also means we do a lot less repeated work
    in xfs_trans_log_inode() by only doing it once per transaction when
    all the work is done.
    
    Fixes: 298f7bec ("xfs: pin inode backing buffer to the inode log item")
    Signed-off-by: default avatarDave Chinner <dchinner@redhat.com>
    Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
    Reviewed-by: default avatarDarrick J. Wong <djwong@kernel.org>
    Signed-off-by: default avatarDave Chinner <david@fromorbit.com>
    82842fee
xfs_inode_item.c 33.2 KB