    xfs: only run COW extent recovery when there are no live extents · 7993f1a4
    Darrick J. Wong authored
    As part of multiple customer escalations due to file data corruption
    after copy on write operations, I wrote some fstests that use fsstress
    to hammer on COW to shake things loose.  Regrettably, I caught some
    filesystem shutdowns due to incorrect rmap operations with the following
    loop:
    
    mount <filesystem>				# (0)
    fsstress <run only readonly ops> &		# (1)
    while true; do
    	fsstress <run all ops>
    	mount -o remount,ro			# (2)
    	fsstress <run only readonly ops>
    	mount -o remount,rw			# (3)
    done
    
    When (2) happens, notice that (1) is still running.  xfs_remount_ro will
    call xfs_blockgc_stop to walk the inode cache to free all the COW
    extents, but the blockgc mechanism races with (1)'s reader threads to
    take IOLOCKs and loses, which means that it doesn't clean them all out.
    Call such a file (A).
    
    When (3) happens, xfs_remount_rw calls xfs_reflink_recover_cow, which
    walks the ondisk refcount btree and frees any COW extent that it finds.
    This function does not check the inode cache, which means that the incore
    COW fork of inode (A) is now inconsistent with the ondisk metadata.  If
    one of those former COW extents is allocated and mapped into another
    file (B) and someone triggers a COW to the stale reservation in (A), A's
    dirty data will be written into (B) and once that's done, those blocks
    will be transferred to (A)'s data fork without bumping the refcount.
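
    The sequence above can be sketched with a toy model.  This is only an
    illustration, not XFS code: every name here (struct toy_fs, toy_alloc,
    toy_recover_cow, and so on) is hypothetical, the "filesystem" is a
    four-block array, and allocation is assumed to pick the lowest free
    block so that the reuse is deterministic.

```c
#include <assert.h>

#define TOY_NBLOCKS 4
#define TOY_NONE (-1)

/* Hypothetical stand-in for what the ondisk metadata says about a block. */
enum toy_owner { TOY_FREE, TOY_COW_STAGING, TOY_FILE_A, TOY_FILE_B };

struct toy_fs {
	enum toy_owner ondisk[TOY_NBLOCKS];	/* ondisk view of each block */
	char data[TOY_NBLOCKS];			/* block contents */
	int a_cow_blk;				/* block in A's incore COW fork */
};

void toy_init(struct toy_fs *fs)
{
	for (int i = 0; i < TOY_NBLOCKS; i++) {
		fs->ondisk[i] = TOY_FREE;
		fs->data[i] = 0;
	}
	fs->a_cow_blk = TOY_NONE;
}

/* Assumption: the allocator hands out the lowest-numbered free block. */
int toy_alloc(struct toy_fs *fs, enum toy_owner owner)
{
	for (int i = 0; i < TOY_NBLOCKS; i++) {
		if (fs->ondisk[i] == TOY_FREE) {
			fs->ondisk[i] = owner;
			return i;
		}
	}
	return TOY_NONE;
}

/* File (A) sets up a COW reservation that is tracked incore and ondisk. */
void toy_a_reserve_cow(struct toy_fs *fs)
{
	fs->a_cow_blk = toy_alloc(fs, TOY_COW_STAGING);
}

/*
 * Model of recovery: free every ondisk staging block.  Like the routine
 * described above, it never looks at incore state, so a_cow_blk goes stale.
 */
void toy_recover_cow(struct toy_fs *fs)
{
	for (int i = 0; i < TOY_NBLOCKS; i++)
		if (fs->ondisk[i] == TOY_COW_STAGING)
			fs->ondisk[i] = TOY_FREE;
}

/* File (B) allocates a block and writes its own data into it. */
int toy_b_write(struct toy_fs *fs, char c)
{
	int blk = toy_alloc(fs, TOY_FILE_B);

	if (blk != TOY_NONE)
		fs->data[blk] = c;
	return blk;
}

/*
 * File (A) completes a COW write through its (possibly stale) reservation:
 * the data lands in the block and the block moves to A's data fork without
 * any check of who owns it now.
 */
void toy_a_cow_write(struct toy_fs *fs, char c)
{
	int blk = fs->a_cow_blk;

	if (blk == TOY_NONE)
		return;
	fs->data[blk] = c;
	fs->ondisk[blk] = TOY_FILE_A;
	fs->a_cow_blk = TOY_NONE;
}
```

    Calling toy_a_reserve_cow, toy_recover_cow, toy_b_write and then
    toy_a_cow_write in that order leaves (B)'s block holding (A)'s data and
    owned by (A), which mirrors the corruption described above.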
    
    The results are catastrophic -- file (B) and the refcount btree are now
    corrupt.  In the first patch, we fixed the race condition in (2) so that
    (A) will always flush the COW fork.  In this second patch, we move the
    _recover_cow call to the initial mount call in (0) for safety.
    
    As mentioned previously, xfs_reflink_recover_cow walks the refcount
    btree looking for COW staging extents, and frees them.  This was
    intended to be run at mount time (when we know there are no live inodes)
    to clean up any leftover staging extents that may have been left behind
    during an unclean shutdown.  As a time "optimization" for readonly
    mounts, we deferred this to the ro->rw transition, not realizing that
    any failure to clean all COW forks during a rw->ro transition would
    result in catastrophic corruption.
    
    Therefore, remove this optimization and only run the recovery routine
    when we're guaranteed not to have any COW staging extents anywhere,
    which means we always run this at mount time.  While we're at it, move
    the callsite to xfs_log_mount_finish because any refcount btree
    expansion (however unlikely given that we're removing records from the
    right side of the index) must be fed by a per-AG reservation, which
    doesn't exist in its current location.
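
    The change in when recovery runs can be sketched as a small state
    model.  This is a hedged illustration rather than the actual patch:
    struct toy_mount and the toy_mnt_* helpers are hypothetical names, and
    the defer_recovery flag stands in for the old behavior of running
    recovery on the ro->rw transition.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mount {
	bool defer_recovery;	/* true models the old remount-time recovery */
	bool readonly;
	int live_cow_forks;	/* incore COW forks still holding extents */
	int recoveries_run;
	int unsafe_recoveries;	/* recoveries run while COW forks were live */
};

/* Model of the recovery routine: unsafe if any incore COW fork is live. */
void toy_mnt_recover(struct toy_mount *m)
{
	m->recoveries_run++;
	if (m->live_cow_forks > 0)
		m->unsafe_recoveries++;
}

/*
 * Initial mount: no inodes are live yet.  The old scheme skipped recovery
 * on readonly mounts; the new scheme always runs it here.
 */
void toy_mnt_mount(struct toy_mount *m, bool readonly, bool defer_recovery)
{
	m->defer_recovery = defer_recovery;
	m->readonly = readonly;
	m->live_cow_forks = 0;
	m->recoveries_run = 0;
	m->unsafe_recoveries = 0;
	if (!(readonly && defer_recovery))
		toy_mnt_recover(m);
}

/*
 * rw->ro transition.  leaked_cow_forks models COW forks that the cleanup
 * pass failed to flush, for example because it lost a lock race.
 */
void toy_mnt_remount_ro(struct toy_mount *m, int leaked_cow_forks)
{
	m->readonly = true;
	m->live_cow_forks = leaked_cow_forks;
}

/* ro->rw transition: only the old scheme runs recovery here. */
void toy_mnt_remount_rw(struct toy_mount *m)
{
	m->readonly = false;
	if (m->defer_recovery)
		toy_mnt_recover(m);
}
```

    With defer_recovery set, a remount cycle that leaks even one COW fork
    records an unsafe recovery.  With it clear, recovery runs exactly once,
    at mount time, regardless of what the remount cycle leaves behind.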
    
    Fixes: 174edb0e ("xfs: store in-progress CoW allocations in the refcount btree")
    Signed-off-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
    Reviewed-by: Dave Chinner <dchinner@redhat.com>