• Filipe Manana's avatar
    Btrfs: fix hole detection during file fsync · 74121f7c
    Filipe Manana authored
    The file hole detection logic during a file fsync wasn't correct,
    because it didn't look back (in a previous leaf) for the last file
    extent item that can be in a leaf to the left of our leaf and that
    has a generation lower than the current transaction id. This made it
    assume that a hole exists when it really doesn't exist in the file.
    
    Such false positive hole detection happens in the following scenario:
    
    * We have a file that has many file extent items, covering 3 or more
      btree leafs (the first leaf must contain non file extent items too).
    
    * Two ranges of the file are modified, with their extent items being
      located at 2 different leafs and those leafs aren't consecutive.
    
    * When processing the second modified leaf, we weren't checking if
      some file extent item exists that is located in some leaf that is
      between our 2 modified leafs, and therefore assumed the range defined
      between the last file extent item in the first leaf and the first file
      extent item in the second leaf matched a hole.
    
    Fortunately this didn't result in overriding the log with wrong data,
    instead it made the last loop in copy_items() attempt to insert a
    duplicated key (for a hole file extent item), which makes the file
    fsync code return with -EEXIST to file.c:btrfs_sync_file() which in
    turn ends up doing a full transaction commit, which is much more
    expensive then writing only to the log tree and wait for it to be
    durably persisted (as well as the file's modified extents/pages).
    Therefore fix the hole detection logic, so that we don't pay the
    cost of doing full transaction commits.
    
    I could trigger this issue with the following test for xfstests (which
    never fails, either without or with this patch). The last fsync call
    results in a full transaction commit, due to the -EEXIST error mentioned
    above. I could also observe this behaviour happening frequently when
    running xfstests/generic/075 in a loop.
    
    Test:
    
        _cleanup()
        {
            _cleanup_flakey
            rm -fr $tmp
        }
    
        # get standard environment, filters and checks
        . ./common/rc
        . ./common/filter
        . ./common/dmflakey
    
        # real QA test starts here
        _supported_fs btrfs
        _supported_os Linux
        _require_scratch
        _require_dm_flakey
        _need_to_be_root
    
        rm -f $seqres.full
    
        # Create a file with many file extent items, each representing a 4Kb extent.
        # These items span 3 btree leaves, of 16Kb each (default mkfs.btrfs leaf size
        # as of btrfs-progs 3.12).
        _scratch_mkfs -l 16384 >/dev/null 2>&1
        _init_flakey
        SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
        MOUNT_OPTIONS="$MOUNT_OPTIONS -o commit=999"
        _mount_flakey
    
        # First fsync, inode has BTRFS_INODE_NEEDS_FULL_SYNC flag set.
        $XFS_IO_PROG -f -c "pwrite -S 0x01 -b 4096 0 4096" -c "fsync" \
                $SCRATCH_MNT/foo | _filter_xfs_io
    
        # For any of the following fsync calls, inode doesn't have the flag
        # BTRFS_INODE_NEEDS_FULL_SYNC set.
        for ((i = 1; i <= 500; i++)); do
            OFFSET=$((4096 * i))
            LEN=4096
            $XFS_IO_PROG -c "pwrite -S 0x01 $OFFSET $LEN" -c "fsync" \
                    $SCRATCH_MNT/foo | _filter_xfs_io
        done
    
        # Commit transaction and bump next transaction's id (to 7).
        sync
    
        # Truncate will set the BTRFS_INODE_NEEDS_FULL_SYNC flag in the btrfs's
        # inode runtime flags.
        $XFS_IO_PROG -c "truncate 2048000" $SCRATCH_MNT/foo
    
        # Commit transaction and bump next transaction's id (to 8).
        sync
    
        # Touch 1 extent item from the first leaf and 1 from the last leaf. The leaf
        # in the middle, containing only file extent items, isn't touched. So the
        # next fsync, when calling btrfs_search_forward(), won't visit that middle
        # leaf. First and 3rd leaf have now a generation with value 8, while the
        # middle leaf remains with a generation with value 6.
        $XFS_IO_PROG \
            -c "pwrite -S 0xee -b 4096 0 4096" \
            -c "pwrite -S 0xff -b 4096 2043904 4096" \
            -c "fsync" \
            $SCRATCH_MNT/foo | _filter_xfs_io
    
        _load_flakey_table $FLAKEY_DROP_WRITES
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        _unmount_flakey
    
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        # During mount, we'll replay the log created by the fsync above, and the file's
        # md5 digest should be the same we got before the unmount.
        _mount_flakey
        md5sum $SCRATCH_MNT/foo | _filter_scratch
        _unmount_flakey
        MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"
    
        status=0
        exit
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarChris Mason <clm@fb.com>
    74121f7c
tree-log.c 116 KB