• Filipe Manana's avatar
    btrfs: skip unnecessary extent buffer sharedness checks during fiemap · b8f164e3
    Filipe Manana authored
    During fiemap, for each file extent we find, we must check if it's shared
    or not. The sharedness check starts by verifying if the extent is directly
    shared (its refcount in the extent tree is > 1), and if it is not directly
    shared, then we will check if every node in the subvolume b+tree leading
    from the root to the leaf that has the file extent item (in reverse order),
    is shared (through snapshots).
    
    However this second step is not needed if our extent was created in a
    transaction more recent than the last transaction where a snapshot of the
    inode's root happened, because it can't be shared indirectly (through
    shared subtrees) without a snapshot created in a more recent transaction.
    
    So grab the generation of the extent from the extent map and pass it to
    btrfs_is_data_extent_shared(), which will skip this second phase when the
    generation is more recent than the root's last snapshot value. Note that
    we skip this optimization if the extent map is the result of merging 2
    or more extent maps, because in this case its generation is the maximum
    of the generations of all merged extent maps.
    
    The fact the we use extent maps and they can be merged despite the
    underlying extents being distinct (different file extent items in the
    subvolume b+tree and different extent items in the extent b+tree), can
    result in some bugs when reporting shared extents. But this is a problem
    of the current implementation of fiemap relying on extent maps.
    One example where we get incorrect results is:
    
        $ cat fiemap-bug.sh
        #!/bin/bash
    
        DEV=/dev/sdj
        MNT=/mnt/sdj
    
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
    
        # Create a file with two 256K extents.
        # Since there is no other write activity, they will be contiguous,
        # and their extent maps merged, despite having two distinct extents.
        xfs_io -f -c "pwrite -S 0xab 0 256K" \
                  -c "fsync" \
                  -c "pwrite -S 0xcd 256K 256K" \
                  -c "fsync" \
                  $MNT/foo
    
        # Now clone only the second extent into another file.
        xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
    
        # Filefrag will report a single 512K extent, and say it's not shared.
        echo
        filefrag -v $MNT/foo
    
        umount $MNT
    
    Running the reproducer:
    
        $ ./fiemap-bug.sh
        wrote 262144/262144 bytes at offset 0
        256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
        wrote 262144/262144 bytes at offset 262144
        256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
        linked 262144/262144 bytes at offset 0
        256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
    
        Filesystem type is: 9123683e
        File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..     127:       3328..      3455:    128:             last,eof
        /mnt/sdj/foo: 1 extent found
    
    We end up reporting that we have a single 512K that is not shared, however
    we have two 256K extents, and the second one is shared. Changing the
    reproducer to clone instead the first extent into file 'bar', makes us
    report a single 512K extent that is shared, which is algo incorrect since
    we have two 256K extents and only the first one is shared.
    
    This is z problem that existed before this change, and remains after this
    change, as it can't be easily fixed. The next patch in the series reworks
    fiemap to primarily use file extent items instead of extent maps (except
    for checking for delalloc ranges), with the goal of improving its
    scalability and performance, but it also ends up fixing this particular
    bug caused by extent map merging.
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    b8f164e3
backref.h 10.8 KB