• Filipe Manana's avatar
    btrfs: make fiemap more efficient and accurate reporting extent sharedness · ac3c0d36
    Filipe Manana authored
    The current fiemap implementation does not scale very well with the number
    of extents a file has. This is both because the main algorithm to find out
    the extents has a high algorithmic complexity and because for each extent
    we have to check if it's shared. This second part, checking if an extent
    is shared, is significantly improved by the two previous patches in this
    patchset, while the first part is improved by this specific patch. Every
    now and then we get reports from users mentioning fiemap is too slow or
    even unusable for files with a very large number of extents, such as the
    two recent reports referred to by the Link tags at the bottom of this
    change log.
    
    To understand why the part of finding which extents a file has is very
    inefficient, consider the example of doing a full ranged fiemap against
    a file that has over 100K extents (normal for example for a file with
    more than 10G of data and using compression, which limits the extent size
    to 128K). When we enter fiemap at extent_fiemap(), the following happens:
    
    1) Before entering the main loop, we call get_extent_skip_holes() to get
       the first extent map. This leads us to btrfs_get_extent_fiemap(), which
       in turn calls btrfs_get_extent(), to find the first extent map that
       covers the file range [0, LLONG_MAX).
    
       btrfs_get_extent() will first search the inode's extent map tree, to
       see if we have an extent map there that covers the range. If it does
       not find one, then it will search the inode's subvolume b+tree for a
       fitting file extent item. After finding the file extent item, it will
       allocate an extent map, fill it in with information extracted from the
       file extent item, and add it to the inode's extent map tree (which
       requires a search for insertion in the tree).
    
    2) Then we enter the main loop at extent_fiemap(), emit the details of
       the extent, and call again get_extent_skip_holes(), with a start
       offset matching the end of the extent map we previously processed.
    
       We end up at btrfs_get_extent() again, will search the extent map tree
       and then search the subvolume b+tree for a file extent item if we could
       not find an extent map in the extent tree. We allocate an extent map,
       fill it in with the details in the file extent item, and then insert
       it into the extent map tree (yet another search in this tree).
    
    3) The second step is repeated over and over, until we have processed the
       whole file range. Each iteration ends at btrfs_get_extent(), which
       does a red black tree search on the extent map tree, then searches the
       subvolume b+tree, allocates an extent map and then does another search
       in the extent map tree in order to insert the extent map.
    
       In the best scenario we have all the extent maps already in the extent
       tree, and so for each extent we do a single search on a red black tree,
       so we have a complexity of O(n log n).
    
       In the worst scenario we don't have any extent map already loaded in
       the extent map tree, or have very few already there. In this case the
       complexity is much higher since we do:
    
       - A red black tree search on the extent map tree, which has O(log n)
         complexity, initially very fast since the tree is empty or very
         small, but as we end up allocating extent maps and adding them to
         the tree when we don't find them there, each subsequent search on
         the tree gets slower, since it's getting bigger and bigger after
         each iteration.
    
       - A search on the subvolume b+tree, also O(log n) complexity, but it
         has items for all inodes in the subvolume, not just items for our
         inode. Plus on a filesystem with concurrent operations on other
         inodes, we can block doing the search due to lock contention on
         b+tree nodes/leaves.
    
       - Allocate an extent map - this can block, and can also fail if we
         are under serious memory pressure.
    
       - Do another search on the extent maps red black tree, with the goal
         of inserting the extent map we just allocated. Again, after every
         iteration this tree is getting bigger by 1 element, so after many
         iterations the searches are slower and slower.
    
       - We will not need the allocated extent map anymore, so it's pointless
         to add it to the extent map tree. It's just wasting time and memory.
    
       In short we end up searching the extent map tree multiple times, on a
       tree that is growing bigger and bigger after each iteration. And
       besides that we visit the same leaf of the subvolume b+tree many times,
       since a leaf with the default size of 16K can easily have more than 200
       file extent items.
    
    This is very inefficient overall. This patch changes the algorithm to
    instead iterate over the subvolume b+tree, visiting each leaf only once,
    and only searching in the extent map tree for file ranges that have holes
    or prealloc extents, in order to figure out if we have delalloc there.
    It will never allocate an extent map and add it to the extent map tree.
    This is very similar to what was previously done for the lseek's hole and
    data seeking features.
    
    Also, the current implementation relying on extent maps for figuring out
    which extents we have is not correct. This is because extent maps can be
    merged even if they represent different extents - we do this to minimize
    memory utilization and keep extent map trees smaller. For example if we
    have two extents that are contiguous on disk, once we load the two extent
    maps, they get merged into a single one - however if only one of the
    extents is shared, we end up reporting both as shared or both as not
    shared, which is incorrect.
    
    This reproducer triggers that bug:
    
        $ cat fiemap-bug.sh
        #!/bin/bash
    
        DEV=/dev/sdj
        MNT=/mnt/sdj
    
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
    
        # Create a file with two 256K extents.
        # Since there is no other write activity, they will be contiguous,
        # and their extent maps merged, despite having two distinct extents.
        xfs_io -f -c "pwrite -S 0xab 0 256K" \
                  -c "fsync" \
                  -c "pwrite -S 0xcd 256K 256K" \
                  -c "fsync" \
                  $MNT/foo
    
        # Now clone only the second extent into another file.
        xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
    
        # Filefrag will report a single 512K extent, and say it's not shared.
        echo
        filefrag -v $MNT/foo
    
        umount $MNT
    
    Running the reproducer:
    
        $ ./fiemap-bug.sh
        wrote 262144/262144 bytes at offset 0
        256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
        wrote 262144/262144 bytes at offset 262144
        256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
        linked 262144/262144 bytes at offset 0
        256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
    
        Filesystem type is: 9123683e
        File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
         ext:     logical_offset:        physical_offset: length:   expected: flags:
           0:        0..     127:       3328..      3455:    128:             last,eof
        /mnt/sdj/foo: 1 extent found
    
    We end up reporting that we have a single 512K that is not shared, however
    we have two 256K extents, and the second one is shared. Changing the
    reproducer to clone instead the first extent into file 'bar', makes us
    report a single 512K extent that is shared, which is algo incorrect since
    we have two 256K extents and only the first one is shared.
    
    This patch is part of a larger patchset that is comprised of the following
    patches:
    
        btrfs: allow hole and data seeking to be interruptible
        btrfs: make hole and data seeking a lot more efficient
        btrfs: remove check for impossible block start for an extent map at fiemap
        btrfs: remove zero length check when entering fiemap
        btrfs: properly flush delalloc when entering fiemap
        btrfs: allow fiemap to be interruptible
        btrfs: rename btrfs_check_shared() to a more descriptive name
        btrfs: speedup checking for extent sharedness during fiemap
        btrfs: skip unnecessary extent buffer sharedness checks during fiemap
        btrfs: make fiemap more efficient and accurate reporting extent sharedness
    
    The patchset was tested on a machine running a non-debug kernel (Debian's
    default config) and compared the tests below on a branch without the
    patchset versus the same branch with the whole patchset applied.
    
    The following test for a large compressed file without holes:
    
        $ cat fiemap-perf-test.sh
        #!/bin/bash
    
        DEV=/dev/sdi
        MNT=/mnt/sdi
    
        mkfs.btrfs -f $DEV
        mount -o compress=lzo $DEV $MNT
    
        # 40G gives 327680 128K file extents (due to compression).
        xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
    
        umount $MNT
        mount -o compress=lzo $DEV $MNT
    
        start=$(date +%s%N)
        filefrag $MNT/foobar
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
        echo "fiemap took $dur milliseconds (metadata not cached)"
    
        start=$(date +%s%N)
        filefrag $MNT/foobar
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
        echo "fiemap took $dur milliseconds (metadata cached)"
    
        umount $MNT
    
    Before patchset:
    
        $ ./fiemap-perf-test.sh
        (...)
        /mnt/sdi/foobar: 327680 extents found
        fiemap took 3597 milliseconds (metadata not cached)
        /mnt/sdi/foobar: 327680 extents found
        fiemap took 2107 milliseconds (metadata cached)
    
    After patchset:
    
        $ ./fiemap-perf-test.sh
        (...)
        /mnt/sdi/foobar: 327680 extents found
        fiemap took 1214 milliseconds (metadata not cached)
        /mnt/sdi/foobar: 327680 extents found
        fiemap took 684 milliseconds (metadata cached)
    
    That's a speedup of about 3x for both cases (no metadata cached and all
    metadata cached).
    
    The test provided by Pavel (first Link tag at the bottom), which uses
    files with a large number of holes, was also used to measure the gains,
    and it consists on a small C program and a shell script to invoke it.
    The C program is the following:
    
        $ cat pavels-test.c
        #include <stdio.h>
        #include <unistd.h>
        #include <stdlib.h>
        #include <fcntl.h>
    
        #include <sys/stat.h>
        #include <sys/time.h>
        #include <sys/ioctl.h>
    
        #include <linux/fs.h>
        #include <linux/fiemap.h>
    
        #define FILE_INTERVAL (1<<13) /* 8Kb */
    
        long long interval(struct timeval t1, struct timeval t2)
        {
            long long val = 0;
            val += (t2.tv_usec - t1.tv_usec);
            val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
            return val;
        }
    
        int main(int argc, char **argv)
        {
            struct fiemap fiemap = {};
            struct timeval t1, t2;
            char data = 'a';
            struct stat st;
            int fd, off, file_size = FILE_INTERVAL;
    
            if (argc != 3 && argc != 2) {
                    printf("usage: %s <path> [size]\n", argv[0]);
                    return 1;
            }
    
            if (argc == 3)
                    file_size = atoi(argv[2]);
            if (file_size < FILE_INTERVAL)
                    file_size = FILE_INTERVAL;
            file_size -= file_size % FILE_INTERVAL;
    
            fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) {
                perror("open");
                return 1;
            }
    
            for (off = 0; off < file_size; off += FILE_INTERVAL) {
                if (pwrite(fd, &data, 1, off) != 1) {
                    perror("pwrite");
                    close(fd);
                    return 1;
                }
            }
    
            if (ftruncate(fd, file_size)) {
                perror("ftruncate");
                close(fd);
                return 1;
            }
    
            if (fstat(fd, &st) < 0) {
                perror("fstat");
                close(fd);
                return 1;
            }
    
            printf("size: %ld\n", st.st_size);
            printf("actual size: %ld\n", st.st_blocks * 512);
    
            fiemap.fm_length = FIEMAP_MAX_OFFSET;
            gettimeofday(&t1, NULL);
            if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                perror("fiemap");
                close(fd);
                return 1;
            }
            gettimeofday(&t2, NULL);
    
            printf("fiemap: fm_mapped_extents = %d\n",
                   fiemap.fm_mapped_extents);
            printf("time = %lld us\n", interval(t1, t2));
    
            close(fd);
            return 0;
        }
    
        $ gcc -o pavels_test pavels_test.c
    
    And the wrapper shell script:
    
        $ cat fiemap-pavels-test.sh
    
        #!/bin/bash
    
        DEV=/dev/sdi
        MNT=/mnt/sdi
    
        mkfs.btrfs -f -O no-holes $DEV
        mount $DEV $MNT
    
        echo
        echo "*********** 256M ***********"
        echo
    
        ./pavels-test $MNT/testfile $((1 << 28))
        echo
        ./pavels-test $MNT/testfile $((1 << 28))
    
        echo
        echo "*********** 512M ***********"
        echo
    
        ./pavels-test $MNT/testfile $((1 << 29))
        echo
        ./pavels-test $MNT/testfile $((1 << 29))
    
        echo
        echo "*********** 1G ***********"
        echo
    
        ./pavels-test $MNT/testfile $((1 << 30))
        echo
        ./pavels-test $MNT/testfile $((1 << 30))
    
        umount $MNT
    
    Running his reproducer before applying the patchset:
    
        *********** 256M ***********
    
        size: 268435456
        actual size: 134217728
        fiemap: fm_mapped_extents = 32768
        time = 4003133 us
    
        size: 268435456
        actual size: 134217728
        fiemap: fm_mapped_extents = 32768
        time = 4895330 us
    
        *********** 512M ***********
    
        size: 536870912
        actual size: 268435456
        fiemap: fm_mapped_extents = 65536
        time = 30123675 us
    
        size: 536870912
        actual size: 268435456
        fiemap: fm_mapped_extents = 65536
        time = 33450934 us
    
        *********** 1G ***********
    
        size: 1073741824
        actual size: 536870912
        fiemap: fm_mapped_extents = 131072
        time = 224924074 us
    
        size: 1073741824
        actual size: 536870912
        fiemap: fm_mapped_extents = 131072
        time = 217239242 us
    
    Running it after applying the patchset:
    
        *********** 256M ***********
    
        size: 268435456
        actual size: 134217728
        fiemap: fm_mapped_extents = 32768
        time = 29475 us
    
        size: 268435456
        actual size: 134217728
        fiemap: fm_mapped_extents = 32768
        time = 29307 us
    
        *********** 512M ***********
    
        size: 536870912
        actual size: 268435456
        fiemap: fm_mapped_extents = 65536
        time = 58996 us
    
        size: 536870912
        actual size: 268435456
        fiemap: fm_mapped_extents = 65536
        time = 59115 us
    
        *********** 1G ***********
    
        size: 1073741824
        actual size: 536870912
        fiemap: fm_mapped_extents = 116251
        time = 124141 us
    
        size: 1073741824
        actual size: 536870912
        fiemap: fm_mapped_extents = 131072
        time = 119387 us
    
    The speedup is massive, both on the first fiemap call and on the second
    one as well, as his test creates files with many holes and small extents
    (every extent follows a hole and precedes another hole).
    
    For the 256M file we go from 4 seconds down to 29 milliseconds in the
    first run, and then from 4.9 seconds down to 29 milliseconds again in the
    second run, a speedup of 138x and 169x, respectively.
    
    For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
    first run, and then from 33.5 seconds down to 59 milliseconds again in the
    second run, a speedup of 510x and 568x, respectively.
    
    For the 1G file, we go from 225 seconds down to 124 milliseconds in the
    first run, and then from 217 seconds down to 119 milliseconds in the
    second run, a speedup of 1815x and 1824x, respectively.
    Reported-by: default avatarPavel Tikhomirov <ptikhomirov@virtuozzo.com>
    Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/Reported-by: default avatarDominique MARTINET <dominique.martinet@atmark-techno.com>
    Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    ac3c0d36
ctree.h 140 KB