    btrfs: fix race when detecting delalloc ranges during fiemap · 978b63f7
    Filipe Manana authored
    For fiemap we recently stopped locking the target extent range for the
    whole duration of the fiemap call, in order to avoid a deadlock in a
    scenario where the fiemap buffer happens to be a memory mapped range of
    the same file. This use case is very unlikely to be useful in practice but
    it may be triggered by fuzz testing (syzbot, etc).
    
    This however introduced a race that makes us miss delalloc ranges for
    file regions that are currently holes, so the caller of fiemap will not
    be aware that there's data for some file regions. This can be quite
    serious for some use cases - for example in coreutils versions before 9.0,
    the cp program used fiemap to detect holes and data in the source file,
    copying only regions with data (extents or delalloc) from the source file
    to the destination file in order to preserve holes (see the documentation
    for its --sparse command line option). This means that if cp was used
    with a source file that had delalloc in a hole, the destination file could
    end up without that data, which is effectively a data loss issue, if it
    happened to hit the race described below.
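
To illustrate why a missed range turns into lost data for a fiemap consumer, here is a minimal sketch of hole inference from a list of reported data ranges. It is not the coreutils code; `struct data_range` and `find_holes()` are hypothetical names, and the sketch assumes the ranges arrive sorted and non-overlapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one reported data range. */
struct data_range {
	uint64_t start;	/* byte offset where the range begins */
	uint64_t len;	/* length of the range in bytes */
};

/*
 * Walk the data ranges in offset order and record every gap as a hole.
 * Returns the number of holes found; at most max_holes are stored.
 */
static size_t find_holes(const struct data_range *data, size_t nr_data,
			 uint64_t file_size,
			 struct data_range *holes, size_t max_holes)
{
	uint64_t pos = 0;	/* end of the data seen so far */
	size_t nr_holes = 0;

	for (size_t i = 0; i < nr_data; i++) {
		/* Assumption: input is sorted and non-overlapping. */
		assert(data[i].start >= pos);
		if (data[i].start > pos) {
			if (nr_holes < max_holes) {
				holes[nr_holes].start = pos;
				holes[nr_holes].len = data[i].start - pos;
			}
			nr_holes++;
		}
		pos = data[i].start + data[i].len;
	}
	/* Trailing hole between the last data range and end of file. */
	if (pos < file_size) {
		if (nr_holes < max_holes) {
			holes[nr_holes].start = pos;
			holes[nr_holes].len = file_size - pos;
		}
		nr_holes++;
	}
	return nr_holes;
}
```

A copier that skips everything `find_holes()` returns will skip the range [64M, 65M[ whenever fiemap fails to report the delalloc there, which is how the race becomes data loss in the destination file.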
    
    The race happens like this:
    
    1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
       has delalloc in the file range [64M, 65M[, which is currently a hole;
    
    2) Fiemap locks the inode in shared mode, then starts iterating the
       inode's subvolume tree searching for file extent items, without having
       the whole fiemap target range locked in the inode's io tree - the
       change introduced recently by commit b0ad381f ("btrfs: fix
       deadlock with fiemap and extent locking"). It only locks ranges in
       the io tree when it finds a hole or prealloc extent since that
       commit;
    
    3) Note that fiemap clones each leaf before using it, and this is to
       avoid deadlocks when locking a file range in the inode's io tree and
       the fiemap buffer is memory mapped to some file, because writing
       to the page with btrfs_page_mkwrite() will wait on any ordered extent
       for the page's range and the ordered extent needs to lock the range
       and may need to modify the same leaf, therefore leading to a deadlock
       on the leaf;
    
    4) While iterating the file extent items in the cloned leaf before
       finding the hole in the range [64M, 65M[, the delalloc in that range
       is flushed and its ordered extent completes - meaning the corresponding
       file extent item is in the inode's subvolume tree, but not present in
       the cloned leaf that fiemap is iterating over;
    
    5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
       the cloned leaf (or a file extent item with disk_bytenr == 0 in case
       the NO_HOLES feature is not enabled), it will lock that file range in
       the inode's io tree and then search for delalloc by checking for the
       EXTENT_DELALLOC bit in the io tree for that range and ordered extents
       (with btrfs_find_delalloc_in_range()). But it finds nothing since the
       delalloc in that range was already flushed and the ordered extent
       completed and is gone - as a result fiemap will not report that there's
       delalloc or an extent for the range [64M, 65M[, so user space will be
       misled into thinking that there's a hole in that range.
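
The sequence above can be modelled with a toy, single-threaded sketch. It is not kernel code: `struct range_state`, `flush_and_complete()` and `fiemap_reports_data()` are hypothetical names that only mimic the ordering of steps 3 to 5 for one file range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one file range, e.g. [64M, 65M[. */
struct range_state {
	bool extent_in_tree;	/* a file extent item exists for the range */
	bool delalloc;		/* delalloc or a pending ordered extent exists */
};

/* Delalloc flush plus ordered extent completion: data moves into the tree. */
static void flush_and_complete(struct range_state *live)
{
	if (live->delalloc) {
		live->delalloc = false;
		live->extent_in_tree = true;
	}
}

/*
 * Racy lookup: `snapshot` plays the role of the cloned leaf taken earlier,
 * `live` is the current state, consulted only for the delalloc check.
 * Returns true when the range is reported as data.
 */
static bool fiemap_reports_data(const struct range_state *snapshot,
				const struct range_state *live)
{
	assert(snapshot != NULL && live != NULL);
	if (snapshot->extent_in_tree)
		return true;
	/* Gap in the stale leaf: fall back to the delalloc check. */
	return live->delalloc;
}
```

If `flush_and_complete()` runs between taking the snapshot and calling `fiemap_reports_data()`, both checks come back empty and the range is reported as a hole even though the live state holds an extent.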
    
    This could actually be sporadically triggered with test case generic/094
    from fstests, which reports a missing extent/delalloc range like this:
    
      generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
          --- tests/generic/094.out	2020-06-10 19:29:03.830519425 +0100
          +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad	2024-02-28 11:00:00.381071525 +0000
          @@ -1,3 +1,9 @@
           QA output created by 094
           fiemap run with sync
           fiemap run without sync
          +ERROR: couldn't find extent at 7
          +map is 'HHDDHPPDPHPH'
          +logical: [       5..       6] phys:   301517..  301518 flags: 0x800 tot: 2
          +logical: [       8..       8] phys:   301520..  301520 flags: 0x800 tot: 1
          ...
          (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad'  to see the entire diff)
    
    So in order to fix this, while still avoiding deadlocks in the case where
    the fiemap buffer is memory mapped to the same file, change fiemap to work
    like the following:
    
    1) Always lock the whole range in the inode's io tree before starting to
       iterate the inode's subvolume tree searching for file extent items,
       just like we did before commit b0ad381f ("btrfs: fix deadlock with
       fiemap and extent locking");
    
    2) Now instead of writing to the fiemap buffer every time we have an extent
       to report, write instead to a temporary buffer (1 page), and when that
       buffer becomes full, stop iterating the file extent items, unlock the
       range in the io tree, release the search path, submit all the entries
       kept in that buffer to the fiemap buffer, and then resume the search
       for file extent items after locking again the remainder of the range in
       the io tree.
    
       With a size of one page, the buffer holds 146 entries on a system
       with 4K pages. This is large enough to perform well, since it avoids
       too many restarts of the search for file extent items.
       In other words, this preserves the huge performance gains made to
       fiemap in the last two years, while avoiding the deadlocks in case the
       fiemap buffer is memory mapped to the same file (useless in practice,
       but possible and exercised by fuzz testing and syzbot).
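
The batching scheme from step 2 can be sketched in user-space C as follows. This is an illustration under stated assumptions, not the patch itself: `struct entry`, `struct entry_cache`, `lock_range()`, `unlock_range()` and `report_extents()` are hypothetical stand-ins, and the capacity is kept tiny so the restarts are easy to see (the commit sizes the real buffer at one page).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_CAPACITY 4	/* tiny on purpose; see the note above */

/* Hypothetical entry and cache types, not the kernel's definitions. */
struct entry {
	uint64_t offset;
	uint64_t len;
};

struct entry_cache {
	struct entry entries[CACHE_CAPACITY];
	size_t nr;		/* entries currently buffered */
	size_t nr_flushes;	/* how many times the cache was drained */
};

/* Stand-ins for locking the file range in the inode's io tree. */
static int range_locked;
static void lock_range(void)   { range_locked = 1; }
static void unlock_range(void) { range_locked = 0; }

/* Copy buffered entries to the destination; only legal with the range unlocked. */
static size_t flush_cache(struct entry_cache *cache, struct entry *dst,
			  size_t dst_used)
{
	assert(!range_locked);
	if (cache->nr > 0)
		cache->nr_flushes++;
	for (size_t i = 0; i < cache->nr; i++)
		dst[dst_used++] = cache->entries[i];
	cache->nr = 0;
	return dst_used;
}

/* Report every entry in src to dst in batches; returns the count written. */
static size_t report_extents(const struct entry *src, size_t nr_src,
			     struct entry *dst, struct entry_cache *cache)
{
	size_t next = 0, written = 0;

	while (next < nr_src) {
		lock_range();	/* lock the remainder of the range */
		while (next < nr_src && cache->nr < CACHE_CAPACITY)
			cache->entries[cache->nr++] = src[next++];
		unlock_range();	/* drop the lock before touching dst */
		written = flush_cache(cache, dst, written);
	}
	return written;
}
```

The point of the design is that the destination, which may be backed by a mapping of the same file, is only written while the range is unlocked, and the search restarts once per full batch rather than once per extent.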
    
    Fixes: b0ad381f ("btrfs: fix deadlock with fiemap and extent locking")
    Reviewed-by: Josef Bacik <josef@toxicpanda.com>
    Signed-off-by: Filipe Manana <fdmanana@suse.com>
    Signed-off-by: David Sterba <dsterba@suse.com>