1. 05 Dec, 2022 40 commits
    • Filipe Manana's avatar
      btrfs: use cached state when looking for delalloc ranges with lseek · 3c32c721
      Filipe Manana authored
      During lseek (SEEK_HOLE/DATA), whenever we find a hole or prealloc extent,
      we will look for delalloc in that range, and one of the things we do for
      that is to find out ranges in the inode's io_tree marked with
      EXTENT_DELALLOC, using calls to count_range_bits().
      
      Typically there's a single, or few, searches in the io_tree for delalloc
      per lseek call. However it's common for applications to keep calling
      lseek with SEEK_HOLE and SEEK_DATA to find where extents and holes are in
      a file, read the extents and skip holes in order to avoid unnecessary IO
      and save disk space by preserving holes.
      
      One popular user is the cp utility from coreutils. Starting with coreutils
      9.0, cp uses SEEK_HOLE and SEEK_DATA to iterate over the extents of a
      file. Before 9.0, it used fiemap to figure out where holes and extents are
      in the source file. Another popular user is the tar utility when used with
      the --sparse / -S option to detect and preserve holes.
      
      Given that the pattern is to keep calling lseek with a start offset that
      matches the returned offset from the previous lseek call, we can benefit
      from caching the last extent state visited in count_range_bits() and use
      it for the next count_range_bits() from the next lseek call. Example,
      the following strace excerpt from running tar:
      
         $ strace tar cJSvf foo.tar.xz qemu_disk_file.raw
         (...)
         lseek(5, 125019574272, SEEK_HOLE)       = 125024989184
         lseek(5, 125024989184, SEEK_DATA)       = 125024993280
         lseek(5, 125024993280, SEEK_HOLE)       = 125025239040
         lseek(5, 125025239040, SEEK_DATA)       = 125025255424
         lseek(5, 125025255424, SEEK_HOLE)       = 125025353728
         lseek(5, 125025353728, SEEK_DATA)       = 125025357824
         lseek(5, 125025357824, SEEK_HOLE)       = 125026766848
         lseek(5, 125026766848, SEEK_DATA)       = 125026770944
         lseek(5, 125026770944, SEEK_HOLE)       = 125027053568
         (...)
      
      Shows that pattern, which is the same as with cp from coreutils 9.0+.
      
      So start using a cached state for the delalloc searches in lseek, and
      store it in struct file's private data so that it can be reused across
      lseek calls.
      
      This change is part of a patchset that is comprised of the following
      patches:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      
      The following test was run before and after applying the whole patchset:
      
         $ cat test-cp.sh
         #!/bin/bash
      
         DEV=/dev/sdh
         MNT=/mnt/sdh
      
         # coreutils 8.32, cp uses fiemap to detect holes and extents
         #CP_PROG=/usr/bin/cp
         # coreutils 9.1, cp uses SEEK_HOLE/DATA to detect holes and extents
         CP_PROG=/home/fdmanana/git/hub/coreutils/src/cp
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1024 * 1024 * 1024))
         echo "Creating file with a size of $((FILE_SIZE / 1024 / 1024))M"
         # Create a very sparse file, where each extent has a length of 4K and
         # is preceded by a 4K hole and followed by another 4K hole.
         start=$(date +%s%N)
         echo -n > $MNT/foobar
         for ((off = 0; off < $FILE_SIZE; off += 8192)); do
                 xfs_io -c "pwrite -S 0xab $off 4K" $MNT/foobar > /dev/null
                 echo -ne "\r$off / $FILE_SIZE ..."
         done
         end=$(date +%s%N)
         echo -e "\nFile created ($(( (end - start) / 1000000 )) milliseconds)"
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds with data/metadata cached and delalloc"
      
         # Flush all delalloc.
         sync
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds with data/metadata cached and no delalloc"
      
         # Unmount and mount again to test the case without any metadata
         # loaded in memory.
         umount $MNT
         mount $DEV $MNT
      
         start=$(date +%s%N)
         $CP_PROG $MNT/foobar /dev/null
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "cp took $dur milliseconds without data/metadata cached and no delalloc"
      
         umount $MNT
      
      The results, running on a box with a non-debug kernel (Debian's default
      kernel config), were the following:
      
      128M file, before patchset:
      
         cp took 16574 milliseconds with data/metadata cached and delalloc
         cp took 122 milliseconds with data/metadata cached and no delalloc
         cp took 20144 milliseconds without data/metadata cached and no delalloc
      
      128M file, after patchset:
      
         cp took 6277 milliseconds with data/metadata cached and delalloc
         cp took 109 milliseconds with data/metadata cached and no delalloc
         cp took 210 milliseconds without data/metadata cached and no delalloc
      
      512M file, before patchset:
      
         cp took 14369 milliseconds with data/metadata cached and delalloc
         cp took 429 milliseconds with data/metadata cached and no delalloc
         cp took 88034 milliseconds without data/metadata cached and no delalloc
      
      512M file, after patchset:
      
         cp took 12106 milliseconds with data/metadata cached and delalloc
         cp took 427 milliseconds with data/metadata cached and no delalloc
         cp took 824 milliseconds without data/metadata cached and no delalloc
      
      1G file, before patchset:
      
         cp took 10074 milliseconds with data/metadata cached and delalloc
         cp took 886 milliseconds with data/metadata cached and no delalloc
         cp took 181261 milliseconds without data/metadata cached and no delalloc
      
      1G file, after patchset:
      
         cp took 3320 milliseconds with data/metadata cached and delalloc
         cp took 880 milliseconds with data/metadata cached and no delalloc
         cp took 1801 milliseconds without data/metadata cached and no delalloc
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c32c721
    • Filipe Manana's avatar
      btrfs: use cached state when looking for delalloc ranges with fiemap · b3e744fe
      Filipe Manana authored
      During fiemap, whenever we find a hole or prealloc extent, we will look
      for delalloc in that range, and one of the things we do for that is to
      find out ranges in the inode's io_tree marked with EXTENT_DELALLOC, using
      calls to count_range_bits().
      
      Since we process file extents from left to right, if we have a file with
      several holes or prealloc extents, we benefit from keeping a cached extent
      state record for calls to count_range_bits(). Most of the time the last
      extent state record we visited in one call to count_range_bits() matches
      the first extent state record we will use in the next call to
      count_range_bits(), so there's a benefit here. So use an extent state
      record to cache results from count_range_bits() calls during fiemap.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3e744fe
    • Filipe Manana's avatar
      btrfs: update stale comment for count_range_bits() · 1ee51a06
      Filipe Manana authored
      The comment for count_range_bits() mentions that the search is fast if we
      are asking for a range with the EXTENT_DIRTY bit set. However that is no
      longer true since we don't use that bit and the optimization for that was
      removed in:
      
        commit 71528e9e ("btrfs: get rid of extent_io_tree::dirty_bytes")
      
      So remove that part of the comment mentioning the no longer existing
      optimized case, and, while at it, add proper documentation describing the
      purpose, arguments and return value of the function.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1ee51a06
    • Filipe Manana's avatar
      btrfs: allow passing a cached state record to count_range_bits() · 8c6e53a7
      Filipe Manana authored
      An inode's io_tree can be quite large and there are cases where due to
      delalloc it can have thousands of extent state records, which makes the
      red black tree have a depth of 10 or more, making the operation of
      count_range_bits() slow if we repeatedly call it for a range that starts
      where, or after, the previous one we called it for. Such use cases are
      when searching for delalloc in a file range that corresponds to a hole or
      a prealloc extent, which is done during lseek SEEK_HOLE/DATA and fiemap.
      
      So introduce a cached state parameter to count_range_bits() which we use
      to store the last extent state record we visited, and then allow the
      caller to pass it again on its next call to count_range_bits(). The next
      patches in the series will make fiemap and lseek use the new parameter.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8c6e53a7
    • Filipe Manana's avatar
      btrfs: remove no longer used btrfs_next_extent_map() · cfd7a17d
      Filipe Manana authored
      There are no more users of btrfs_next_extent_map(), the previous patch
      in the series ("btrfs: search for delalloc more efficiently during
      lseek/fiemap") removed the last usage of the function, so delete it.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cfd7a17d
    • Filipe Manana's avatar
      btrfs: search for delalloc more efficiently during lseek/fiemap · 8ddc8274
      Filipe Manana authored
      During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
      that corresponds to a hole or a prealloc extent, we have to check if
      there's any delalloc in the range. We do it by searching for delalloc
      ranges in the inode's io_tree (for unflushed delalloc) and in the inode's
      extent map tree (for delalloc that is flushing).
      
      We avoid searching the extent map tree if the number of outstanding
      extents is 0, as in that case we can't have extent maps for our search
      range in the tree that correspond to delalloc that is flushing. However
      if we have any unflushed delalloc, due to buffered writes or mmap writes,
      then the outstanding extents counter is not 0 and we'll search the extent
      map tree. The tree may be large because it can have lots of extent maps
      that were loaded by reads or created by previous writes, therefore taking
      a significant time to search the tree, specially if have a file with a
      lot of holes and/or prealloc extents.
      
      We can improve on this by instead of searching the extent map tree,
      searching the ordered extents tree of the inode, since when delalloc is
      flushing we create an ordered extent along with the new extent map, while
      holding the respective file range locked in the inode's io_tree. The
      ordered extents tree is typically much smaller, since ordered extents have
      a short life and get removed from the tree once they are completed, while
      extent maps can stay for a very long time in the extent map tree, either
      created by previous writes or loaded by read operations.
      
      So use the ordered extents tree instead of the extent maps tree.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ddc8274
    • Filipe Manana's avatar
      btrfs: skip unnecessary delalloc searches during lseek/fiemap · af979fd6
      Filipe Manana authored
      During lseek (SEEK_HOLE/DATA) and fiemap, when processing a file range
      that corresponds to a hole or a prealloc extent, if we find that there is
      no delalloc marked in the inode's io_tree but there is delalloc due to
      an extent map in the io tree, then on the next iteration that calls
      find_delalloc_subrange() we can skip searching the io tree again, since
      on the first call we had no delalloc in the io tree for the whole range.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      af979fd6
    • Filipe Manana's avatar
      btrfs: add an early exit when searching for delalloc range for lseek/fiemap · 40daf3e0
      Filipe Manana authored
      During fiemap and lseek (SEEK_HOLE/DATA), when looking for delalloc in a
      range corresponding to a hole or a prealloc extent, if we found the whole
      range marked as delalloc in the inode's io_tree, then we can terminate
      immediately and avoid searching the extent map tree. If not, and if the
      found delalloc starts at the same offset of our search start but ends
      before our search range's end, then we can adjust the search range for
      the search in the extent map tree. So implement those changes.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      40daf3e0
    • Filipe Manana's avatar
      btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree · 2c8f5e8c
      Filipe Manana authored
      We don't need to set the EXTENT_UPDATE bit in an inode's io_tree to mark a
      range as uptodate, we rely on the pages themselves being uptodate - page
      reading is not triggered for already uptodate pages. Recently we removed
      most use of the EXTENT_UPTODATE for buffered IO with commit 52b029f4
      ("btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path"),
      but there were a few leftovers, namely when reading from holes and
      successfully finishing read repair.
      
      These leftovers are unnecessarily making an inode's tree larger and deeper,
      slowing down searches on it. So remove all the leftovers.
      
      This change is part of a patchset that has the goal to make performance
      better for applications that use lseek's SEEK_HOLE and SEEK_DATA modes to
      iterate over the extents of a file. Two examples are the cp program from
      coreutils 9.0+ and the tar program (when using its --sparse / -S option).
      A sample test and results are listed in the changelog of the last patch
      in the series:
      
        1/9 btrfs: remove leftover setting of EXTENT_UPTODATE state in an inode's io_tree
        2/9 btrfs: add an early exit when searching for delalloc range for lseek/fiemap
        3/9 btrfs: skip unnecessary delalloc searches during lseek/fiemap
        4/9 btrfs: search for delalloc more efficiently during lseek/fiemap
        5/9 btrfs: remove no longer used btrfs_next_extent_map()
        6/9 btrfs: allow passing a cached state record to count_range_bits()
        7/9 btrfs: update stale comment for count_range_bits()
        8/9 btrfs: use cached state when looking for delalloc ranges with fiemap
        9/9 btrfs: use cached state when looking for delalloc ranges with lseek
      Reported-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Link: https://lore.kernel.org/linux-btrfs/20221106073028.71F9.409509F4@e16-tech.com/
      Link: https://lore.kernel.org/linux-btrfs/CAL3q7H5NSVicm7nYBJ7x8fFkDpno8z3PYt5aPU43Bajc1H0h1Q@mail.gmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2c8f5e8c
    • Qu Wenruo's avatar
      btrfs: move tree block parentness check into validate_extent_buffer() · 947a6299
      Qu Wenruo authored
      [BACKGROUND]
      Although both btrfs metadata and data has their read time verification
      done at endio time (btrfs_validate_metadata_buffer() and
      btrfs_verify_data_csum()), metadata has extra verification, mostly
      parentness check including first key/transid/owner_root/level, done at
      read_tree_block() and btrfs_read_extent_buffer().
      
      On the other hand, all the data verification is done at endio context.
      
      [ENHANCEMENT]
      This patch will make a new union in btrfs_bio, taking the space of the
      old data checksums, thus it will not increase the memory usage.
      
      With that extra btrfs_tree_parent_check inside btrfs_bio, we can just
      pass the check parameter into read_extent_buffer_pages(), and before
      submitting the bio, we can copy the check structure into btrfs_bio.
      
      And finally at endio time, we can grab btrfs_bio::parent_check and pass
      it to validate_extent_buffer(), to move the remaining checks into it.
      
      This brings the following benefits:
      
      - Much simpler btrfs_read_extent_buffer()
        Now it only needs to iterate through all mirrors.
      
      - Simpler read-time transid check
        Previously we go verify_parent_transid() after reading out the extent
        buffer.
        Now the transid check is done inside the endio function, no other
        code can modify the content.
        Thus no need to use the extent lock anymore.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      947a6299
    • Qu Wenruo's avatar
      btrfs: concentrate all tree block parentness check parameters into one structure · 789d6a3a
      Qu Wenruo authored
      There are several different tree block parentness check parameters used
      across several helpers:
      
      - level
        Mandatory
      
      - transid
        Under most cases it's mandatory, but there are several backref cases
        which skips this check.
      
      - owner_root
      - first_key
        Utilized by most top-down tree search routine. Otherwise can be
        skipped.
      
      Those four members are not always mandatory checks, and some of them are
      the same u64, which means if some arguments got swapped compiler will
      not catch it.
      
      Furthermore if we're going to further expand the parentness check, we
      need to modify quite some helpers just to add one more parameter.
      
      This patch will concentrate all these members into a structure called
      btrfs_tree_parent_check, and pass that structure for the following
      helpers:
      
      - btrfs_read_extent_buffer()
      - read_tree_block()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      789d6a3a
    • Anand Jain's avatar
      btrfs: move device->name RCU allocation and assign to btrfs_alloc_device() · bb21e302
      Anand Jain authored
      There is a repeating code section in the parent function after calling
      btrfs_alloc_device(), as below:
      
            name = rcu_string_strdup(path, GFP_...);
            if (!name) {
                    btrfs_free_device(device);
                    return ERR_PTR(-ENOMEM);
            }
            rcu_assign_pointer(device->name, name);
      
      Except in add_missing_dev() for obvious reasons.
      
      This patch consolidates that repeating code into the btrfs_alloc_device()
      itself so that the parent function doesn't have to duplicate code.
      This consolidation also helps to review issues regarding RCU lock
      violation with device->name.
      
      Parent function device_list_add() and add_missing_dev() use GFP_NOFS for
      the allocation, whereas the rest of the parent functions use GFP_KERNEL,
      so bring the NOFS allocation context using memalloc_nofs_save() in the
      function device_list_add() and add_missing_dev() is already doing it.
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb21e302
    • David Sterba's avatar
      btrfs: constify input buffer parameter in compression code · 3e09b5b2
      David Sterba authored
      The input buffers passed down to compression must never be changed,
      switch type to u8 as it's a raw byte buffer and use const.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e09b5b2
    • Qu Wenruo's avatar
      btrfs: raid56: remove the old error tracking system · ad3daf1c
      Qu Wenruo authored
      Since all the recovery paths have been migrated to the new error bitmap
      based system, we can remove the old stripe number based system.
      
      This cleanup involves one behavior change:
      
      - Rebuild rbio can no longer be merged
        Previously a rebuild rbio (caused by retry after data csum mismatch)
        can be merged, if the error happens in the same stripe.
      
        But with the new error bitmap based solution, it's much harder to
        compare error bitmaps.
      
        So here we just don't merge rebuild rbio at all.
        This may introduce some performance impact at extreme corner cases,
        but we're willing to take it.
      
      Other than that, this patch will cleanup the following members:
      
      - rbio::faila
      - rbio::failb
        They will be replaced by per-vertical stripe check, which is more
        accurate.
      
      - rbio::error
        It will be replace by per-vertical stripe error bitmap check.
      
      - Allow get_rbio_vertical_errors() to accept NULL pointers for
        @faila and @failb
        Some call sites only want to check if we have errors beyond the
        tolerance.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad3daf1c
    • Qu Wenruo's avatar
      btrfs: raid56: migrate recovery and scrub recovery path to use error_bitmap · 75b47033
      Qu Wenruo authored
      Since we have rbio::error_bitmap to indicate exactly where the errors
      are (including read error and csum mismatch error), we can make recovery
      path more accurate.
      
      For example:
      
                   0        32K       64K
           Data 1  |XXXXXXXX|         |
           Data 2  |        |XXXXXXXXX|
           Parity  |        |         |
      
      1) Get csum mismatch when reading data 1 [0, 32K)
      
      2) Mark corresponding range error
         The old code will mark the whole data 1 stripe as error.
         While the new code will only mark data 1 [0, 32K) as error.
      
      3) Recovery path
         The old code will recover data 1 [0, 64K), all using Data 2 and
         parity.
      
         This means, Data 1 [32K, 64K) will be corrupted data, as data 2
         [32K, 64K) is already corrupted.
      
         While the new code will only recover data 1 [0, 32K), as only
         that range has error so far.
      
      This new behavior can avoid populating rbio cache with incorrect data.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      75b47033
    • Qu Wenruo's avatar
      btrfs: raid56: introduce btrfs_raid_bio::error_bitmap · 2942a50d
      Qu Wenruo authored
      Currently btrfs raid56 uses btrfs_raid_bio::faila and failb to indicate
      which stripe(s) had IO errors.
      
      But that has some problems:
      
      - If one sector failed csum check, the whole stripe where the corruption
        is will be marked error.
        This can reduce the chance we do recover, like this:
      
                0  4K 8K
        Data 1  |XX|  |
        Data 2  |  |XX|
        Parity  |  |  |
      
        In above case, 0~4K in data 1 should be recovered using data 2 and
        parity, while 4K~8K in data 2 should be recovered using data 1 and
        parity.
      
        Currently if we trigger read on 0~4K of data 1, we will also recover
        4K~8K of data 1 using corrupted data 2 and parity, causing wrong
        result in rbio cache.
      
      - Harder to expand for future M-N scheme
        As we're limited to just faila/b, two corruptions.
      
      - Harder to expand to handle extra csum errors
        This can be problematic if we start to do csum verification.
      
      This patch will introduce an extra @error_bitmap, where one bit
      represents error that happened for that sector.
      
      The choice to introduce a new error bitmap other than reusing
      sector_ptr, is to avoid extra search between rbio::stripe_sectors[] and
      rbio::bio_sectors[].
      
      Since we can submit bio using sectors from both sectors, doing proper
      search on both array will more complex.
      
      Although the new bitmap will take extra memory, later we can remove
      things like @error and faila/b to save some memory.
      
      Currently the new error bitmap and failab mechanism coexists, the error
      bitmap is only updated at endio time and recover entrance.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2942a50d
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_add_delayed_iput · e55cf7ca
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e55cf7ca
    • David Sterba's avatar
      btrfs: use btrfs_inode inside btrfs_verify_data_csum · 5fc24314
      David Sterba authored
      The function is mostly using internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5fc24314
    • David Sterba's avatar
      btrfs: use btrfs_inode inside compress_file_range · 99a01bd6
      David Sterba authored
      The function is mostly using internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      99a01bd6
    • David Sterba's avatar
      btrfs: switch async_chunk::inode to btrfs_inode · 99a81a44
      David Sterba authored
      The async_chunk::inode structure is for internal interfaces so we should
      use the btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      99a81a44
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_inherit_iflags · 7a0443f0
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a0443f0
    • David Sterba's avatar
      btrfs: pass btrfs_inode to inode_tree_add · 4c45a4f4
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c45a4f4
    • David Sterba's avatar
      btrfs: pass btrfs_inode to fixup_tree_root_location · 3c1b1c4c
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c1b1c4c
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_inode_by_name · d1de429b
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d1de429b
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_unlink_subvol · 5b7544cb
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5b7544cb
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_clear_delalloc_extent · bd54766e
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd54766e
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_split_delalloc_extent · 62798a49
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      62798a49
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_set_delalloc_extent · 4c5d166f
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c5d166f
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_merge_delalloc_extent · 2454151c
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2454151c
    • David Sterba's avatar
      btrfs: switch extent_io_tree::private_data to btrfs_inode and rename · 0988fc7b
      David Sterba authored
      The extent_io_tree::private_data was meant to be a preparatory work for
      the metadata inode rework but that never materialized. Now it's used
      only for an inode so it's better to change the appropriate type and
      rename it.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0988fc7b
    • David Sterba's avatar
      btrfs: drop private_data parameter from extent_io_tree_init · 35da5a7e
      David Sterba authored
      All callers except one pass NULL, so the parameter can be dropped and
      the inode::io_tree initialization can be open coded.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35da5a7e
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_delete_subvolume · 3c4f91e2
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c4f91e2
    • David Sterba's avatar
      btrfs: pass btrfs_inode to __unlink_start_trans · e569b1d5
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e569b1d5
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_check_data_csum · 621af94a
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      621af94a
    • David Sterba's avatar
      btrfs: switch btrfs_writepage_fixup::inode to btrfs_inode · 36eeaef5
      David Sterba authored
      The btrfs_writepage_fixup structure is for internal interfaces so we
      should use the btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36eeaef5
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_add_delalloc_inodes · 82ca5a04
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      82ca5a04
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_dirty_inode · 7152b425
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7152b425
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_inode_unlock · e5d4d75b
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e5d4d75b
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_inode_lock · 29b6352b
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      29b6352b
    • David Sterba's avatar
      btrfs: pass btrfs_inode to btrfs_truncate · d9dcae67
      David Sterba authored
      The function is for internal interfaces so we should use the
      btrfs_inode.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d9dcae67