1. 26 Sep, 2022 40 commits
    • Filipe Manana's avatar
      btrfs: make fiemap more efficient and accurate reporting extent sharedness · ac3c0d36
      Filipe Manana authored
      The current fiemap implementation does not scale very well with the number
      of extents a file has. This is both because the main algorithm to find out
      the extents has a high algorithmic complexity and because for each extent
      we have to check if it's shared. This second part, checking if an extent
      is shared, is significantly improved by the two previous patches in this
      patchset, while the first part is improved by this specific patch. Every
      now and then we get reports from users mentioning fiemap is too slow or
      even unusable for files with a very large number of extents, such as the
      two recent reports referred to by the Link tags at the bottom of this
      change log.
      
      To understand why the part of finding which extents a file has is very
      inefficient, consider the example of doing a full ranged fiemap against
      a file that has over 100K extents (normal for example for a file with
      more than 10G of data and using compression, which limits the extent size
      to 128K). When we enter fiemap at extent_fiemap(), the following happens:
      
      1) Before entering the main loop, we call get_extent_skip_holes() to get
         the first extent map. This leads us to btrfs_get_extent_fiemap(), which
         in turn calls btrfs_get_extent(), to find the first extent map that
         covers the file range [0, LLONG_MAX).
      
         btrfs_get_extent() will first search the inode's extent map tree, to
         see if we have an extent map there that covers the range. If it does
         not find one, then it will search the inode's subvolume b+tree for a
         fitting file extent item. After finding the file extent item, it will
         allocate an extent map, fill it in with information extracted from the
         file extent item, and add it to the inode's extent map tree (which
         requires a search for insertion in the tree).
      
      2) Then we enter the main loop at extent_fiemap(), emit the details of
         the extent, and call again get_extent_skip_holes(), with a start
         offset matching the end of the extent map we previously processed.
      
         We end up at btrfs_get_extent() again, will search the extent map tree
         and then search the subvolume b+tree for a file extent item if we could
         not find an extent map in the extent tree. We allocate an extent map,
         fill it in with the details in the file extent item, and then insert
         it into the extent map tree (yet another search in this tree).
      
      3) The second step is repeated over and over, until we have processed the
         whole file range. Each iteration ends at btrfs_get_extent(), which
         does a red black tree search on the extent map tree, then searches the
         subvolume b+tree, allocates an extent map and then does another search
         in the extent map tree in order to insert the extent map.
      
         In the best scenario we have all the extent maps already in the extent
         tree, and so for each extent we do a single search on a red black tree,
         so we have a complexity of O(n log n).
      
         In the worst scenario we don't have any extent map already loaded in
         the extent map tree, or have very few already there. In this case the
         complexity is much higher since we do:
      
         - A red black tree search on the extent map tree, which has O(log n)
           complexity, initially very fast since the tree is empty or very
           small, but as we end up allocating extent maps and adding them to
           the tree when we don't find them there, each subsequent search on
           the tree gets slower, since it's getting bigger and bigger after
           each iteration.
      
         - A search on the subvolume b+tree, also O(log n) complexity, but it
           has items for all inodes in the subvolume, not just items for our
           inode. Plus on a filesystem with concurrent operations on other
           inodes, we can block doing the search due to lock contention on
           b+tree nodes/leaves.
      
         - Allocate an extent map - this can block, and can also fail if we
           are under serious memory pressure.
      
         - Do another search on the extent maps red black tree, with the goal
           of inserting the extent map we just allocated. Again, after every
           iteration this tree is getting bigger by 1 element, so after many
           iterations the searches are slower and slower.
      
         - We will not need the allocated extent map anymore, so it's pointless
           to add it to the extent map tree. It's just wasting time and memory.
      
         In short we end up searching the extent map tree multiple times, on a
         tree that is growing bigger and bigger after each iteration. And
         besides that we visit the same leaf of the subvolume b+tree many times,
         since a leaf with the default size of 16K can easily have more than 200
         file extent items.
      
      This is very inefficient overall. This patch changes the algorithm to
      instead iterate over the subvolume b+tree, visiting each leaf only once,
      and only searching in the extent map tree for file ranges that have holes
      or prealloc extents, in order to figure out if we have delalloc there.
      It will never allocate an extent map and add it to the extent map tree.
      This is very similar to what was previously done for the lseek's hole and
      data seeking features.
      
      Also, the current implementation relying on extent maps for figuring out
      which extents we have is not correct. This is because extent maps can be
      merged even if they represent different extents - we do this to minimize
      memory utilization and keep extent map trees smaller. For example if we
      have two extents that are contiguous on disk, once we load the two extent
      maps, they get merged into a single one - however if only one of the
      extents is shared, we end up reporting both as shared or both as not
      shared, which is incorrect.
      
      This reproducer triggers that bug:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
      We end up reporting that we have a single 512K that is not shared, however
      we have two 256K extents, and the second one is shared. Changing the
      reproducer to clone instead the first extent into file 'bar', makes us
      report a single 512K extent that is shared, which is algo incorrect since
      we have two 256K extents and only the first one is shared.
      
      This patch is part of a larger patchset that is comprised of the following
      patches:
      
          btrfs: allow hole and data seeking to be interruptible
          btrfs: make hole and data seeking a lot more efficient
          btrfs: remove check for impossible block start for an extent map at fiemap
          btrfs: remove zero length check when entering fiemap
          btrfs: properly flush delalloc when entering fiemap
          btrfs: allow fiemap to be interruptible
          btrfs: rename btrfs_check_shared() to a more descriptive name
          btrfs: speedup checking for extent sharedness during fiemap
          btrfs: skip unnecessary extent buffer sharedness checks during fiemap
          btrfs: make fiemap more efficient and accurate reporting extent sharedness
      
      The patchset was tested on a machine running a non-debug kernel (Debian's
      default config) and compared the tests below on a branch without the
      patchset versus the same branch with the whole patchset applied.
      
      The following test for a large compressed file without holes:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 20G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After patchset:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1214 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 684 milliseconds (metadata cached)
      
      That's a speedup of about 3x for both cases (no metadata cached and all
      metadata cached).
      
      The test provided by Pavel (first Link tag at the bottom), which uses
      files with a large number of holes, was also used to measure the gains,
      and it consists on a small C program and a shell script to invoke it.
      The C program is the following:
      
          $ cat pavels-test.c
          #include <stdio.h>
          #include <unistd.h>
          #include <stdlib.h>
          #include <fcntl.h>
      
          #include <sys/stat.h>
          #include <sys/time.h>
          #include <sys/ioctl.h>
      
          #include <linux/fs.h>
          #include <linux/fiemap.h>
      
          #define FILE_INTERVAL (1<<13) /* 8Kb */
      
          long long interval(struct timeval t1, struct timeval t2)
          {
              long long val = 0;
              val += (t2.tv_usec - t1.tv_usec);
              val += (t2.tv_sec - t1.tv_sec) * 1000 * 1000;
              return val;
          }
      
          int main(int argc, char **argv)
          {
              struct fiemap fiemap = {};
              struct timeval t1, t2;
              char data = 'a';
              struct stat st;
              int fd, off, file_size = FILE_INTERVAL;
      
              if (argc != 3 && argc != 2) {
                      printf("usage: %s <path> [size]\n", argv[0]);
                      return 1;
              }
      
              if (argc == 3)
                      file_size = atoi(argv[2]);
              if (file_size < FILE_INTERVAL)
                      file_size = FILE_INTERVAL;
              file_size -= file_size % FILE_INTERVAL;
      
              fd = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
              if (fd < 0) {
                  perror("open");
                  return 1;
              }
      
              for (off = 0; off < file_size; off += FILE_INTERVAL) {
                  if (pwrite(fd, &data, 1, off) != 1) {
                      perror("pwrite");
                      close(fd);
                      return 1;
                  }
              }
      
              if (ftruncate(fd, file_size)) {
                  perror("ftruncate");
                  close(fd);
                  return 1;
              }
      
              if (fstat(fd, &st) < 0) {
                  perror("fstat");
                  close(fd);
                  return 1;
              }
      
              printf("size: %ld\n", st.st_size);
              printf("actual size: %ld\n", st.st_blocks * 512);
      
              fiemap.fm_length = FIEMAP_MAX_OFFSET;
              gettimeofday(&t1, NULL);
              if (ioctl(fd, FS_IOC_FIEMAP, &fiemap) < 0) {
                  perror("fiemap");
                  close(fd);
                  return 1;
              }
              gettimeofday(&t2, NULL);
      
              printf("fiemap: fm_mapped_extents = %d\n",
                     fiemap.fm_mapped_extents);
              printf("time = %lld us\n", interval(t1, t2));
      
              close(fd);
              return 0;
          }
      
          $ gcc -o pavels_test pavels_test.c
      
      And the wrapper shell script:
      
          $ cat fiemap-pavels-test.sh
      
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f -O no-holes $DEV
          mount $DEV $MNT
      
          echo
          echo "*********** 256M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 28))
          echo
          ./pavels-test $MNT/testfile $((1 << 28))
      
          echo
          echo "*********** 512M ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 29))
          echo
          ./pavels-test $MNT/testfile $((1 << 29))
      
          echo
          echo "*********** 1G ***********"
          echo
      
          ./pavels-test $MNT/testfile $((1 << 30))
          echo
          ./pavels-test $MNT/testfile $((1 << 30))
      
          umount $MNT
      
      Running his reproducer before applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4003133 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 4895330 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 30123675 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 33450934 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 224924074 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 217239242 us
      
      Running it after applying the patchset:
      
          *********** 256M ***********
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29475 us
      
          size: 268435456
          actual size: 134217728
          fiemap: fm_mapped_extents = 32768
          time = 29307 us
      
          *********** 512M ***********
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 58996 us
      
          size: 536870912
          actual size: 268435456
          fiemap: fm_mapped_extents = 65536
          time = 59115 us
      
          *********** 1G ***********
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 116251
          time = 124141 us
      
          size: 1073741824
          actual size: 536870912
          fiemap: fm_mapped_extents = 131072
          time = 119387 us
      
      The speedup is massive, both on the first fiemap call and on the second
      one as well, as his test creates files with many holes and small extents
      (every extent follows a hole and precedes another hole).
      
      For the 256M file we go from 4 seconds down to 29 milliseconds in the
      first run, and then from 4.9 seconds down to 29 milliseconds again in the
      second run, a speedup of 138x and 169x, respectively.
      
      For the 512M file we go from 30.1 seconds down to 59 milliseconds in the
      first run, and then from 33.5 seconds down to 59 milliseconds again in the
      second run, a speedup of 510x and 568x, respectively.
      
      For the 1G file, we go from 225 seconds down to 124 milliseconds in the
      first run, and then from 217 seconds down to 119 milliseconds in the
      second run, a speedup of 1815x and 1824x, respectively.
      Reported-by: default avatarPavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/Reported-by: default avatarDominique MARTINET <dominique.martinet@atmark-techno.com>
      Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac3c0d36
    • Filipe Manana's avatar
      btrfs: skip unnecessary extent buffer sharedness checks during fiemap · b8f164e3
      Filipe Manana authored
      During fiemap, for each file extent we find, we must check if it's shared
      or not. The sharedness check starts by verifying if the extent is directly
      shared (its refcount in the extent tree is > 1), and if it is not directly
      shared, then we will check if every node in the subvolume b+tree leading
      from the root to the leaf that has the file extent item (in reverse order),
      is shared (through snapshots).
      
      However this second step is not needed if our extent was created in a
      transaction more recent than the last transaction where a snapshot of the
      inode's root happened, because it can't be shared indirectly (through
      shared subtrees) without a snapshot created in a more recent transaction.
      
      So grab the generation of the extent from the extent map and pass it to
      btrfs_is_data_extent_shared(), which will skip this second phase when the
      generation is more recent than the root's last snapshot value. Note that
      we skip this optimization if the extent map is the result of merging 2
      or more extent maps, because in this case its generation is the maximum
      of the generations of all merged extent maps.
      
      The fact the we use extent maps and they can be merged despite the
      underlying extents being distinct (different file extent items in the
      subvolume b+tree and different extent items in the extent b+tree), can
      result in some bugs when reporting shared extents. But this is a problem
      of the current implementation of fiemap relying on extent maps.
      One example where we get incorrect results is:
      
          $ cat fiemap-bug.sh
          #!/bin/bash
      
          DEV=/dev/sdj
          MNT=/mnt/sdj
      
          mkfs.btrfs -f $DEV
          mount $DEV $MNT
      
          # Create a file with two 256K extents.
          # Since there is no other write activity, they will be contiguous,
          # and their extent maps merged, despite having two distinct extents.
          xfs_io -f -c "pwrite -S 0xab 0 256K" \
                    -c "fsync" \
                    -c "pwrite -S 0xcd 256K 256K" \
                    -c "fsync" \
                    $MNT/foo
      
          # Now clone only the second extent into another file.
          xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
      
          # Filefrag will report a single 512K extent, and say it's not shared.
          echo
          filefrag -v $MNT/foo
      
          umount $MNT
      
      Running the reproducer:
      
          $ ./fiemap-bug.sh
          wrote 262144/262144 bytes at offset 0
          256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
          wrote 262144/262144 bytes at offset 262144
          256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
          linked 262144/262144 bytes at offset 0
          256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
      
          Filesystem type is: 9123683e
          File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
           ext:     logical_offset:        physical_offset: length:   expected: flags:
             0:        0..     127:       3328..      3455:    128:             last,eof
          /mnt/sdj/foo: 1 extent found
      
      We end up reporting that we have a single 512K that is not shared, however
      we have two 256K extents, and the second one is shared. Changing the
      reproducer to clone instead the first extent into file 'bar', makes us
      report a single 512K extent that is shared, which is algo incorrect since
      we have two 256K extents and only the first one is shared.
      
      This is z problem that existed before this change, and remains after this
      change, as it can't be easily fixed. The next patch in the series reworks
      fiemap to primarily use file extent items instead of extent maps (except
      for checking for delalloc ranges), with the goal of improving its
      scalability and performance, but it also ends up fixing this particular
      bug caused by extent map merging.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8f164e3
    • Filipe Manana's avatar
      btrfs: speedup checking for extent sharedness during fiemap · 12a824dc
      Filipe Manana authored
      One of the most expensive tasks performed during fiemap is to check if
      an extent is shared. This task has two major steps:
      
      1) Check if the data extent is shared. This implies checking the extent
         item in the extent tree, checking delayed references, etc. If we
         find the data extent is directly shared, we terminate immediately;
      
      2) If the data extent is not directly shared (its extent item has a
         refcount of 1), then it may be shared if we have snapshots that share
         subtrees of the inode's subvolume b+tree. So we check if the leaf
         containing the file extent item is shared, then its parent node, then
         the parent node of the parent node, etc, until we reach the root node
         or we find one of them is shared - in which case we stop immediately.
      
      During fiemap we process the extents of a file from left to right, from
      file offset 0 to EOF. This means that we iterate b+tree leaves from left
      to right, and has the implication that we keep repeating that second step
      above several times for the same b+tree path of the inode's subvolume
      b+tree.
      
      For example, if we have two file extent items in leaf X, and the path to
      leaf X is A -> B -> C -> X, then when we try to determine if the data
      extent referenced by the first extent item is shared, we check if the data
      extent is shared - if it's not, then we check if leaf X is shared, if not,
      then we check if node C is shared, if not, then check if node B is shared,
      if not than check if node A is shared. When we move to the next file
      extent item, after determining the data extent is not shared, we repeat
      the checks for X, C, B and A - doing all the expensive searches in the
      extent tree, delayed refs, etc. If we have thousands of tile extents, then
      we keep repeating the sharedness checks for the same paths over and over.
      
      On a file that has no shared extents or only a small portion, it's easy
      to see that this scales terribly with the number of extents in the file
      and the sizes of the extent and subvolume b+trees.
      
      This change eliminates the repeated sharedness check on extent buffers
      by caching the results of the last path used. The results can be used as
      long as no snapshots were created since they were cached (for not shared
      extent buffers) or no roots were dropped since they were cached (for
      shared extent buffers). This greatly reduces the time spent by fiemap for
      files with thousands of extents and/or large extent and subvolume b+trees.
      
      Example performance test:
      
          $ cat fiemap-perf-test.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
          mount -o compress=lzo $DEV $MNT
      
          # 40G gives 327680 128K file extents (due to compression).
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata not cached)"
      
          start=$(date +%s%N)
          filefrag $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "fiemap took $dur milliseconds (metadata cached)"
      
          umount $MNT
      
      Before this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 3597 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 2107 milliseconds (metadata cached)
      
      After this patch:
      
          $ ./fiemap-perf-test.sh
          (...)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 1646 milliseconds (metadata not cached)
          /mnt/sdi/foobar: 327680 extents found
          fiemap took 698 milliseconds (metadata cached)
      
      That's about 2.2x faster when no metadata is cached, and about 3x faster
      when all metadata is cached. On a real filesystem with many other files,
      data, directories, etc, the b+trees will be 2 or 3 levels higher,
      therefore this optimization will have a higher impact.
      
      Several reports of a slow fiemap show up often, the two Link tags below
      refer to two recent reports of such slowness. This patch, together with
      the next ones in the series, is meant to address that.
      
      Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
      Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12a824dc
    • Filipe Manana's avatar
      btrfs: rename btrfs_check_shared() to a more descriptive name · 8eedadda
      Filipe Manana authored
      The function btrfs_check_shared() is supposed to be used to check if a
      data extent is shared, but its name is too generic, may easily cause
      confusion in the sense that it may be used for metadata extents.
      
      So rename it to btrfs_is_data_extent_shared(), which will also make it
      less confusing after the next change that adds a backref lookup cache for
      the b+tree nodes that lead to the leaf that contains the file extent item
      that points to the target data extent.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8eedadda
    • Filipe Manana's avatar
      btrfs: allow fiemap to be interruptible · 09fbc1c8
      Filipe Manana authored
      Doing fiemap on a file with a very large number of extents can take a very
      long time, and we have reports of it being too slow (two recent examples
      in the Link tags below), so make it interruptible.
      
      Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
      Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      09fbc1c8
    • Filipe Manana's avatar
      btrfs: properly flush delalloc when entering fiemap · 33a86cfa
      Filipe Manana authored
      If the flag FIEMAP_FLAG_SYNC is passed to fiemap, it means all delalloc
      should be flushed and writeback complete. We call the generic helper
      fiemap_prep() which does a filemap_write_and_wait() in case that flag is
      given, however that is not enough if we have compression. Because a
      single filemap_fdatawrite_range() only starts compression (in an async
      thread) and therefore returns before the compression is done and writeback
      is started.
      
      So make btrfs_fiemap(), actually wait for all writeback to start and
      complete if FIEMAP_FLAG_SYNC is set. We start and wait for writeback
      on the whole possible file range, from 0 to LLONG_MAX, because that is
      what the generic code at fiemap_prep() does.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      33a86cfa
    • Filipe Manana's avatar
      btrfs: remove zero length check when entering fiemap · 9a42bbae
      Filipe Manana authored
      There's no point to check for a 0 length at extent_fiemap(), as before
      calling it, we called fiemap_prep() at btrfs_fiemap(), which already
      checks for a zero length and returns the same -EINVAL error. So remove
      the pointless check.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a42bbae
    • Filipe Manana's avatar
      btrfs: remove check for impossible block start for an extent map at fiemap · f12eec9a
      Filipe Manana authored
      During fiemap we are testing if an extent map has a block start with a
      value of EXTENT_MAP_LAST_BYTE, but that is never set on an extent map,
      and never was according to git history. So remove that useless check.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f12eec9a
    • Filipe Manana's avatar
      btrfs: make hole and data seeking a lot more efficient · b6e83356
      Filipe Manana authored
      The current implementation of hole and data seeking for llseek does not
      scale well in regards to the number of extents and the distance between
      the start offset and the next hole or extent. This is due to a very high
      algorithmic complexity. Often we also get reports of btrfs' hole and data
      seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
      tag at the bottom).
      
      In order to better understand it, lets consider the case where the start
      offset is 0, we are seeking for a hole and the file size is 16G. Between
      file offset 0 and the first hole in the file there are 100K extents - this
      is common for large files, specially if we have compression enabled, since
      the maximum extent size is limited to 128K. The steps take by the main
      loop of the current algorithm are the following:
      
      1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
         calls btrfs_get_extent(). This will first lookup for an extent map in
         the inode's extent map tree (a red black tree). If the extent map is
         not loaded in memory, then it will do a lookup for the corresponding
         file extent item in the subvolume's b+tree, create an extent map based
         on the contents of the file extent item and then add the extent map to
         the extent map tree of the inode;
      
      2) The second iteration calls btrfs_get_extent_fiemap() again, this time
         with a start offset matching the end offset of the previous extent.
         Again, btrfs_get_extent() will first search the extent map tree, and
         if it doesn't find an extent map there, it will again search in the
         b+tree of the subvolume for a matching file extent item, build an
         extent map based on the file extent item, and add the extent map to
         to the extent map tree of the inode;
      
      3) This repeats over and over until we find the first hole (when seeking
         for holes) or until we find the first extent (when seeking for data).
      
         If there no extent maps loaded in memory for each iteration, then on
         each iteration we do 1 extent map tree search, 1 b+tree search, plus
         1 more extent map tree traversal to insert an extent map - plus we
         allocate memory for the extent map.
      
         On each iteration we are growing the size of the extent map tree,
         making each future search slower, and also visiting the same b+tree
         leaves over and over again - taking into account with the default leaf
         size of 16K we can fit more than 200 file extent items in a leaf - so
         we can visit the same b+tree leaf 200+ times, on each visit walking
         down a path from the root to the leaf.
      
      So it's easy to see that what we have now doesn't scale well. Also, it
      loads an extent map for every file extent item into memory, which is not
      efficient - we should add extents maps only when doing IO (writing or
      reading file data).
      
      This change implements a new algorithm which scales much better, and
      works like this:
      
      1) We iterate over the subvolume's b+tree, visiting each leaf that has
         file extent items once and only once;
      
      2) For any file extent items found, that don't represent holes or prealloc
         extents, it will not search the extent map tree - there's no need at
         all for that - an extent map is just an in-memory representation of a
         file extent item;
      
      3) When a hole is found, or a prealloc extent, it will check if there's
         delalloc for its range. For this it will search for EXTENT_DELALLOC
         bits in the inode's io tree and check the extent map tree - this is
         for accounting for unflushed delalloc and for flushed delalloc (the
         period between running delalloc and ordered extent completion),
         respectively. This is similar to what the current implementation does
         when it finds a hole or prealloc extent, but without creating extent
         maps and adding them to the extent map tree in case they are not
         loaded in memory;
      
      4) It never allocates extent maps, or adds extent maps to the inode's
         extent map tree. This not only saves memory and time (from the tree
         insertions and allocations), but also eliminates the possibility of
         -ENOMEM due to allocating too many extent maps.
      
      Part of this new code will also be used later for fiemap (which also
      suffers similar scalability problems).
      
      The following test example can be used to quickly measure the efficiency
      before and after this patch:
      
          $ cat test-seek-hole.sh
          #!/bin/bash
      
          DEV=/dev/sdi
          MNT=/mnt/sdi
      
          mkfs.btrfs -f $DEV
      
          mount -o compress=lzo $DEV $MNT
      
          # 16G file -> 131073 compressed extents.
          xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar
      
          # Leave a 1M hole at file offset 15G.
          xfs_io -c "fpunch 15G 1M" $MNT/foobar
      
          # Unmount and mount again, so that we can test when there's no
          # metadata cached in memory.
          umount $MNT
          mount -o compress=lzo $DEV $MNT
      
          # Test seeking for hole from offset 0 (hole is at offset 15G).
      
          start=$(date +%s%N)
          xfs_io -c "seek -h 0" $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "Took $dur milliseconds to seek first hole (metadata not cached)"
          echo
      
          start=$(date +%s%N)
          xfs_io -c "seek -h 0" $MNT/foobar
          end=$(date +%s%N)
          dur=$(( (end - start) / 1000000 ))
          echo "Took $dur milliseconds to seek first hole (metadata cached)"
          echo
      
          umount $MNT
      
      Before this change:
      
          $ ./test-seek-hole.sh
          (...)
          Whence	Result
          HOLE	16106127360
          Took 176 milliseconds to seek first hole (metadata not cached)
      
          Whence	Result
          HOLE	16106127360
          Took 17 milliseconds to seek first hole (metadata cached)
      
      After this change:
      
          $ ./test-seek-hole.sh
          (...)
          Whence	Result
          HOLE	16106127360
          Took 43 milliseconds to seek first hole (metadata not cached)
      
          Whence	Result
          HOLE	16106127360
          Took 13 milliseconds to seek first hole (metadata cached)
      
      That's about 4x faster when no metadata is cached and about 30% faster
      when all metadata is cached.
      
      In practice the differences may often be significantly higher, either due
      to a higher number of extents in a file or because the subvolume's b+tree
      is much bigger than in this example, where we only have one file.
      
      Link: https://lwn.net/Articles/718805/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b6e83356
    • Filipe Manana's avatar
      btrfs: allow hole and data seeking to be interruptible · aed0ca18
      Filipe Manana authored
      Doing hole or data seeking on a file with a very large number of extents
      can take a long time, and we have reports of it being too slow (such as
      at LSFMM from 2017, see the Link below). So make it interruptible.
      
      Link: https://lwn.net/Articles/718805/Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aed0ca18
    • zhang songyi's avatar
      btrfs: remove the unnecessary result variables · bd64f622
      zhang songyi authored
      Return the sysfs_emit() and iterate_object_props() directly instead of
      using unnecessary variables.
      Reported-by: default avatarZeal Robot <zealci@zte.com.cn>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarzhang songyi <zhang.songyi@zte.com.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd64f622
    • Qu Wenruo's avatar
      btrfs: separate BLOCK_GROUP_TREE compat RO flag from EXTENT_TREE_V2 · 1c56ab99
      Qu Wenruo authored
      The problem of long mount time caused by block group item search is
      already known for some time, and the solution of block group tree has
      been proposed.
      
      There is really no need to bound this feature into extent tree v2, just
      introduce compat RO flag, BLOCK_GROUP_TREE, to correctly solve the
      problem.
      
      All the code handling block group root is already in the upstream
      kernel, thus this patch really only needs to introduce the new compat RO
      flag.
      
      This patch introduces one extra artificial limitation on block group
      tree feature, that free space cache v2 and no-holes feature must be
      enabled to use this new compat RO feature.
      
      This artificial requirement is mostly to reduce the test combinations,
      and can be a guideline for future features, to mostly rely on the latest
      default features.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c56ab99
    • Qu Wenruo's avatar
      btrfs: don't save block group root into super block · 14033b08
      Qu Wenruo authored
      The extent tree v2 needs a new root for storing all block group items,
      the whole feature hasn't been finished yet so we can afford to do some
      changes.
      
      My initial proposal years ago just added a new tree rootid, and load it
      from tree root, just like what we did for quota/free space tree/uuid/extent
      roots.
      
      But the extent tree v2 patches introduced a completely new way to store
      block group tree root into super block which is arguably wasteful.
      
      Currently there are only 3 trees stored in super blocks, and they all
      have their valid reasons:
      
      - Chunk root
        Needed for bootstrap.
      
      - Tree root
        Really the entry point for all trees.
      
      - Log root
        This is special as log root has to be updated out of existing
        transaction mechanism.
      
      There is not even any reason to put block group root into super blocks,
      the block group tree is updated at the same time as the old extent tree,
      no need for extra bootstrap/out-of-transaction update.
      
      So just move block group root from super block into tree root.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      14033b08
    • Qu Wenruo's avatar
      btrfs: enhance unsupported compat RO flags handling · 81d5d614
      Qu Wenruo authored
      Currently there are two corner cases not handling compat RO flags
      correctly:
      
      - Remount
        We can still mount the fs RO with compat RO flags, then remount it RW.
        We should not allow any write into a fs with unsupported RO flags.
      
      - Still try to search block group items
        In fact, behavior/on-disk format change to extent tree should not
        need a full incompat flag.
      
        And since we can ensure fs with unsupported RO flags never got any
        writes (with above case fixed), then we can even skip block group
        items search at mount time.
      
      This patch will enhance the unsupported RO compat flags by:
      
      - Reject read-write remount if there are unsupported RO compat flags
      
      - Go dummy block group items directly for unsupported RO compat flags
        In fact, only changes to chunk/subvolume/root/csum trees should go
        incompat flags.
      
      The latter part should allow future change to extent tree to be compat
      RO flags.
      
      Thus this patch also needs to be backported to all stable trees.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      81d5d614
    • Qu Wenruo's avatar
      btrfs: dump all space infos if we abort transaction due to ENOSPC · 8e327b9c
      Qu Wenruo authored
      We have hit some transaction abort due to -ENOSPC internally.
      
      Normally we should always reserve enough space for metadata for every
      transaction, thus hitting -ENOSPC should really indicate some cases we
      didn't expect.
      
      But unfortunately current error reporting will only give a kernel
      warning and stack trace, not really helpful to debug what's causing the
      problem.
      
      And mount option debug_enospc can only help when user can reproduce the
      problem, but under most cases, such transaction abort by -ENOSPC is
      really hard to reproduce.
      
      So this patch will dump all space infos (data, metadata, system) when we
      abort the first transaction with -ENOSPC.
      
      This should at least provide some clue to us.
      
      The example of a dump would look like this:
      
        BTRFS: Transaction aborted (error -28)
        WARNING: CPU: 8 PID: 3366 at fs/btrfs/transaction.c:2137 btrfs_commit_transaction+0xf81/0xfb0 [btrfs]
        <call trace skipped>
        ---[ end trace 0000000000000000 ]---
        BTRFS info (device dm-1: state A): dumping space info:
        BTRFS info (device dm-1: state A): space_info DATA has 6791168 free, is not full
        BTRFS info (device dm-1: state A): space_info total=8388608, used=1597440, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
        BTRFS info (device dm-1: state A): space_info METADATA has 257114112 free, is not full
        BTRFS info (device dm-1: state A): space_info total=268435456, used=131072, pinned=180224, reserved=65536, may_use=10878976, readonly=65536 zone_unusable=0
        BTRFS info (device dm-1: state A): space_info SYSTEM has 8372224 free, is not full
        BTRFS info (device dm-1: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
        BTRFS info (device dm-1: state A): global_block_rsv: size 3670016 reserved 3670016
        BTRFS info (device dm-1: state A): trans_block_rsv: size 0 reserved 0
        BTRFS info (device dm-1: state A): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device dm-1: state A): delayed_block_rsv: size 4063232 reserved 4063232
        BTRFS info (device dm-1: state A): delayed_refs_rsv: size 3145728 reserved 3145728
        BTRFS: error (device dm-1: state A) in btrfs_commit_transaction:2137: errno=-28 No space left
        BTRFS info (device dm-1: state EA): forced readonly
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e327b9c
    • Qu Wenruo's avatar
      btrfs: output human readable space info flag · 25a860c4
      Qu Wenruo authored
      For btrfs_space_info, its flags has only 4 possible values:
      
      - BTRFS_BLOCK_GROUP_SYSTEM
      - BTRFS_BLOCK_GROUP_METADATA | BTRFS_BLOCK_GROUP_DATA
      - BTRFS_BLOCK_GROUP_METADATA
      - BTRFS_BLOCK_GROUP_DATA
      
      Make the output more human readable, now it looks like:
      
        BTRFS info (device dm-1: state A): space_info METADATA has 251494400 free, is not full
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25a860c4
    • Qu Wenruo's avatar
      btrfs: check superblock to ensure the fs was not modified at thaw time · a05d3c91
      Qu Wenruo authored
      [BACKGROUND]
      There is an incident report that, one user hibernated the system, with
      one btrfs on removable device still mounted.
      
      Then by some incident, the btrfs got mounted and modified by another
      system/OS, then back to the hibernated system.
      
      After resuming from the hibernation, new write happened into the victim btrfs.
      
      Now the fs is completely broken, since the underlying btrfs is no longer
      the same one before the hibernation, and the user lost their data due to
      various transid mismatch.
      
      [REPRODUCER]
      We can emulate the situation using the following small script:
      
        truncate -s 1G $dev
        mkfs.btrfs -f $dev
        mount $dev $mnt
        fsstress -w -d $mnt -n 500
        sync
        xfs_freeze -f $mnt
        cp $dev $dev.backup
      
        # There is no way to mount the same cloned fs on the same system,
        # as the conflicting fsid will be rejected by btrfs.
        # Thus here we have to wipe the fs using a different btrfs.
        mkfs.btrfs -f $dev.backup
      
        dd if=$dev.backup of=$dev bs=1M
        xfs_freeze -u $mnt
        fsstress -w -d $mnt -n 20
        umount $mnt
        btrfs check $dev
      
      The final fsck will fail due to some tree blocks has incorrect fsid.
      
      This is enough to emulate the problem hit by the unfortunate user.
      
      [ENHANCEMENT]
      Although such case should not be that common, it can still happen from
      time to time.
      
      From the view of btrfs, we can detect any unexpected super block change,
      and if there is any unexpected change, we just mark the fs read-only,
      and thaw the fs.
      
      By this we can limit the damage to minimal, and I hope no one would lose
      their data by this anymore.
      Suggested-by: default avatarGoffredo Baroncelli <kreijack@libero.it>
      Link: https://lore.kernel.org/linux-btrfs/83bf3b4b-7f4c-387a-b286-9251e3991e34@bluemole.com/Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a05d3c91
    • Christoph Hellwig's avatar
      btrfs: stop allocation a btrfs_io_context for simple I/O · 928ff3be
      Christoph Hellwig authored
      The I/O context structure is only used to pass the btrfs_device to
      the end I/O handler for I/Os that go to a single device.
      
      Stop allocating the I/O context for these cases by passing the optional
      btrfs_io_stripe argument to __btrfs_map_block to query the mapping
      information and then using a fast path submission and I/O completion
      handler.  As the old btrfs_io_context based I/O submission path is
      only used for mirrored writes, rename the functions to make that
      clear and stop setting the btrfs_bio device and mirror_num field
      that is only used for reads.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      928ff3be
    • Christoph Hellwig's avatar
      btrfs: add fast path for single device io in __btrfs_map_block · 03793cbb
      Christoph Hellwig authored
      There is no need for most of the btrfs_io_context when doing I/O to a
      single device.  To support such I/O without the extra btrfs_io_context
      allocation, turn the mirror_num argument into a pointer so that it can
      be used to output the selected mirror number, and add an optional
      argument that points to a btrfs_io_stripe structure, which will be
      filled with a single extent if provided by the caller.
      
      In that case the btrfs_io_context allocation can be skipped as all
      information for the single device I/O is provided in the mirror_num
      argument and the on-stack btrfs_io_stripe.  A caller that makes use of
      this new argument will be added in the next commit.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      03793cbb
    • Christoph Hellwig's avatar
      btrfs: decide bio cloning inside submit_stripe_bio · 28793b19
      Christoph Hellwig authored
      Remove the orig_bio argument as it can be derived from the bioc, and
      the clone argument as it can be calculated from bioc and dev_nr.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28793b19
    • Christoph Hellwig's avatar
      btrfs: factor out low-level bio setup from submit_stripe_bio · 32747c44
      Christoph Hellwig authored
      Split out a low-level btrfs_submit_dev_bio helper that just submits
      the bio without any cloning decisions or setting up the end I/O handler
      for future reuse by a different caller.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      32747c44
    • Christoph Hellwig's avatar
      btrfs: give struct btrfs_bio a real end_io handler · 917f32a2
      Christoph Hellwig authored
      Currently btrfs_bio end I/O handling is a bit of a mess.  The bi_end_io
      handler and bi_private pointer of the embedded struct bio are both used
      to handle the completion of the high-level btrfs_bio and for the I/O
      completion for the low-level device that the embedded bio ends up being
      sent to.
      
      To support this bi_end_io and bi_private are saved into the
      btrfs_io_context structure and then restored after the bio sent to the
      underlying device has completed the actual I/O.
      
      Untangle this by adding an end I/O handler and private data to struct
      btrfs_bio for the high-level btrfs_bio based completions, and leave the
      actual bio bi_end_io handler and bi_private pointer entirely to the
      low-level device I/O.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      917f32a2
    • Christoph Hellwig's avatar
      btrfs: properly abstract the parity raid bio handling · f1c29379
      Christoph Hellwig authored
      The parity raid write/recover functionality is currently not very well
      abstracted from the bio submission and completion handling in volumes.c:
      
       - the raid56 code directly completes the original btrfs_bio fed into
         btrfs_submit_bio instead of dispatching back to volumes.c
       - the raid56 code consumes the bioc and bio_counter references taken
         by volumes.c, which also leads to special casing of the calls from
         the scrub code into the raid56 code
      
      To fix this up supply a bi_end_io handler that calls back into the
      volumes.c machinery, which then puts the bioc, decrements the bio_counter
      and completes the original bio, and updates the scrub code to also
      take ownership of the bioc and bio_counter in all cases.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f1c29379
    • Christoph Hellwig's avatar
      btrfs: use chained bios when cloning · c3a62baf
      Christoph Hellwig authored
      The stripes_pending in the btrfs_io_context counts number of inflight
      low-level bios for an upper btrfs_bio.  For reads this is generally
      one as reads are never cloned, while for writes we can trivially use
      the bio remaining mechanisms that is used for chained bios.
      
      To be able to make use of that mechanism, split out a separate trivial
      end_io handler for the cloned bios that does a minimal amount of error
      tracking and which then calls bio_endio on the original bio to transfer
      control to that, with the remaining counter making sure it is completed
      last.  This then allows to merge btrfs_end_bioc into the original bio
      bi_end_io handler.
      
      To make this all work all error handling needs to happen through the
      bi_end_io handler, which requires a small amount of reshuffling in
      submit_stripe_bio so that the bio is cloned already by the time the
      suitability of the device is checked.
      
      This reduces the size of the btrfs_io_context and prepares splitting
      the btrfs_bio at the stripe boundary.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c3a62baf
    • Christoph Hellwig's avatar
      btrfs: don't take a bio_counter reference for cloned bios · 2bbc72f1
      Christoph Hellwig authored
      Stop grabbing an extra bio_counter reference for each clone bio in a
      mirrored write and instead just release the one original reference in
      btrfs_end_bioc once all the bios for a single btrfs_bio have completed
      instead of at the end of btrfs_submit_bio once all bios have been
      submitted.
      
      This means the reference is now carried by the "upper" btrfs_bio only
      instead of each lower bio.
      
      Also remove the now unused btrfs_bio_counter_inc_noblocked helper.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2bbc72f1
    • Christoph Hellwig's avatar
      btrfs: pass the operation to btrfs_bio_alloc · 6b42f5e3
      Christoph Hellwig authored
      Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
      set does.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6b42f5e3
    • Christoph Hellwig's avatar
      btrfs: move btrfs_bio allocation to volumes.c · d45cfb88
      Christoph Hellwig authored
      volumes.c is the place that implements the storage layer using the
      btrfs_bio structure, so move the bio_set and allocation helpers there
      as well.
      
      To make up for the new initialization boilerplate, merge the two
      init/exit helpers in extent_io.c into a single one.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d45cfb88
    • Christoph Hellwig's avatar
      btrfs: don't create integrity bioset for btrfs_bioset · 1e408af3
      Christoph Hellwig authored
      btrfs never uses bio integrity data itself, so don't allocate
      the integrity pools for btrfs_bioset.
      
      This patch is a revert of the commit b208c2f7 ("btrfs: Fix crash due
      to not allocating integrity data for a set").  The integrity data pool
      is not needed, the bio-integrity code now handles allocating the
      integrity payload without that.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Tested-by: default avatarNikolay Borisov <nborisov@suse.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e408af3
    • Josef Bacik's avatar
      btrfs: remove use btrfs_remove_free_space_cache instead of variant · fc80f7ac
      Josef Bacik authored
      We are calling __btrfs_remove_free_space_cache everywhere to cleanup the
      block group free space, however we can just use
      btrfs_remove_free_space_cache and pass in the block group in all of
      these places.  Then we can remove __btrfs_remove_free_space_cache and
      rename __btrfs_remove_free_space_cache_locked to
      __btrfs_remove_free_space_cache.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc80f7ac
    • Josef Bacik's avatar
      btrfs: call __btrfs_remove_free_space_cache_locked on cache load failure · 8a1ae278
      Josef Bacik authored
      Now that lockdep is staying enabled through our entire CI runs I started
      seeing the following stack in generic/475
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 2171864 at fs/btrfs/discard.c:604 btrfs_discard_update_discardable+0x98/0xb0
      CPU: 1 PID: 2171864 Comm: kworker/u4:0 Not tainted 5.19.0-rc8+ #789
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Workqueue: btrfs-cache btrfs_work_helper
      RIP: 0010:btrfs_discard_update_discardable+0x98/0xb0
      RSP: 0018:ffffb857c2f7bad0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8c85c605c200 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: ffffffff86807c5b RDI: ffffffff868a831e
      RBP: ffff8c85c4c54000 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8c85c66932f0 R11: 0000000000000001 R12: ffff8c85c3899010
      R13: ffff8c85d5be4f40 R14: ffff8c85c4c54000 R15: ffff8c86114bfa80
      FS:  0000000000000000(0000) GS:ffff8c863bd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f2e7f168160 CR3: 000000010289a004 CR4: 0000000000370ee0
      Call Trace:
      
       __btrfs_remove_free_space_cache+0x27/0x30
       load_free_space_cache+0xad2/0xaf0
       caching_thread+0x40b/0x650
       ? lock_release+0x137/0x2d0
       btrfs_work_helper+0xf2/0x3e0
       ? lock_is_held_type+0xe2/0x140
       process_one_work+0x271/0x590
       ? process_one_work+0x590/0x590
       worker_thread+0x52/0x3b0
       ? process_one_work+0x590/0x590
       kthread+0xf0/0x120
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x1f/0x30
      
      This is the code
      
              ctl = block_group->free_space_ctl;
              discard_ctl = &block_group->fs_info->discard_ctl;
      
              lockdep_assert_held(&ctl->tree_lock);
      
      We have a temporary free space ctl for loading the free space cache in
      order to avoid having allocations happening while we're loading the
      cache.  When we hit an error we free it all up, however this also calls
      btrfs_discard_update_discardable, which requires
      block_group->free_space_ctl->tree_lock to be held.  However this is our
      temporary ctl so this lock isn't held.  Fix this by calling
      __btrfs_remove_free_space_cache_locked instead so that we only clean up
      the entries and do not mess with the discardable stats.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a1ae278
    • Filipe Manana's avatar
      btrfs: fix race between quota enable and quota rescan ioctl · 331cd946
      Filipe Manana authored
      When enabling quotas, at btrfs_quota_enable(), after committing the
      transaction, we change fs_info->quota_root to point to the quota root we
      created and set BTRFS_FS_QUOTA_ENABLED at fs_info->flags. Then we try
      to start the qgroup rescan worker, first by initializing it with a call
      to qgroup_rescan_init() - however if that fails we end up freeing the
      quota root but we leave fs_info->quota_root still pointing to it, this
      can later result in a use-after-free somewhere else.
      
      We have previously set the flags BTRFS_FS_QUOTA_ENABLED and
      BTRFS_QGROUP_STATUS_FLAG_ON, so we can only fail with -EINPROGRESS at
      btrfs_quota_enable(), which is possible if someone already called the
      quota rescan ioctl, and therefore started the rescan worker.
      
      So fix this by ignoring an -EINPROGRESS and asserting we can't get any
      other error.
      Reported-by: default avatarYe Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/linux-btrfs/20220823015931.421355-1-yebin10@huawei.com/
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      331cd946
    • Maciej S. Szmigiero's avatar
      btrfs: don't print information about space cache or tree every remount · dbecac26
      Maciej S. Szmigiero authored
      btrfs currently prints information about space cache or free space tree
      being in use on every remount, regardless whether such remount actually
      enabled or disabled one of these features.
      
      This is actually unnecessary since providing remount options changing the
      state of these features will explicitly print the appropriate notice.
      
      Let's instead print such unconditional information just on an initial mount
      to avoid filling the kernel log when, for example, laptop-mode-tools
      remount the fs on some events.
      Signed-off-by: default avatarMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dbecac26
    • Filipe Manana's avatar
      btrfs: simplify error handling at btrfs_del_root_ref() · 1fdbd03d
      Filipe Manana authored
      At btrfs_del_root_ref() we are using two return variables, named 'ret'
      and 'err'. This makes it harder to follow and easier to return the wrong
      value in case an error happens - the previous patch in the series, which
      has the subject "btrfs: fix silent failure when deleting root
      reference", fixed a bug due to confusion created by these two variables.
      
      So change the function to use a single variable for tracking the return
      value of the function, using only 'ret', which is consistent with most
      of the codebase.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1fdbd03d
    • Omar Sandoval's avatar
      btrfs: get rid of block group caching progress logic · 48ff7083
      Omar Sandoval authored
      struct btrfs_caching_ctl::progress and struct
      btrfs_block_group::last_byte_to_unpin were previously needed to ensure
      that unpin_extent_range() didn't return a range to the free space cache
      before the caching thread had a chance to cache that range. However, the
      commit "btrfs: fix space cache corruption and potential double
      allocations" made it so that we always synchronously cache the block
      group at the time that we pin the extent, so this machinery is no longer
      necessary.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48ff7083
    • BingJing Chang's avatar
      btrfs: send: fix failures when processing inodes with no links · 9ed0a72e
      BingJing Chang authored
      There is a bug causing send failures when processing an orphan directory
      with no links. In commit 46b2f459 ("Btrfs: fix send failure when
      root has deleted files still open")', the orphan inode issue was
      addressed. The send operation fails with a ENOENT error because of any
      attempts to generate a path for the inode with a link count of zero.
      Therefore, in that patch, sctx->ignore_cur_inode was introduced to be
      set if the current inode has a link count of zero for bypassing some
      unnecessary steps. And a helper function btrfs_unlink_all_paths() was
      introduced and called to clean up old paths found in the parent
      snapshot. However, not only regular files but also directories can be
      orphan inodes. So if the send operation meets an orphan directory, it
      will issue a wrong unlink command for that directory now. Soon the
      receive operation fails with a EISDIR error. Besides, the send operation
      also fails with a ENOENT error later when it tries to generate a path of
      it.
      
      Similar example but making an orphan dir for an incremental send:
      
        $ btrfs subvolume create vol
        $ mkdir vol/dir
        $ touch vol/dir/foo
      
        $ btrfs subvolume snapshot -r vol snap1
        $ btrfs subvolume snapshot -r vol snap2
      
        # Turn the second snapshot to RW mode and delete the whole dir while
        # holding an open file descriptor on it.
        $ btrfs property set snap2 ro false
        $ exec 73<snap2/dir
        $ rm -rf snap2/dir
      
        # Set the second snapshot back to RO mode and do an incremental send.
        $ btrfs property set snap2 ro true
        $ mkdir receive_dir
        $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
        At subvol snap2
        At snapshot snap2
        ERROR: send ioctl failed with -2: No such file or directory
        ERROR: unlink dir failed. Is a directory
      
      Actually, orphan inodes are more common use cases in cascading backups.
      (Please see the illustration below.) In a cascading backup, a user wants
      to replicate a couple of snapshots from Machine A to Machine B and from
      Machine B to Machine C. Machine B doesn't take any RO snapshots for
      sending. All a receiver does is create an RW snapshot of its parent
      snapshot, apply the send stream and turn it into RO mode at the end.
      Even if all paths of some inodes are deleted in applying the send
      stream, these inodes would not be deleted and become orphans after
      changing the subvolume from RW to RO. Moreover, orphan inodes can occur
      not only in send snapshots but also in parent snapshots because Machine
      B may do a batch replication of a couple of snapshots.
      
      An illustration for cascading backups:
      
        Machine A (snapshot {1..n}) --> Machine B --> Machine C
      
      The idea to solve the problem is to delete all the items of orphan
      inodes before using these snapshots for sending. I used to think that
      the reasonable timing for doing that is during the ioctl of changing the
      subvolume from RW to RO because it sounds good that we will not modify
      the fs tree of a RO snapshot anymore. However, attempting to do the
      orphan cleanup in the ioctl would be pointless. Because if someone is
      holding an open file descriptor on the inode, the reference count of the
      inode will never drop to 0. Then iput() cannot trigger eviction, which
      finally deletes all the items of it. So we try to extend the original
      patch to handle orphans in send/parent snapshots. Here are several cases
      that need to be considered:
      
      Case 1: BTRFS_COMPARE_TREE_NEW
             |  send snapshot  | action
       --------------------------------
       nlink |        0        | ignore
      
      In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
      it means that a new inode is found in the send snapshot and it doesn't
      appear in the parent snapshot. Since this inode has a link count of zero
      (It's an orphan and there're no paths for it.), we can leverage
      sctx->ignore_cur_inode in the original patch to prevent it from being
      created.
      
      Case 2: BTRFS_COMPARE_TREE_DELETED
             | parent snapshot | action
       ----------------------------------
       nlink |        0        | as usual
      
      In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
      result, it means that the inode only appears in the parent snapshot.
      As usual, the send operation will try to delete all its paths. However,
      this inode has a link count of zero, so no paths of it will be found. No
      deletion operations will be issued. We don't need to change any logic.
      
      Case 3: BTRFS_COMPARE_TREE_CHANGED
                 |       | parent snapshot | send snapshot | action
       -----------------------------------------------------------------------
       subcase 1 | nlink |        0        |       0       | ignore
       subcase 2 | nlink |       >0        |       0       | new_gen(deletion)
       subcase 3 | nlink |        0        |      >0       | new_gen(creation)
      
      In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
      it means that the inode appears in both snapshots. Here are 3 subcases.
      
      First, when the inode has link counts of zero in both snapshots. Since
      there are no paths for this inode in (source/destination) parent
      snapshots and we don't care about whether there is also an orphan inode
      in destination or not, we can set sctx->ignore_cur_inode on to prevent
      it from being created.
      
      For the second and the third subcases, if there are paths in one
      snapshot and there're no paths in the other snapshot for this inode. We
      can treat this inode as a new generation. We can also leverage the logic
      handling a new generation of an inode with small adjustments. Then it
      will delete all old paths and create a new inode with new attributes and
      paths only when there's a positive link count in the send snapshot.
      
      In subcase 2, the send operation only needs to delete all old paths as
      in the parent snapshot. But it may require more operations for a
      directory to remove its old paths. If a not-empty directory is going to
      be deleted (because it has a link count of zero in the send snapshot)
      but there are files/directories with bigger inode numbers under it, the
      send operation will need to rename it to its orphan name first. After
      processing and deleting the last item under this directory, the send
      operation will check this directory, aka the parent directory of the
      last item, again and issue a rmdir operation to remove it finally.
      
      Therefore, we also need to treat inodes with a link count of zero as if
      they didn't exist in get_cur_inode_state(), which is used in
      process_recorded_refs(). By doing this, when checking a directory with
      orphan names after the last item under it has been deleted, the send
      operation now can properly issue a rmdir operation. Otherwise, without
      doing this, the orphan directory with an orphan name would be kept here
      at the end due to the existing inode with a link count of zero being
      found.
      
      In subcase 3, as in case 2, no old paths would be found, so no deletion
      operations will be issued. The send operation will only create a new one
      for that inode.
      
      Note that subcase 3 is not common. That's because it's easy to reduce
      the hard links of an inode, but once all valid paths are removed,
      there are no valid paths for creating other hard links. The only way to
      do that is trying to send an older snapshot after a newer snapshot has
      been sent.
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9ed0a72e
    • BingJing Chang's avatar
      btrfs: send: refactor arguments of get_inode_info() · 7e93f6dc
      BingJing Chang authored
      Refactor get_inode_info() to populate all wanted fields on an output
      structure. Besides, also introduce a helper function called
      get_inode_gen(), which is commonly used.
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7e93f6dc
    • Ethan Lien's avatar
      btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path · 52b029f4
      Ethan Lien authored
      After we copied data to page cache in buffered I/O, we
      1. Insert a EXTENT_UPTODATE state into inode's io_tree, by
         endio_readpage_release_extent(), set_extent_delalloc() or
         set_extent_defrag().
      2. Set page uptodate before we unlock the page.
      
      But the only place we check io_tree's EXTENT_UPTODATE state is in
      btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
      have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.
      
      For example, when performing a buffered random read:
      
      	fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
      		--filesize=32G --size=4G --bs=4k --name=job \
      		--filename=/mnt/file --name=job
      
      Then check how many extent_state in io_tree:
      
      	cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'
      
      w/o this patch, we got 640567 btrfs_extent_state.
      w/  this patch, we got    204 btrfs_extent_state.
      
      Maintaining such a big tree brings overhead since every I/O needs to insert
      EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
      every insert or remove, we need to lock io_tree, do tree search, alloc or
      dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
      io_tree in a minimal size and reduce overhead when performing buffered I/O.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Signed-off-by: default avatarEthan Lien <ethanlien@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      52b029f4
    • Filipe Manana's avatar
      btrfs: simplify adding and replacing references during log replay · 7059c658
      Filipe Manana authored
      During log replay, when adding/replacing inode references, there are two
      special cases that have special code for them:
      
      1) When we have an inode with two or more hardlinks in the same directory,
         therefore two or more names encoded in the same inode reference item,
         and one of the hard links gets renamed to the old name of another hard
         link - that is, the index number for a name changes. This was added in
         commit 0d836392 ("Btrfs: fix mount failure after fsync due to
         hard link recreation"), and is covered by test case generic/502 from
         fstests;
      
      2) When we have several inodes that got renamed to an old name of some
         other inode, in a cascading style. The code to deal with this special
         case was added in commit 6b5fc433 ("Btrfs: fix fsync after
         succession of renames of different files"), and is covered by test
         cases generic/526 and generic/527 from fstests.
      
      Both cases can be deal with by making sure __add_inode_ref() is always
      called by add_inode_ref() for every name encoded in the inode reference
      item, and not just for the first name that has a conflict. With such
      change we no longer need that special casing for the two cases mentioned
      before. So do those changes.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7059c658
    • David Sterba's avatar
      btrfs: sysfs: show discard stats and tunables in non-debug build · fb731430
      David Sterba authored
      When discard=async was introduced there were also sysfs knobs and stats
      for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults
      have been set and so far seem to satisfy all users on a range of
      workloads. As there are not only tunables (like iops or kbps) but also
      stats tracking amount of discardable bytes, that should be available
      when the async discard is on (otherwise it's not).
      
      The stats are moved from the per-fs debug directory, so it's under
        /sys/fs/btrfs/FSID/discard
      
      - discard_bitmap_bytes - amount of discarded bytes from data tracked as
                               bitmaps
      - discard_extent_bytes - dtto but as extents
      - discard_bytes_saved -
      - discardable_bytes - amount of bytes that can be discarded
      - discardable_extents - number of extents to be discarded
      - iops_limit - tunable limit of number of discard IOs to be issued
      - kbps_limit - tunable limit of kilobytes per second issued as discard IO
      - max_discard_size - tunable limit for size of one IO discard request
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fb731430
    • Filipe Manana's avatar
      btrfs: use delayed items when logging a directory · 30b80f3c
      Filipe Manana authored
      When logging a directory we start by flushing all its delayed items.
      That results in adding dir index items to the subvolume btree, for new
      dentries, and removing dir index items from the subvolume btree for any
      dentries that were deleted.
      
      This makes it straightforward to log a directory simply by iterating over
      all the modified subvolume btree leaves, especially when we used to log
      both dir index keys and dir item keys (before commit 339d0354
      ("btrfs: only copy dir index keys when logging a directory") and when we
      used to copy old dir index entries for leaves modified in the current
      transaction (before commit 732d591a ("btrfs: stop copying old dir
      items when logging a directory")).
      
      From an efficiency point of view this has a couple of drawbacks:
      
      1) Adds extra latency, due to copying delayed items to the subvolume btree
         and deleting dir index items from the btree.
      
         Further if there are other tasks accessing the btree, which is common
         (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
         etc, finishing an ordered extent, etc), lock contention can cause
         further delays, both to the task logging a directory and to the other
         tasks accessing the btree;
      
      2) More time spent overall flushing delayed items, if after logging the
         directory further changes are done to the directory in the same
         transaction.
      
         For example, if we add 10 dentries to a directory, fsync it, add more
         10 dentries, fsync it again, then add more 10 dentries and fsync it
         again, then we end up inserting 3 batches of 10 items to the subvolume
         btree. With the changes from this patch, we flush all the delayed items
         to the btree only once - a single batch of 30 items, and outside the
         logging code (transaction commit or when delayed items are flushed
         asynchronously).
      
      This change simply skips the flushing of delayed items every time we log a
      directory. Instead we copy the delayed insertion items directly to the log
      tree and delete delayed deletion items directly from the log tree.
      Therefore avoiding changing first the subvolume btree and then scanning it
      for new items to copy from it to the log tree and detecting deletions
      by observing gaps in consecutive dir index keys in subvolume btree leaves.
      
      Running the following tests on a non-debug kernel (Debian's default kernel
      config), on a box with a NVMe device, a 12 cores Intel CPU and 64G of ram,
      produced the results below.
      
      The results compare a branch without this patch and all the other patches
      it depends on versus the same branch with the patchset applied.
      
      The patchset is comprised of the following patches:
      
        btrfs: don't drop dir index range items when logging a directory
        btrfs: remove the root argument from log_new_dir_dentries()
        btrfs: update stale comment for log_new_dir_dentries()
        btrfs: free list element sooner at log_new_dir_dentries()
        btrfs: avoid memory allocation at log_new_dir_dentries() for common case
        btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
        btrfs: store index number instead of key in struct btrfs_delayed_item
        btrfs: remove unused logic when looking up delayed items
        btrfs: shrink the size of struct btrfs_delayed_item
        btrfs: search for last logged dir index if it's not cached in the inode
        btrfs: move need_log_inode() to above log_conflicting_inodes()
        btrfs: move log_new_dir_dentries() above btrfs_log_inode()
        btrfs: log conflicting inodes without holding log mutex of the initial inode
        btrfs: skip logging parent dir when conflicting inode is not a dir
        btrfs: use delayed items when logging a directory
      
      Custom test script for testing time spent at btrfs_log_inode():
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         # Total number of files to create in the test directory.
         NUM_FILES=10000
         # Fsync after creating or renaming N files.
         FSYNC_AFTER=100
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         TEST_DIR=$MNT/testdir
         mkdir $TEST_DIR
      
         echo "Creating files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 echo -n > $TEST_DIR/file_$i
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         sync
      
         echo "Renaming files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         umount $MNT
      
      And using the following bpftrace script to capture the total time that is
      spent at btrfs_log_inode():
      
         #!/usr/bin/bpftrace
      
         k:btrfs_log_inode
         {
                 @start_log_inode[tid] = nsecs;
         }
      
         kr:btrfs_log_inode
         /@start_log_inode[tid]/
         {
                 $dur = (nsecs - @start_log_inode[tid]) / 1000;
                 @btrfs_log_inode_total_time = sum($dur);
                 delete(@start_log_inode[tid]);
         }
      
         END
         {
                 clear(@start_log_inode);
         }
      
      Result before applying patchset:
      
         @btrfs_log_inode_total_time: 622642
      
      Result after applying patchset:
      
         @btrfs_log_inode_total_time: 354134    (-43.1% time spent)
      
      The following dbench script was also used for testing:
      
         #!/bin/bash
      
         NUM_JOBS=$(nproc --all)
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
         echo "performance" | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
      
         umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3322265     0.034    21.032
       Close        2440562     0.002     0.994
       Rename        140664     1.150   269.633
       Unlink        670796     1.093   269.678
       Deltree           96     5.481    15.510
       Mkdir             48     0.004     0.052
       Qpathinfo    3010924     0.014     8.127
       Qfileinfo     528055     0.001     0.518
       Qfsinfo       552113     0.003     0.372
       Sfileinfo     270575     0.005     0.688
       Find         1164176     0.052    13.931
       WriteX       1658537     0.019     5.918
       ReadX        5207412     0.003     1.034
       LockX          10818     0.003     0.079
       UnlockX        10818     0.002     0.313
       Flush         232811     1.027   269.735
      
      Throughput 869.867 MB/sec (sync dirs)  12 clients  12 procs  max_latency=269.741 ms
      
      After patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4152738     0.029    20.863
       Close        3050770     0.002     1.119
       Rename        175829     0.871   211.741
       Unlink        838447     0.845   211.724
       Deltree          120     4.798    14.162
       Mkdir             60     0.003     0.005
       Qpathinfo    3763807     0.011     4.673
       Qfileinfo     660111     0.001     0.400
       Qfsinfo       690141     0.003     0.429
       Sfileinfo     338260     0.005     0.725
       Find         1455273     0.046     6.787
       WriteX       2073307     0.017     5.690
       ReadX        6509193     0.003     1.171
       LockX          13522     0.003     0.077
       UnlockX        13522     0.002     0.125
       Flush         291044     0.811   211.631
      
      Throughput 1089.27 MB/sec (sync dirs)  12 clients  12 procs  max_latency=211.750 ms
      
      (+25.2% throughput, -21.5% max latency)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      30b80f3c