1. 23 Mar, 2020 40 commits
    • Filipe Manana's avatar
      btrfs: make ranged full fsyncs more efficient · 0a8068a3
      Filipe Manana authored
      Commit 0c713cba ("Btrfs: fix race between ranged fsync and writeback
      of adjacent ranges") fixed a bug where we could end up with file extent
      items in a log tree that represent file ranges that overlap due to a race
      between the hole detection of a ranged full fsync and writeback for a
      different file range.
      
      The problem was solved by forcing any ranged full fsync to become a
      non-ranged full fsync - setting the range start to 0 and the end offset to
      LLONG_MAX. This was a simple solution because the code that detected and
      marked holes was very complex, it used to be done at copy_items() and
      implied several searches on the fs/subvolume tree. The drawback of that
      solution was that we started to flush delalloc for the entire file and
      wait for all the ordered extents to complete for ranged full fsyncs
      (including ordered extents covering ranges completely outside the given
      range). Fortunatelly ranged full fsyncs are not the most common case
      (hopefully for most workloads).
      
      However a later fix for detecting and marking holes was made by commit
      0e56315c ("Btrfs: fix missing hole after hole punching and fsync
      when using NO_HOLES") and it simplified a lot the detection of holes,
      and now copy_items() no longer does it and we do it in a much more simple
      way at btrfs_log_holes().
      
      This makes it now possible to simply make the code that detects holes to
      operate only on the initial range and no longer need to operate on the
      whole file, while also avoiding the need to flush delalloc for the entire
      file and wait for ordered extents that cover ranges that don't overlap the
      given range.
      
      Another special care is that we must skip file extent items that fall
      entirely outside the fsync range when copying inode items from the
      fs/subvolume tree into the log tree - this is to avoid races with ordered
      extent completion for extents falling outside the fsync range, which could
      cause us to end up with file extent items in the log tree that have
      overlapping ranges - for example if the fsync range is [1Mb, 2Mb], when
      we copy inode items we could copy an extent item for the range [0, 512K],
      then release the search path and before moving to the next leaf, an
      ordered extent for a range of [256Kb, 512Kb] completes - this would
      cause us to copy the new extent item for range [256Kb, 512Kb] into the
      log tree after we have copied one for the range [0, 512Kb] - the extents
      overlap, resulting in a corruption.
      
      So this change just does these steps:
      
      1) When the NO_HOLES feature is enabled it leaves the initial range
         intact - no longer sets it to [0, LLONG_MAX] when the full sync bit
         is set in the inode. If NO_HOLES is not enabled, always set the range
         to a full, just like before this change, to avoid missing file extent
         items representing holes after replaying the log (for both full and
         fast fsyncs);
      
      2) Make the hole detection code to operate only on the fsync range;
      
      3) Make the code that copies items from the fs/subvolume tree to skip
         copying file extent items that cover a range completely outside the
         range of the fsync.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0a8068a3
    • Filipe Manana's avatar
      btrfs: factor out inode items copy loop from btrfs_log_inode() · da447009
      Filipe Manana authored
      The function btrfs_log_inode() is quite large and so is its loop which
      iterates the inode items from the fs/subvolume tree and copies them into
      a log tree. Because this is a large loop inside a very large function
      and because an upcoming patch in this series needs to add some more logic
      inside that loop, move the loop into a helper function to make it a bit
      more manageable.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da447009
    • Filipe Manana's avatar
      btrfs: add helper to get the end offset of a file extent item · a5eeb3d1
      Filipe Manana authored
      Getting the end offset for a file extent item requires a bit of code since
      the extent can be either inline or regular/prealloc. There are some places
      all over the code base that open code this logic and in another patch
      later in this series it will be needed again. Therefore encapsulate this
      logic in a helper function and use it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a5eeb3d1
    • Filipe Manana's avatar
      btrfs: fix missing file extent item for hole after ranged fsync · 95418ed1
      Filipe Manana authored
      When doing a fast fsync for a range that starts at an offset greater than
      zero, we can end up with a log that when replayed causes the respective
      inode miss a file extent item representing a hole if we are not using the
      NO_HOLES feature. This is because for fast fsyncs we don't log any extents
      that cover a range different from the one requested in the fsync.
      
      Example scenario to trigger it:
      
        $ mkfs.btrfs -O ^no-holes -f /dev/sdd
        $ mount /dev/sdd /mnt
      
        # Create a file with a single 256K and fsync it to clear to full sync
        # bit in the inode - we want the msync below to trigger a fast fsync.
        $ xfs_io -f -c "pwrite -S 0xab 0 256K" -c "fsync" /mnt/foo
      
        # Force a transaction commit and wipe out the log tree.
        $ sync
      
        # Dirty 768K of data, increasing the file size to 1Mb, and flush only
        # the range from 256K to 512K without updating the log tree
        # (sync_file_range() does not trigger fsync, it only starts writeback
        # and waits for it to finish).
      
        $ xfs_io -c "pwrite -S 0xcd 256K 768K" /mnt/foo
        $ xfs_io -c "sync_range -abw 256K 256K" /mnt/foo
      
        # Now dirty the range from 768K to 1M again and sync that range.
        $ xfs_io -c "mmap -w 768K 256K"        \
                 -c "mwrite -S 0xef 768K 256K" \
                 -c "msync -s 768K 256K"       \
                 -c "munmap"                   \
                 /mnt/foo
      
        <power fail>
      
        # Mount to replay the log.
        $ mount /dev/sdd /mnt
        $ umount /mnt
      
        $ btrfs check /dev/sdd
        Opening filesystem to check...
        Checking filesystem on /dev/sdd
        UUID: 482fb574-b288-478e-a190-a9c44a78fca6
        [1/7] checking root items
        [2/7] checking extents
        [3/7] checking free space cache
        [4/7] checking fs roots
        root 5 inode 257 errors 100, file extent discount
        Found file extent holes:
             start: 262144, len: 524288
        ERROR: errors found in fs roots
        found 720896 bytes used, error(s) found
        total csum bytes: 512
        total tree bytes: 131072
        total fs tree bytes: 32768
        total extent tree bytes: 16384
        btree space waste bytes: 123514
        file data blocks allocated: 589824
          referenced 589824
      
      Fix this issue by setting the range to full (0 to LLONG_MAX) when the
      NO_HOLES feature is not enabled. This results in extra work being done
      but it gives the guarantee we don't end up with missing holes after
      replaying the log.
      
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      95418ed1
    • Nikolay Borisov's avatar
      btrfs: account ticket size at add/delete time · db161806
      Nikolay Borisov authored
      Instead of iterating all pending tickets on the normal/priority list to
      sum their total size the cost can be amortized across ticket addition/
      removal. This turns O(n) + O(m) (where n is the size of the normal list
      and m of the priority list) into O(1). This will mostly have effect in
      workloads that experience heavy flushing.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db161806
    • Roman Gushchin's avatar
      btrfs: implement migratepage callback for data pages · f8e66081
      Roman Gushchin authored
      Currently btrfs doesn't provide a migratepage callback for data pages.
      It means that fallback_migrate_page() is used to migrate btrfs pages.
      
      fallback_migrate_page() cannot move dirty pages, instead it tries to
      flush them (in sync mode) or just fails (in async mode).
      
      In the sync mode pages which are scheduled to be processed by
      btrfs_writepage_fixup_worker() can't be effectively flushed by the
      migration code, because there is no established way to wait for the
      completion of the delayed work.
      
      It all leads to page migration failures.
      
      To fix it the patch implements a btrs-specific migratepage callback,
      which is similar to iomap_migrate_page() used by some other fs, except
      it does take care of the PagePrivate2 flag which is used for data
      ordering purposes.
      Reviewed-by: default avatarChris Mason <clm@fb.com>
      Signed-off-by: default avatarRoman Gushchin <guro@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f8e66081
    • Nikolay Borisov's avatar
      btrfs: Remove block_rsv parameter from btrfs_drop_snapshot · 0078a9f9
      Nikolay Borisov authored
      It's no longer used following 30d40577 ("btrfs: reloc: Also queue
      orphan reloc tree for cleanup to avoid BUG_ON()"), so just remove it.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0078a9f9
    • Nikolay Borisov's avatar
      btrfs: Remove __ prefix from btrfs_block_rsv_release · 63f018be
      Nikolay Borisov authored
      Currently the non-prefixed version is a simple wrapper used to hide
      the 4th argument of the prefixed version. This doesn't bring much value
      in practice and only makes the code harder to follow by adding another
      level of indirection. Rectify this by removing the __ prefix and
      have only one public function to release bytes from a block reservation.
      No semantic changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63f018be
    • Qu Wenruo's avatar
      btrfs: relocation: Check cancel request after each extent found · f31ea088
      Qu Wenruo authored
      When relocating data block groups with tons of small extents, or large
      metadata block groups, there can be over 200,000 extents.
      
      We will iterate all extents of such block group in relocate_block_group(),
      where iteration itself can be kinda time-consuming.
      
      So when user want to cancel the balance, the extent iteration loop can
      be another target.
      
      This patch will add the cancelling check in the extent iteration loop of
      relocate_block_group() to make balance cancelling faster.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f31ea088
    • Qu Wenruo's avatar
      btrfs: relocation: Check cancel request after each data page read · 7f913c7c
      Qu Wenruo authored
      When relocating a data extents with large large data extents, we spend
      most of our time in relocate_file_extent_cluster() at stage "moving data
      extents":
      
       1)               |  btrfs_relocate_block_group [btrfs]() {
       1)               |    relocate_file_extent_cluster [btrfs]() {
       1) $ 6586769 us  |    }
       1) + 18.260 us   |    relocate_file_extent_cluster [btrfs]();
       1) + 15.770 us   |    relocate_file_extent_cluster [btrfs]();
       1) $ 8916340 us  |  }
       1)               |  btrfs_relocate_block_group [btrfs]() {
       1)               |    relocate_file_extent_cluster [btrfs]() {
       1) $ 11611586 us |    }
       1) + 16.930 us   |    relocate_file_extent_cluster [btrfs]();
       1) + 15.870 us   |    relocate_file_extent_cluster [btrfs]();
       1) $ 14986130 us |  }
      
      To make data relocation cancelling quicker, add extra balance cancelling
      check after each page read in relocate_file_extent_cluster().
      
      Cleanup and error handling uses the same mechanism as if the whole
      process finished
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f913c7c
    • Qu Wenruo's avatar
      btrfs: relocation: add error injection points for cancelling balance · 726a3421
      Qu Wenruo authored
      Introduce a new error injection point, should_cancel_balance().
      
      It's just a wrapper of atomic_read(&fs_info->balance_cancel_req), but
      allows us to override the return value.
      
      Currently there are only one locations using this function:
      
      - btrfs_balance()
        It checks cancel before each block group.
      
      There are other locations checking fs_info->balance_cancel_req, but they
      are not used as an indicator to exit, so there is no need to use the
      wrapper.
      
      But there will be more locations coming, and some locations can cause
      kernel panic if not handled properly.  So introduce this error injection
      to provide better test interface.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      726a3421
    • Filipe Manana's avatar
      Btrfs: implement full reflink support for inline extents · 05a5a762
      Filipe Manana authored
      There are a few cases where we don't allow cloning an inline extent into
      the destination inode, returning -EOPNOTSUPP to user space. This was done
      to prevent several types of file corruption and because it's not very
      straightforward to deal with these cases, as they can't rely on simply
      copying the inline extent between leaves. Such cases require copying the
      inline extent's data into the respective page of the destination inode.
      
      Not supporting these cases makes it harder and more cumbersome to write
      applications/libraries that work on any filesystem with reflink support,
      since all these cases for which btrfs fails with -EOPNOTSUPP work just
      fine on xfs for example. These unsupported cases are also not documented
      anywhere and explaining which exact cases fail require a bit of too
      technical understanding of btrfs's internal (inline extents and when and
      where can they exist in a file), so it's not really user friendly.
      
      Also some test cases from fstests that use fsx, such as generic/522 for
      example, can sporadically fail because they trigger one of these cases,
      and fsx expects all operations to succeed.
      
      This change adds supports for cloning all these cases by copying the
      inline extent's data into the respective page of the destination inode.
      
      With this change test case btrfs/112 from fstests fails because it
      expects some clone operations to fail, so it will be updated. Also a
      new test case that exercises all these previously unsupported cases
      will be added to fstests.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      05a5a762
    • Filipe Manana's avatar
      Btrfs: simplify inline extent handling when doing reflinks · a61e1e0d
      Filipe Manana authored
      We can not reflink parts of an inline extent, we must always reflink the
      whole inline extent. We know that inline extents always start at file
      offset 0 and that can never represent an amount of data larger then the
      filesystem's sector size (both compressed and uncompressed). We also have
      had the constraints that reflink operations must have a start offset that
      is aligned to the sector size and an end offset that is also aligned or
      it ends the inode's i_size, so there's no way for user space to be able
      to do a reflink operation that will refer to only a part of an inline
      extent.
      
      Initially there was a bug in the inlining code that could allow compressed
      inline extents that encoded more than 1 page, but that was fixed in 2008
      by commit 70b99e69 ("Btrfs: Compression corner fixes") since that
      was problematic.
      
      So remove all the extent cloning code that deals with the possibility
      of cloning only partial inline extents.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a61e1e0d
    • Filipe Manana's avatar
      Btrfs: move all reflink implementation code into its own file · 6a177381
      Filipe Manana authored
      The reflink code is quite large and has been living in ioctl.c since ever.
      It has grown over the years after many bug fixes and improvements, and
      since I'm planning on making some further improvements on it, it's time
      to get it better organized by moving into its own file, reflink.c
      (similar to what xfs does for example).
      
      This change only moves the code out of ioctl.c into the new file, it
      doesn't do any other change.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6a177381
    • Gustavo A. R. Silva's avatar
      btrfs: scrub: Replace zero-length array with flexible-array member · a8753ee3
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array
      member[1][2], introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning in
      case the flexible array does not occur last in the structure, which will
      help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by this
      change:
      
        "Flexible array members have incomplete type, and so the sizeof operator
         may not be applied. As a quirk of the original implementation of
         zero-length arrays, sizeof evaluates to zero." [1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a8753ee3
    • Gustavo A. R. Silva's avatar
      btrfs: rcu-string: Replace zero-length array with flexible-array member · 7593f4c5
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array
      member[1][2], introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning in
      case the flexible array does not occur last in the structure, which will
      help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by this
      change:
      
       "Flexible array members have incomplete type, and so the sizeof operator
        may not be applied. As a quirk of the original implementation of
        zero-length arrays, sizeof evaluates to zero." [1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7593f4c5
    • Gustavo A. R. Silva's avatar
      btrfs: delayed-inode: Replace zero-length array with flexible-array member · 17b238ac
      Gustavo A. R. Silva authored
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array
      member[1][2], introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning in
      case the flexible array does not occur last in the structure, which will
      help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that, dynamic memory allocations won't be affected by this
      change:
      
       "Flexible array members have incomplete type, and so the sizeof operator
        may not be applied. As a quirk of the original implementation of
        zero-length arrays, sizeof evaluates to zero." [1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      17b238ac
    • Madhuparna Bhowmik's avatar
      btrfs: add RCU locks around block group initialization · 29566c9c
      Madhuparna Bhowmik authored
      The space_info list is normally RCU protected and should be traversed
      with rcu_read_lock held. There's a warning
      
        [29.104756] WARNING: suspicious RCU usage
        [29.105046] 5.6.0-rc4-next-20200305 #1 Not tainted
        [29.105231] -----------------------------
        [29.105401] fs/btrfs/block-group.c:2011 RCU-list traversed in non-reader section!!
      
      pointing out that the locking is missing in btrfs_read_block_groups.
      However this is not necessary as the list traversal happens at mount
      time when there's no other thread potentially accessing the list.
      
      To fix the warning and for consistency let's add the RCU lock/unlock,
      the code won't be affected much as it's doing some lightweight
      operations.
      Reported-by: default avatarGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarMadhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      29566c9c
    • Nikolay Borisov's avatar
      btrfs: Open code insert_extent_backref · 65cd6d9e
      Nikolay Borisov authored
      No need to add a level of indirection for hiding a simple 'if'. Open
      code insert_extent_backref in its sole caller. No functional changes.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65cd6d9e
    • Nikolay Borisov's avatar
      btrfs: Remove impossible BUG_ON in get_tree_block_key · c6600d9a
      Nikolay Borisov authored
      relocate_tree_blocks calls get_tree_block_key for a block iff that block
      has its ->key_ready equal false. Thus the BUG_ON in the latter function
      cannot ever be triggered so remove it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c6600d9a
    • David Sterba's avatar
      btrfs: balance: factor out convert profile validation · 5ba366c3
      David Sterba authored
      The validation follows the same steps for all three block group types,
      the existing helper validate_convert_profile can be enhanced and do more
      of the common things.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5ba366c3
    • David Sterba's avatar
      btrfs: return void from csum_tree_block · c67b3892
      David Sterba authored
      Now that csum_tree_block is not returning any errors, we can make
      csum_tree_block return void and simplify callers.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c67b3892
    • David Sterba's avatar
      btrfs: simplify tree block checksumming loop · e9be5a30
      David Sterba authored
      Thw whole point of csum_tree_block is to iterate over all extent buffer
      pages and pass it to checksumming functions. The bytes where checksum is
      stored must be skipped, thus map_private_extent_buffer. This complicates
      further offset calculations.
      
      As the first page will be always present, checksum the relevant bytes
      unconditionally and then do a simple iteration over the remaining pages.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e9be5a30
    • David Sterba's avatar
      btrfs: inline checksum name and driver definitions · 59a0fcdb
      David Sterba authored
      There's an unnecessary indirection in the checksum definition table,
      pointer and the string itself. The strings are short and the overall
      size of one entry is now 24 bytes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59a0fcdb
    • Nikolay Borisov's avatar
      btrfs: Rename __btrfs_alloc_chunk to btrfs_alloc_chunk · 11c67b1a
      Nikolay Borisov authored
      Having btrfs_alloc_chunk doesn't bring any value since it
      encapsulates a lockdep assert and a call to find_next_chunk. Simply
      rename the internal __btrfs_alloc_chunk function to the public one
      and remove it's 2nd parameter as all callers always pass the return
      value of find_next_chunk. Finally, migrate the call to
      lockdep_assert_held so as to not lose the check.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      11c67b1a
    • Josef Bacik's avatar
      btrfs: fix btrfs_calc_reclaim_metadata_size calculation · fa121a26
      Josef Bacik authored
      I noticed while running my snapshot torture test that we were getting a
      lot of metadata chunks allocated with very little actually used.
      Digging into this we would commit the transaction, still not have enough
      space, and then force a chunk allocation.
      
      I noticed that we were barely flushing any delalloc at all, despite the
      fact that we had around 13gib of outstanding delalloc reservations.  It
      turns out this is because of our btrfs_calc_reclaim_metadata_size()
      calculation.  It _only_ takes into account the outstanding ticket sizes,
      which isn't the whole story.  In this particular workload we're slowly
      filling up the disk, which means our overcommit space will suddenly
      become a lot less, and our outstanding reservations will be well more
      than what we can handle.  However we are only flushing based on our
      ticket size, which is much less than we need to actually reclaim.
      
      So fix btrfs_calc_reclaim_metadata_size() to take into account the
      overage in the case that we've gotten less available space suddenly.
      This makes it so we attempt to reclaim a lot more delalloc space, which
      allows us to make our reservations and we no longer are allocating a
      bunch of needless metadata chunks.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa121a26
    • Filipe Manana's avatar
      Btrfs: fix crash during unmount due to race with delayed inode workers · f0cc2cd7
      Filipe Manana authored
      During unmount we can have a job from the delayed inode items work queue
      still running, that can lead to at least two bad things:
      
      1) A crash, because the worker can try to create a transaction just
         after the fs roots were freed;
      
      2) A transaction leak, because the worker can create a transaction
         before the fs roots are freed and just after we committed the last
         transaction and after we stopped the transaction kthread.
      
      A stack trace example of the crash:
      
       [79011.691214] kernel BUG at lib/radix-tree.c:982!
       [79011.692056] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       [79011.693180] CPU: 3 PID: 1394 Comm: kworker/u8:2 Tainted: G        W         5.6.0-rc2-btrfs-next-54 #2
       (...)
       [79011.696789] Workqueue: btrfs-delayed-meta btrfs_work_helper [btrfs]
       [79011.697904] RIP: 0010:radix_tree_tag_set+0xe7/0x170
       (...)
       [79011.702014] RSP: 0018:ffffb3c84a317ca0 EFLAGS: 00010293
       [79011.702949] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
       [79011.704202] RDX: ffffb3c84a317cb0 RSI: ffffb3c84a317ca8 RDI: ffff8db3931340a0
       [79011.705463] RBP: 0000000000000005 R08: 0000000000000005 R09: ffffffff974629d0
       [79011.706756] R10: ffffb3c84a317bc0 R11: 0000000000000001 R12: ffff8db393134000
       [79011.708010] R13: ffff8db3931340a0 R14: ffff8db393134068 R15: 0000000000000001
       [79011.709270] FS:  0000000000000000(0000) GS:ffff8db3b6a00000(0000) knlGS:0000000000000000
       [79011.710699] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [79011.711710] CR2: 00007f22c2a0a000 CR3: 0000000232ad4005 CR4: 00000000003606e0
       [79011.712958] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [79011.714205] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [79011.715448] Call Trace:
       [79011.715925]  record_root_in_trans+0x72/0xf0 [btrfs]
       [79011.716819]  btrfs_record_root_in_trans+0x4b/0x70 [btrfs]
       [79011.717925]  start_transaction+0xdd/0x5c0 [btrfs]
       [79011.718829]  btrfs_async_run_delayed_root+0x17e/0x2b0 [btrfs]
       [79011.719915]  btrfs_work_helper+0xaa/0x720 [btrfs]
       [79011.720773]  process_one_work+0x26d/0x6a0
       [79011.721497]  worker_thread+0x4f/0x3e0
       [79011.722153]  ? process_one_work+0x6a0/0x6a0
       [79011.722901]  kthread+0x103/0x140
       [79011.723481]  ? kthread_create_worker_on_cpu+0x70/0x70
       [79011.724379]  ret_from_fork+0x3a/0x50
       (...)
      
      The following diagram shows a sequence of steps that lead to the crash
      during ummount of the filesystem:
      
              CPU 1                                             CPU 2                                CPU 3
      
       btrfs_punch_hole()
         btrfs_btree_balance_dirty()
           btrfs_balance_delayed_items()
             --> sees
                 fs_info->delayed_root->items
                 with value 200, which is greater
                 than
                 BTRFS_DELAYED_BACKGROUND (128)
                 and smaller than
                 BTRFS_DELAYED_WRITEBACK (512)
             btrfs_wq_run_delayed_node()
               --> queues a job for
                   fs_info->delayed_workers to run
                   btrfs_async_run_delayed_root()
      
                                                                                                  btrfs_async_run_delayed_root()
                                                                                                    --> job queued by CPU 1
      
                                                                                                    --> starts picking and running
                                                                                                        delayed nodes from the
                                                                                                        prepare_list list
      
                                                       close_ctree()
      
                                                         btrfs_delete_unused_bgs()
      
                                                         btrfs_commit_super()
      
                                                           btrfs_join_transaction()
                                                             --> gets transaction N
      
                                                           btrfs_commit_transaction(N)
                                                             --> set transaction state
                                                              to TRANTS_STATE_COMMIT_START
      
                                                                                                   btrfs_first_prepared_delayed_node()
                                                                                                     --> picks delayed node X through
                                                                                                         the prepared_list list
      
                                                             btrfs_run_delayed_items()
      
                                                               btrfs_first_delayed_node()
                                                                 --> also picks delayed node X
                                                                     but through the node_list
                                                                     list
      
                                                               __btrfs_commit_inode_delayed_items()
                                                                  --> runs all delayed items from
                                                                      this node and drops the
                                                                      node's item count to 0
                                                                      through call to
                                                                      btrfs_release_delayed_inode()
      
                                                               --> finishes running any remaining
                                                                   delayed nodes
      
                                                             --> finishes transaction commit
      
                                                         --> stops cleaner and transaction threads
      
                                                         btrfs_free_fs_roots()
                                                           --> frees all roots and removes them
                                                               from the radix tree
                                                               fs_info->fs_roots_radix
      
                                                                                                   btrfs_join_transaction()
                                                                                                     start_transaction()
                                                                                                       btrfs_record_root_in_trans()
                                                                                                         record_root_in_trans()
                                                                                                           radix_tree_tag_set()
                                                                                                             --> crashes because
                                                                                                                 the root is not in
                                                                                                                 the radix tree
                                                                                                                 anymore
      
      If the worker is able to call btrfs_join_transaction() before the unmount
      task frees the fs roots, we end up leaking a transaction and all its
      resources, since after the call to btrfs_commit_super() and stopping the
      transaction kthread, we don't expect to have any transaction open anymore.
      
      When this situation happens the worker has a delayed node that has no
      more items to run, since the task calling btrfs_run_delayed_items(),
      which is doing a transaction commit, picks the same node and runs all
      its items first.
      
      We can not wait for the worker to complete when running delayed items
      through btrfs_run_delayed_items(), because we call that function in
      several phases of a transaction commit, and that could cause a deadlock
      because the worker calls btrfs_join_transaction() and the task doing the
      transaction commit may have already set the transaction state to
      TRANS_STATE_COMMIT_DOING.
      
      Also it's not possible to get into a situation where only some of the
      items of a delayed node are added to the fs/subvolume tree in the current
      transaction and the remaining ones in the next transaction, because when
      running the items of a delayed inode we lock its mutex, effectively
      waiting for the worker if the worker is running the items of the delayed
      node already.
      
      Since this can only cause issues when unmounting a filesystem, fix it in
      a simple way by waiting for any jobs on the delayed workers queue before
      calling btrfs_commit_supper() at close_ctree(). This works because at this
      point no one can call btrfs_btree_balance_dirty() or
      btrfs_balance_delayed_items(), and if we end up waiting for any worker to
      complete, btrfs_commit_super() will commit the transaction created by the
      worker.
      
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f0cc2cd7
    • Naohiro Aota's avatar
      btrfs: factor out prepare_allocation() for extent allocation · 7e895409
      Naohiro Aota authored
      This function finally factor out prepare_allocation() form
      find_free_extent(). This function is called before the allocation loop
      and a specific allocator function like prepare_allocation_clustered()
      should initialize their private information and can set proper hint_byte
      to indicate where to start the allocation with.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7e895409
    • Naohiro Aota's avatar
      btrfs: skip LOOP_NO_EMPTY_SIZE if not clustered allocation · 45d8e033
      Naohiro Aota authored
      LOOP_NO_EMPTY_SIZE is solely dedicated for clustered allocation. So, we
      can skip this stage and give up the allocation.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      45d8e033
    • Naohiro Aota's avatar
      btrfs: factor out chunk_allocation_failed() for extent allocation · c70e2139
      Naohiro Aota authored
      Factor out chunk_allocation_failed() from
      find_free_extent_update_loop().  This function is called when it failed
      to allocate a chunk. The function can modify "ffe_ctl->loop" and return
      0 to continue with the next stage.  Or, it can return -ENOSPC to give up
      here.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c70e2139
    • Naohiro Aota's avatar
      btrfs: drop unnecessary arguments from find_free_extent_update_loop() · 15b7ee65
      Naohiro Aota authored
      Now that, we don't use last_ptr and use_cluster in the function. Drop
      these arguments from it.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      15b7ee65
    • Naohiro Aota's avatar
      btrfs: factor out found_extent() for extent allocation · 0ab9724b
      Naohiro Aota authored
      Factor out found_extent() from find_free_extent_update_loop(). This
      function is called when a proper extent is found and before returning
      from find_free_extent().  Hook functions like found_extent_clustered()
      should save information for a next allocation.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0ab9724b
    • Naohiro Aota's avatar
      btrfs: factor out release_block_group() · baba5062
      Naohiro Aota authored
      Factor out release_block_group() from find_free_extent(). This function
      is called when it gives up an allocation from a block group. Each
      allocation policy should reset its information for an allocation in
      the next block group.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      baba5062
    • Naohiro Aota's avatar
      btrfs: drop unnecessary arguments from clustered allocation functions · 897cae79
      Naohiro Aota authored
      Now that, find_free_extent_clustered() and find_free_extent_unclustered()
      can access "last_ptr" from the "clustered" variable, we can drop it from
      the arguments.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      897cae79
    • Naohiro Aota's avatar
      btrfs: factor out do_allocation() for extent allocation · c668690d
      Naohiro Aota authored
      Factor out do_allocation() from find_free_extent(). This function do an
      actual extent allocation in a given block group. The ffe_ctl->policy is
      used to determine the actual allocator function to use.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c668690d
    • Naohiro Aota's avatar
      btrfs: move variables for clustered allocation into find_free_extent_ctl · c10859be
      Naohiro Aota authored
      Move "last_ptr" and "use_cluster" into struct find_free_extent_ctl, so
      that hook functions for clustered allocator can use these variables.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c10859be
    • Naohiro Aota's avatar
      btrfs: move hint_byte into find_free_extent_ctl · ea544149
      Naohiro Aota authored
      This commit moves hint_byte into find_free_extent_ctl, so that we can
      modify the hint_byte in the other functions. This will help us split
      find_free_extent further. This commit also renames the function argument
      "hint_byte" to "hint_byte_orig" to avoid misuse.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea544149
    • Naohiro Aota's avatar
      btrfs: introduce extent allocation policy · cb2f96f8
      Naohiro Aota authored
      This commit introduces extent allocation policy for btrfs. This policy
      controls how btrfs allocate an extents from block groups.  There is no
      functional change introduced with this commit.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb2f96f8
    • Naohiro Aota's avatar
      btrfs: parameterize dev_extent_min for chunk allocation · 6aafb303
      Naohiro Aota authored
      Currently, we ignore a device whose available space is less than
      "BTRFS_STRIPE_LEN * dev_stripes". This is a lower limit for current
      allocation policy (to maximize the number of stripes). This commit
      parameterizes dev_extent_min, so that other policies can set their own
      lower limitat to ignore a device with insufficient space.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6aafb303
    • Naohiro Aota's avatar
      btrfs: factor out create_chunk() · dce580ca
      Naohiro Aota authored
      Factor out create_chunk() from __btrfs_alloc_chunk(). This function
      finally creates a chunk. There is no functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dce580ca