1. 23 Aug, 2022 3 commits
    • Anand Jain's avatar
      btrfs: replace: drop assert for suspended replace · 59a39919
      Anand Jain authored
      If the filesystem mounts with the replace-operation in a suspended state
      and try to cancel the suspended replace-operation, we hit the assert. The
      assert came from the commit fe97e2e1 ("btrfs: dev-replace: replace's
      scrub must not be running in suspended state") that was actually not
      required. So just remove it.
      
       $ mount /dev/sda5 /btrfs
      
          BTRFS info (device sda5): cannot continue dev_replace, tgtdev is missing
          BTRFS info (device sda5): you may cancel the operation after 'mount -o degraded'
      
       $ mount -o degraded /dev/sda5 /btrfs <-- success.
      
       $ btrfs replace cancel /btrfs
      
          kernel: assertion failed: ret != -ENOTCONN, in fs/btrfs/dev-replace.c:1131
          kernel: ------------[ cut here ]------------
          kernel: kernel BUG at fs/btrfs/ctree.h:3750!
      
      After the patch:
      
       $ btrfs replace cancel /btrfs
      
          BTRFS info (device sda5): suspended dev_replace from /dev/sda5 (devid 1) to <missing disk> canceled
      
      Fixes: fe97e2e1 ("btrfs: dev-replace: replace's scrub must not be running in suspended state")
      CC: stable@vger.kernel.org # 5.0+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59a39919
    • Filipe Manana's avatar
      btrfs: fix silent failure when deleting root reference · 47bf225a
      Filipe Manana authored
      At btrfs_del_root_ref(), if btrfs_search_slot() returns an error, we end
      up returning from the function with a value of 0 (success). This happens
      because the function returns the value stored in the variable 'err',
      which is 0, while the error value we got from btrfs_search_slot() is
      stored in the 'ret' variable.
      
      So fix it by setting 'err' with the error value.
      
      Fixes: 8289ed9f ("btrfs: replace the BUG_ON in btrfs_del_root_ref with proper error handling")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47bf225a
    • Omar Sandoval's avatar
      btrfs: fix space cache corruption and potential double allocations · ced8ecf0
      Omar Sandoval authored
      When testing space_cache v2 on a large set of machines, we encountered a
      few symptoms:
      
      1. "unable to add free space :-17" (EEXIST) errors.
      2. Missing free space info items, sometimes caught with a "missing free
         space info for X" error.
      3. Double-accounted space: ranges that were allocated in the extent tree
         and also marked as free in the free space tree, ranges that were
         marked as allocated twice in the extent tree, or ranges that were
         marked as free twice in the free space tree. If the latter made it
         onto disk, the next reboot would hit the BUG_ON() in
         add_new_free_space().
      4. On some hosts with no on-disk corruption or error messages, the
         in-memory space cache (dumped with drgn) disagreed with the free
         space tree.
      
      All of these symptoms have the same underlying cause: a race between
      caching the free space for a block group and returning free space to the
      in-memory space cache for pinned extents causes us to double-add a free
      range to the space cache. This race exists when free space is cached
      from the free space tree (space_cache=v2) or the extent tree
      (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
      struct btrfs_block_group::last_byte_to_unpin and struct
      btrfs_block_group::progress are supposed to protect against this race,
      but commit d0c2f4fa ("btrfs: make concurrent fsyncs wait less when
      waiting for a transaction commit") subtly broke this by allowing
      multiple transactions to be unpinning extents at the same time.
      
      Specifically, the race is as follows:
      
      1. An extent is deleted from an uncached block group in transaction A.
      2. btrfs_commit_transaction() is called for transaction A.
      3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
         ref for the deleted extent.
      4. __btrfs_free_extent() -> do_free_extent_accounting() ->
         add_to_free_space_tree() adds the deleted extent back to the free
         space tree.
      5. do_free_extent_accounting() -> btrfs_update_block_group() ->
         btrfs_cache_block_group() queues up the block group to get cached.
         block_group->progress is set to block_group->start.
      6. btrfs_commit_transaction() for transaction A calls
         switch_commit_roots(). It sets block_group->last_byte_to_unpin to
         block_group->progress, which is block_group->start because the block
         group hasn't been cached yet.
      7. The caching thread gets to our block group. Since the commit roots
         were already switched, load_free_space_tree() sees the deleted extent
         as free and adds it to the space cache. It finishes caching and sets
         block_group->progress to U64_MAX.
      8. btrfs_commit_transaction() advances transaction A to
         TRANS_STATE_SUPER_COMMITTED.
      9. fsync calls btrfs_commit_transaction() for transaction B. Since
         transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
         commit is for fsync, it advances.
      10. btrfs_commit_transaction() for transaction B calls
          switch_commit_roots(). This time, the block group has already been
          cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
      11. btrfs_commit_transaction() for transaction A calls
          btrfs_finish_extent_commit(), which calls unpin_extent_range() for
          the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
          transaction B!), so it adds the deleted extent to the space cache
          again!
      
      This explains all of our symptoms above:
      
      * If the sequence of events is exactly as described above, when the free
        space is re-added in step 11, it will fail with EEXIST.
      * If another thread reallocates the deleted extent in between steps 7
        and 11, then step 11 will silently re-add that space to the space
        cache as free even though it is actually allocated. Then, if that
        space is allocated *again*, the free space tree will be corrupted
        (namely, the wrong item will be deleted).
      * If we don't catch this free space tree corruption, it will continue
        to get worse as extents are deleted and reallocated.
      
      The v1 space_cache is synchronously loaded when an extent is deleted
      (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
      with load_cache_only=1), so it is not normally affected by this bug.
      However, as noted above, if we fail to load the space cache, we will
      fall back to caching from the extent tree and may hit this bug.
      
      The easiest fix for this race is to also make caching from the free
      space tree or extent tree synchronous. Josef tested this and found no
      performance regressions.
      
      A few extra changes fall out of this change. Namely, this fix does the
      following, with step 2 being the crucial fix:
      
      1. Factor btrfs_caching_ctl_wait_done() out of
         btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
         that we already hold a reference to.
      2. Change the call in btrfs_cache_block_group() of
         btrfs_wait_space_cache_v1_finished() to
         btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
         space_cache option.
      3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
         space_cache_v1_done().
      4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
         `bool wait` to more accurately describe its new meaning.
      5. Change a few callers which had a separate call to
         btrfs_wait_block_group_cache_done() to use wait = true instead.
      6. Make btrfs_wait_block_group_cache_done() static now that it's not
         used outside of block-group.c anymore.
      
      Fixes: d0c2f4fa ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
      CC: stable@vger.kernel.org # 5.12+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ced8ecf0
  2. 22 Aug, 2022 5 commits
    • Josef Bacik's avatar
      btrfs: don't allow large NOWAIT direct reads · 79d3d1d1
      Josef Bacik authored
      Dylan and Jens reported a problem where they had an io_uring test that
      was returning short reads, and bisected it to ee5b46a3 ("btrfs:
      increase direct io read size limit to 256 sectors").
      
      The root cause is their test was doing larger reads via io_uring with
      NOWAIT and async.  This was triggering a page fault during the direct
      read, however the first page was able to work just fine and thus we
      submitted a 4k read for a larger iocb.
      
      Btrfs allows for partial IO's in this case specifically because we don't
      allow page faults, and thus we'll attempt to do any io that we can,
      submit what we could, come back and fault in the rest of the range and
      try to do the remaining IO.
      
      However for !is_sync_kiocb() we'll call ->ki_complete() as soon as the
      partial dio is done, which is incorrect.  In the sync case we can exit
      the iomap code, submit more io's, and return with the amount of IO we
      were able to complete successfully.
      
      We were always doing short reads in this case, but for NOWAIT we were
      getting saved by the fact that we were limiting direct reads to
      sectorsize, and if we were larger than that we would return EAGAIN.
      
      Fix the regression by simply returning EAGAIN in the NOWAIT case with
      larger reads, that way io_uring can retry and get the larger IO and have
      the fault logic handle everything properly.
      
      This still leaves the AIO short read case, but that existed before this
      change.  The way to properly fix this would be to handle partial iocb
      completions, but that's a lot of work, for now deal with the regression
      in the most straightforward way possible.
      Reported-by: default avatarDylan Yudaken <dylany@fb.com>
      Fixes: ee5b46a3 ("btrfs: increase direct io read size limit to 256 sectors")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79d3d1d1
    • Qu Wenruo's avatar
      btrfs: don't merge pages into bio if their page offset is not contiguous · 4a445b7b
      Qu Wenruo authored
      [BUG]
      Zygo reported on latest development branch, he could hit
      ASSERT()/BUG_ON() caused crash when doing RAID5 recovery (intentionally
      corrupt one disk, and let btrfs to recover the data during read/scrub).
      
      And The following minimal reproducer can cause extent state leakage at
      rmmod time:
      
        mkfs.btrfs -f -d raid5 -m raid5 $dev1 $dev2 $dev3 -b 1G > /dev/null
        mount $dev1 $mnt
        fsstress -w -d $mnt -n 25 -s 1660807876
        sync
        fssum -A -f -w /tmp/fssum.saved $mnt
        umount $mnt
      
        # Wipe the dev1 but keeps its super block
        xfs_io -c "pwrite -S 0x0 1m 1023m" $dev1
        mount $dev1 $mnt
        fssum -r /tmp/fssum.saved $mnt > /dev/null
        umount $mnt
        rmmod btrfs
      
      This will lead to the following extent states leakage:
      
        BTRFS: state leak: start 499712 end 503807 state 5 in tree 1 refs 1
        BTRFS: state leak: start 495616 end 499711 state 5 in tree 1 refs 1
        BTRFS: state leak: start 491520 end 495615 state 5 in tree 1 refs 1
        BTRFS: state leak: start 487424 end 491519 state 5 in tree 1 refs 1
        BTRFS: state leak: start 483328 end 487423 state 5 in tree 1 refs 1
        BTRFS: state leak: start 479232 end 483327 state 5 in tree 1 refs 1
        BTRFS: state leak: start 475136 end 479231 state 5 in tree 1 refs 1
        BTRFS: state leak: start 471040 end 475135 state 5 in tree 1 refs 1
      
      [CAUSE]
      Since commit 7aa51232 ("btrfs: pass a btrfs_bio to
      btrfs_repair_one_sector"), we always use btrfs_bio->file_offset to
      determine the file offset of a page.
      
      But that usage assume that, one bio has all its page having a continuous
      page offsets.
      
      Unfortunately that's not true, btrfs only requires the logical bytenr
      contiguous when assembling its bios.
      
      From above script, we have one bio looks like this:
      
        fssum-27671  submit_one_bio: bio logical=217739264 len=36864
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=466944 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=724992 <<<
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=729088
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=733184
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=737280
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=741376
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=745472
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=749568
        fssum-27671  submit_one_bio:   r/i=5/261 page_offset=753664
      
      Note that the 1st and the 2nd page has non-contiguous page offsets.
      
      This means, at repair time, we will have completely wrong file offset
      passed in:
      
         kworker/u32:2-19927  btrfs_repair_one_sector: r/i=5/261 page_off=729088 file_off=475136 bio_offset=8192
      
      Since the file offset is incorrect, we latter incorrectly set the extent
      states, and no way to really release them.
      
      Thus later it causes the leakage.
      
      In fact, this can be even worse, since the file offset is incorrect, we
      can hit cases like the incorrect file offset belongs to a HOLE, and
      later cause btrfs_num_copies() to trigger error, finally hit
      BUG_ON()/ASSERT() later.
      
      [FIX]
      Add an extra condition in btrfs_bio_add_page() for uncompressed IO.
      
      Now we will have more strict requirement for bio pages:
      
      - They should all have the same mapping
        (the mapping check is already implied by the call chain)
      
      - Their logical bytenr should be adjacent
        This is the same as the old condition.
      
      - Their page_offset() (file offset) should be adjacent
        This is the new check.
        This would result a slightly increased amount of bios from btrfs
        (needs holes and inside the same stripe boundary to trigger).
      
        But this would greatly reduce the confusion, as it's pretty common
        to assume a btrfs bio would only contain continuous page cache.
      
      Later we may need extra cleanups, as we no longer needs to handle gaps
      between page offsets in endio functions.
      
      Currently this should be the minimal patch to fix commit 7aa51232
      ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector").
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Fixes: 7aa51232 ("btrfs: pass a btrfs_bio to btrfs_repair_one_sector")
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4a445b7b
    • Filipe Manana's avatar
      btrfs: update generation of hole file extent item when merging holes · e6e3dec6
      Filipe Manana authored
      When punching a hole into a file range that is adjacent with a hole and we
      are not using the no-holes feature, we expand the range of the adjacent
      file extent item that represents a hole, to save metadata space.
      
      However we don't update the generation of hole file extent item, which
      means a full fsync will not log that file extent item if the fsync happens
      in a later transaction (since commit 7f30c072 ("btrfs: stop copying
      old file extents when doing a full fsync")).
      
      For example, if we do this:
      
          $ mkfs.btrfs -f -O ^no-holes /dev/sdb
          $ mount /dev/sdb /mnt
          $ xfs_io -f -c "pwrite -S 0xab 2M 2M" /mnt/foobar
          $ sync
      
      We end up with 2 file extent items in our file:
      
      1) One that represents the hole for the file range [0, 2M), with a
         generation of 7;
      
      2) Another one that represents an extent covering the range [2M, 4M).
      
      After that if we do the following:
      
          $ xfs_io -c "fpunch 2M 2M" /mnt/foobar
      
      We end up with a single file extent item in the file, which represents a
      hole for the range [0, 4M) and with a generation of 7 - because we end
      dropping the data extent for range [2M, 4M) and then update the file
      extent item that represented the hole at [0, 2M), by increasing
      length from 2M to 4M.
      
      Then doing a full fsync and power failing:
      
          $ xfs_io -c "fsync" /mnt/foobar
          <power failure>
      
      will result in the full fsync not logging the file extent item that
      represents the hole for the range [0, 4M), because its generation is 7,
      which is lower than the generation of the current transaction (8).
      As a consequence, after mounting again the filesystem (after log replay),
      the region [2M, 4M) does not have a hole, it still points to the
      previous data extent.
      
      So fix this by always updating the generation of existing file extent
      items representing holes when we merge/expand them. This solves the
      problem and it's the same approach as when we merge prealloc extents that
      got written (at btrfs_mark_extent_written()). Setting the generation to
      the current transaction's generation is also what we do when merging
      the new hole extent map with the previous one or the next one.
      
      A test case for fstests, covering both cases of hole file extent item
      merging (to the left and to the right), will be sent soon.
      
      Fixes: 7f30c072 ("btrfs: stop copying old file extents when doing a full fsync")
      CC: stable@vger.kernel.org # 5.18+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e6e3dec6
    • Zixuan Fu's avatar
      btrfs: fix possible memory leak in btrfs_get_dev_args_from_path() · 9ea0106a
      Zixuan Fu authored
      In btrfs_get_dev_args_from_path(), btrfs_get_bdev_and_sb() can fail if
      the path is invalid. In this case, btrfs_get_dev_args_from_path()
      returns directly without freeing args->uuid and args->fsid allocated
      before, which causes memory leak.
      
      To fix these possible leaks, when btrfs_get_bdev_and_sb() fails,
      btrfs_put_dev_args_from_path() is called to clean up the memory.
      Reported-by: default avatarTOTE Robot <oslab@tsinghua.edu.cn>
      Fixes: faa775c4 ("btrfs: add a btrfs_get_dev_args_from_path helper")
      CC: stable@vger.kernel.org # 5.16
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarZixuan Fu <r33s3n6@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9ea0106a
    • Goldwyn Rodrigues's avatar
      btrfs: check if root is readonly while setting security xattr · b5111127
      Goldwyn Rodrigues authored
      For a filesystem which has btrfs read-only property set to true, all
      write operations including xattr should be denied. However, security
      xattr can still be changed even if btrfs ro property is true.
      
      This happens because xattr_permission() does not have any restrictions
      on security.*, system.*  and in some cases trusted.* from VFS and
      the decision is left to the underlying filesystem. See comments in
      xattr_permission() for more details.
      
      This patch checks if the root is read-only before performing the set
      xattr operation.
      
      Testcase:
      
        DEV=/dev/vdb
        MNT=/mnt
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
        echo "file one" > $MNT/f1
      
        setfattr -n "security.one" -v 2 $MNT/f1
        btrfs property set /mnt ro true
      
        setfattr -n "security.one" -v 1 $MNT/f1
      
        umount $MNT
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b5111127
  3. 17 Aug, 2022 6 commits
    • Josef Bacik's avatar
      btrfs: tree-checker: check for overlapping extent items · 899b7f69
      Josef Bacik authored
      We're seeing a weird problem in production where we have overlapping
      extent items in the extent tree.  It's unclear where these are coming
      from, and in debugging we realized there's no check in the tree checker
      for this sort of problem.  Add a check to the tree-checker to make sure
      that the extents do not overlap each other.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      899b7f69
    • Filipe Manana's avatar
      btrfs: fix warning during log replay when bumping inode link count · 769030e1
      Filipe Manana authored
      During log replay, at add_link(), we may increment the link count of
      another inode that has a reference that conflicts with a new reference
      for the inode currently being processed.
      
      During log replay, at add_link(), we may drop (unlink) a reference from
      some inode in the subvolume tree if that reference conflicts with a new
      reference found in the log for the inode we are currently processing.
      
      After the unlink, If the link count has decreased from 1 to 0, then we
      increment the link count to prevent the inode from being deleted if it's
      evicted by an iput() call, because we may have references to add to that
      inode later on (and we will fixup its link count later during log replay).
      
      However incrementing the link count from 0 to 1 triggers a warning:
      
        $ cat fs/inode.c
        (...)
        void inc_nlink(struct inode *inode)
        {
              if (unlikely(inode->i_nlink == 0)) {
                       WARN_ON(!(inode->i_state & I_LINKABLE));
                       atomic_long_dec(&inode->i_sb->s_remove_count);
              }
        (...)
      
      The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
      never set during log replay.
      
      Most of the time, the warning isn't triggered even if we dropped the last
      reference of the conflicting inode, and this is because:
      
      1) The conflicting inode was previously marked for fixup, through a call
         to link_to_fixup_dir(), which increments the inode's link count;
      
      2) And the last iput() on the inode has not triggered eviction of the
         inode, nor was eviction triggered after the iput(). So at add_link(),
         even if we unlink the last reference of the inode, its link count ends
         up being 1 and not 0.
      
      So this means that if eviction is triggered after link_to_fixup_dir() is
      called, at add_link() we will read the inode back from the subvolume tree
      and have it with a correct link count, matching the number of references
      it has on the subvolume tree. So if when we are at add_link() the inode
      has exactly one reference only, its link count is 1, and after the unlink
      its link count becomes 0.
      
      So fix this by using set_nlink() instead of inc_nlink(), as the former
      accepts a transition from 0 to 1 and it's what we use in other similar
      contexts (like at link_to_fixup_dir().
      
      Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
      bump the link count from 0 to 1.
      
      The warning is actually harmless, but it may scare users. Josef also ran
      into it recently.
      
      CC: stable@vger.kernel.org # 5.1+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      769030e1
    • Filipe Manana's avatar
      btrfs: fix lost error handling when looking up extended ref on log replay · 7a6b75b7
      Filipe Manana authored
      During log replay, when processing inode references, if we get an error
      when looking up for an extended reference at __add_inode_ref(), we ignore
      it and proceed, returning success (0) if no other error happens after the
      lookup. This is obviously wrong because in case an extended reference
      exists and it encodes some name not in the log, we need to unlink it,
      otherwise the filesystem state will not match the state it had after the
      last fsync.
      
      So just make __add_inode_ref() return an error it gets from the extended
      reference lookup.
      
      Fixes: f186373f ("btrfs: extended inode refs")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7a6b75b7
    • Josef Bacik's avatar
      btrfs: fix lockdep splat with reloc root extent buffers · b40130b2
      Josef Bacik authored
      We have been hitting the following lockdep splat with btrfs/187 recently
      
        WARNING: possible circular locking dependency detected
        5.19.0-rc8+ #775 Not tainted
        ------------------------------------------------------
        btrfs/752500 is trying to acquire lock:
        ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        but task is already holding lock:
        ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_init_new_buffer+0x7d/0x2c0
      	 btrfs_alloc_tree_block+0x120/0x3b0
      	 __btrfs_cow_block+0x136/0x600
      	 btrfs_cow_block+0x10b/0x230
      	 btrfs_search_slot+0x53b/0xb70
      	 btrfs_lookup_inode+0x2a/0xa0
      	 __btrfs_update_delayed_inode+0x5f/0x280
      	 btrfs_async_run_delayed_root+0x24c/0x290
      	 btrfs_work_helper+0xf2/0x3e0
      	 process_one_work+0x271/0x590
      	 worker_thread+0x52/0x3b0
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x1f/0x30
      
        -> #1 (btrfs-tree-01){++++}-{3:3}:
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_search_slot+0x3c3/0xb70
      	 do_relocation+0x10c/0x6b0
      	 relocate_tree_blocks+0x317/0x6d0
      	 relocate_block_group+0x1f1/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
      	 __lock_acquire+0x1122/0x1e10
      	 lock_acquire+0xc2/0x2d0
      	 down_write_nested+0x41/0x80
      	 __btrfs_tree_lock+0x24/0x110
      	 btrfs_lock_root_node+0x31/0x50
      	 btrfs_search_slot+0x1cb/0xb70
      	 replace_path+0x541/0x9f0
      	 merge_reloc_root+0x1d6/0x610
      	 merge_reloc_roots+0xe2/0x260
      	 relocate_block_group+0x2c8/0x560
      	 btrfs_relocate_block_group+0x23e/0x400
      	 btrfs_relocate_chunk+0x4c/0x140
      	 btrfs_balance+0x755/0xe40
      	 btrfs_ioctl+0x1ea2/0x2c90
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
        other info that might help us debug this:
      
        Chain exists of:
          btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(btrfs-tree-01/1);
      				 lock(btrfs-tree-01);
      				 lock(btrfs-tree-01/1);
          lock(btrfs-treloc-02#2);
      
         *** DEADLOCK ***
      
        7 locks held by btrfs/752500:
         #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
         #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
         #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
         #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
         #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
         #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110
      
        stack backtrace:
        CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
      
         dump_stack_lvl+0x56/0x73
         check_noncircular+0xd6/0x100
         ? lock_is_held_type+0xe2/0x140
         __lock_acquire+0x1122/0x1e10
         lock_acquire+0xc2/0x2d0
         ? __btrfs_tree_lock+0x24/0x110
         down_write_nested+0x41/0x80
         ? __btrfs_tree_lock+0x24/0x110
         __btrfs_tree_lock+0x24/0x110
         btrfs_lock_root_node+0x31/0x50
         btrfs_search_slot+0x1cb/0xb70
         ? lock_release+0x137/0x2d0
         ? _raw_spin_unlock+0x29/0x50
         ? release_extent_buffer+0x128/0x180
         replace_path+0x541/0x9f0
         merge_reloc_root+0x1d6/0x610
         merge_reloc_roots+0xe2/0x260
         relocate_block_group+0x2c8/0x560
         btrfs_relocate_block_group+0x23e/0x400
         btrfs_relocate_chunk+0x4c/0x140
         btrfs_balance+0x755/0xe40
         btrfs_ioctl+0x1ea2/0x2c90
         ? lock_is_held_type+0xe2/0x140
         ? lock_is_held_type+0xe2/0x140
         ? __x64_sys_ioctl+0x88/0xc0
         __x64_sys_ioctl+0x88/0xc0
         do_syscall_64+0x38/0x90
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      This isn't necessarily new, it's just tricky to hit in practice.  There
      are two competing things going on here.  With relocation we create a
      snapshot of every fs tree with a reloc tree.  Any extent buffers that
      get initialized here are initialized with the reloc root lockdep key.
      However since it is a snapshot, any blocks that are currently in cache
      that originally belonged to the fs tree will have the normal tree
      lockdep key set.  This creates the lock dependency of
      
        reloc tree -> normal tree
      
      for the extent buffer locking during the first phase of the relocation
      as we walk down the reloc root to relocate blocks.
      
      However this is problematic because the final phase of the relocation is
      merging the reloc root into the original fs root.  This involves
      searching down to any keys that exist in the original fs root and then
      swapping the relocated block and the original fs root block.  We have to
      search down to the fs root first, and then go search the reloc root for
      the block we need to replace.  This creates the dependency of
      
        normal tree -> reloc tree
      
      which is why lockdep complains.
      
      Additionally even if we were to fix this particular mismatch with a
      different nesting for the merge case, we're still slotting in a block
      that has a owner of the reloc root objectid into a normal tree, so that
      block will have its lockdep key set to the tree reloc root, and create a
      lockdep splat later on when we wander into that block from the fs root.
      
      Unfortunately the only solution here is to make sure we do not set the
      lockdep key to the reloc tree lockdep key normally, and then reset any
      blocks we wander into from the reloc root when we're doing the merged.
      
      This solves the problem of having mixed tree reloc keys intermixed with
      normal tree keys, and then allows us to make sure in the merge case we
      maintain the lock order of
      
        normal tree -> reloc tree
      
      We handle this by setting a bit on the reloc root when we do the search
      for the block we want to relocate, and any block we search into or COW
      at that point gets set to the reloc tree key.  This works correctly
      because we only ever COW down to the parent node, so we aren't resetting
      the key for the block we're linking into the fs root.
      
      With this patch we no longer have the lockdep splat in btrfs/187.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b40130b2
    • Josef Bacik's avatar
      btrfs: move lockdep class helpers to locking.c · 0a27a047
      Josef Bacik authored
      These definitions exist in disk-io.c, which is not related to the
      locking.  Move this over to locking.h/c where it makes more sense.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0a27a047
    • Zixuan Fu's avatar
      btrfs: unset reloc control if transaction commit fails in prepare_to_relocate() · 85f02d6c
      Zixuan Fu authored
      In btrfs_relocate_block_group(), the rc is allocated.  Then
      btrfs_relocate_block_group() calls
      
      relocate_block_group()
        prepare_to_relocate()
          set_reloc_control()
      
      that assigns rc to the variable fs_info->reloc_ctl. When
      prepare_to_relocate() returns, it calls
      
      btrfs_commit_transaction()
        btrfs_start_dirty_block_groups()
          btrfs_alloc_path()
            kmem_cache_zalloc()
      
      which may fail for example (or other errors could happen). When the
      failure occurs, btrfs_relocate_block_group() detects the error and frees
      rc and doesn't set fs_info->reloc_ctl to NULL. After that, in
      btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
      then used, which may cause a use-after-free bug.
      
      This possible bug can be triggered by calling btrfs_ioctl_balance()
      before calling btrfs_ioctl_defrag().
      
      To fix this possible bug, in prepare_to_relocate(), check if
      btrfs_commit_transaction() fails. If the failure occurs,
      unset_reloc_control() is called to set fs_info->reloc_ctl to NULL.
      
      The error log in our fault-injection testing is shown as follows:
      
        [   58.751070] BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        ...
        [   58.753577] Call Trace:
        ...
        [   58.755800]  kasan_report+0x45/0x60
        [   58.756066]  btrfs_init_reloc_root+0x7ca/0x920 [btrfs]
        [   58.757304]  record_root_in_trans+0x792/0xa10 [btrfs]
        [   58.757748]  btrfs_record_root_in_trans+0x463/0x4f0 [btrfs]
        [   58.758231]  start_transaction+0x896/0x2950 [btrfs]
        [   58.758661]  btrfs_defrag_root+0x250/0xc00 [btrfs]
        [   58.759083]  btrfs_ioctl_defrag+0x467/0xa00 [btrfs]
        [   58.759513]  btrfs_ioctl+0x3c95/0x114e0 [btrfs]
        ...
        [   58.768510] Allocated by task 23683:
        [   58.768777]  ____kasan_kmalloc+0xb5/0xf0
        [   58.769069]  __kmalloc+0x227/0x3d0
        [   58.769325]  alloc_reloc_control+0x10a/0x3d0 [btrfs]
        [   58.769755]  btrfs_relocate_block_group+0x7aa/0x1e20 [btrfs]
        [   58.770228]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.770655]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.771071]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.771472]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.771902]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
        [   58.773337] Freed by task 23683:
        ...
        [   58.774815]  kfree+0xda/0x2b0
        [   58.775038]  free_reloc_control+0x1d6/0x220 [btrfs]
        [   58.775465]  btrfs_relocate_block_group+0x115c/0x1e20 [btrfs]
        [   58.775944]  btrfs_relocate_chunk+0xf1/0x760 [btrfs]
        [   58.776369]  __btrfs_balance+0x1326/0x1f10 [btrfs]
        [   58.776784]  btrfs_balance+0x3150/0x3d30 [btrfs]
        [   58.777185]  btrfs_ioctl_balance+0xd84/0x1410 [btrfs]
        [   58.777621]  btrfs_ioctl+0x4caa/0x114e0 [btrfs]
        ...
      Reported-by: default avatarTOTE Robot <oslab@tsinghua.edu.cn>
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarZixuan Fu <r33s3n6@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      85f02d6c
  4. 27 Jul, 2022 1 commit
    • Josef Bacik's avatar
      btrfs: reset RO counter on block group if we fail to relocate · 74944c87
      Josef Bacik authored
      With the automatic block group reclaim code we will preemptively try to
      mark the block group RO before we start the relocation.  We do this to
      make sure we should actually try to relocate the block group.
      
      However if we hit an error during the actual relocation we won't clean
      up our RO counter and the block group will remain RO.  This was observed
      internally with file systems reporting less space available from df when
      we had failed background relocations.
      
      Fix this by doing the dec_ro in the error case.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74944c87
  5. 25 Jul, 2022 25 commits
    • Christoph Hellwig's avatar
      btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read · 0b078d9d
      Christoph Hellwig authored
      This flag was used to communicate that the low-level compression code
      already did verify the checksum to the high-level I/O completion code.
      
      But it has been unused for a long time as the upper btrfs_bio for the
      decompressed data had a NULL csum pointer basically since that pointer
      existed and the code already checks for that a little later.
      
      Note that this does not affect the other use of the checked flag, which
      is only used for the COW fixup worker.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0b078d9d
    • Christoph Hellwig's avatar
      btrfs: fix repair of compressed extents · 81bd9328
      Christoph Hellwig authored
      Currently the checksum of compressed extents is verified based on the
      compressed data and the lower btrfs_bio, but the actual repair process
      is driven by end_bio_extent_readpage on the upper btrfs_bio for the
      decompressed data.
      
      This has a bunch of issues, including not being able to properly
      communicate the failed mirror up in case that the I/O submission got
      preempted, a general loss of if an error was an I/O error or a checksum
      verification failure, but most importantly that this design causes
      btrfs_clean_io_failure to eventually write back the uncompressed good
      data onto the disk sectors that are supposed to contain compressed data.
      
      Fix this by moving the repair to the lower btrfs_bio.  To do so, a fair
      amount of code has to be reshuffled:
      
       a) the lower btrfs_bio now needs a valid csum pointer.  The easiest way
          to achieve that is to pass NULL btrfs_lookup_bio_sums and just use
          the btrfs_bio management of csums.  For a compressed_bio that is
          split into multiple btrfs_bios this means additional memory
          allocations, but the code becomes a lot more regular.
       b) checksum verification now runs directly on the lower btrfs_bio instead
          of the compressed_bio.  This actually nicely simplifies the end I/O
          processing.
       c) btrfs_repair_one_sector can't just look up the logical address for
          the file offset any more, as there is no corresponding relative
          offsets that apply to the file offset and the logic address for
          compressed extents.  Instead require that the saved bvec_iter in the
          btrfs_bio is filled out for all read bios and use that, which again
          removes a fair amount of code.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      81bd9328
    • Christoph Hellwig's avatar
      btrfs: remove the start argument to check_data_csum and export · 7959bd44
      Christoph Hellwig authored
      Derive the value of start from the btrfs_bio now that ->file_offset is
      always valid.  Also export and rename the function so it's available
      outside of inode.c as we'll need that soon.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7959bd44
    • Christoph Hellwig's avatar
      btrfs: pass a btrfs_bio to btrfs_repair_one_sector · 7aa51232
      Christoph Hellwig authored
      Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
      and remove the start and failed_mirror arguments in favor of deriving
      them from the btrfs_bio.  For this to work ensure that the file_offset
      field is also initialized for buffered I/O.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7aa51232
    • Christoph Hellwig's avatar
      btrfs: simplify the pending I/O counting in struct compressed_bio · 524bcd1e
      Christoph Hellwig authored
      Instead of counting the sectors just count the bios, with an extra
      reference held during submission.  This significantly simplifies the
      submission side error handling.
      
      This slightly changes completion and error handling of
      btrfs_submit_compressed_{read,write} because with the old code the
      compressed_bio could have been completed in
      submit_compressed_{read,write} only if there was an error during
      submission for one of the lower bio, whilst with the new code there is a
      chance for this to happen even for successful submission if the all the
      lower bios complete before the end of the function is reached.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      524bcd1e
    • Christoph Hellwig's avatar
      btrfs: repair all known bad mirrors · c144c63f
      Christoph Hellwig authored
      When there is more than a single level of redundancy there can also be
      multiple bad mirrors, and the current read repair code only repairs the
      last bad one.
      
      Restructure btrfs_repair_one_sector so that it records the originally
      failed mirror and the number of copies, and then repair all known bad
      copies until we reach the originally failed copy in clean_io_failure.
      Note that this also means the read repair reads will always start from
      the next bad mirror and not mirror 0.
      
      This fixes btrfs/265 in xfstests.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c144c63f
    • Christoph Hellwig's avatar
      btrfs: merge btrfs_dev_stat_print_on_error with its only caller · d28beb3e
      Christoph Hellwig authored
      Fold it into the only caller.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d28beb3e
    • Filipe Manana's avatar
      btrfs: join running log transaction when logging new name · 723df2bc
      Filipe Manana authored
      When logging a new name, in case of a rename, we pin the log before
      changing it. We then either delete a directory entry from the log or
      insert a key range item to mark the old name for deletion on log replay.
      
      However when doing one of those log changes we may have another task that
      started writing out the log (at btrfs_sync_log()) and it started before
      we pinned the log root. So we may end up changing a log tree while its
      writeback is being started by another task syncing the log. This can lead
      to inconsistencies in a log tree and other unexpected results during log
      replay, because we can get some committed node pointing to a node/leaf
      that ends up not getting written to disk before the next log commit.
      
      The problem, conceptually, started to happen in commit 88d2beec
      ("btrfs: avoid logging all directory changes during renames"), because
      there we started to update the log without joining its current transaction
      first.
      
      However the problem only became visible with commit 259c4b96
      ("btrfs: stop doing unnecessary log updates during a rename"), and that is
      because we used to pin the log at btrfs_rename() and then before entering
      btrfs_log_new_name(), when unlinking the old dentry, we ended up at
      btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
      of them join the current log transaction, effectively waiting for any log
      transaction writeout (due to acquiring the root's log_mutex). This made it
      safe even after leaving the current log transaction, because we remained
      with the log pinned when we called btrfs_log_new_name().
      
      Then in commit 259c4b96 ("btrfs: stop doing unnecessary log updates
      during a rename"), we removed the log pinning from btrfs_rename() and
      stopped calling btrfs_del_inode_ref_in_log() and
      btrfs_del_dir_entries_in_log() during the rename, and started to do all
      the needed work at btrfs_log_new_name(), but without joining the current
      log transaction, only pinning the log, which is racy because another task
      may have started writeout of the log tree right before we pinned the log.
      
      Both commits landed in kernel 5.18, so it doesn't make any practical
      difference which should be blamed, but I'm blaming the second commit only
      because with the first one, by chance, the problem did not happen due to
      the fact we joined the log transaction after pinning the log and unpinned
      it only after calling btrfs_log_new_name().
      
      So make btrfs_log_new_name() join the current log transaction instead of
      pinning it, so that we never do log updates if it's writeout is starting.
      
      Fixes: 259c4b96 ("btrfs: stop doing unnecessary log updates during a rename")
      CC: stable@vger.kernel.org # 5.18+
      Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Tested-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      723df2bc
    • Nikolay Borisov's avatar
      btrfs: simplify error handling in btrfs_lookup_dentry · fc8b235f
      Nikolay Borisov authored
      In btrfs_lookup_dentry releasing the reference of the sub_root and the
      running orphan cleanup should only happen if the dentry found actually
      represents a subvolume. This can only be true in the 'else' branch as
      otherwise either fixup_tree_root_location returned an ENOENT error, in
      which case sub_root wouldn't have been changed or if we got a different
      errno this means btrfs_get_fs_root couldn't have executed successfully
      again meaning sub_root will equal to root. So simplify all the branches
      by moving the code into the 'else'.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc8b235f
    • Filipe Manana's avatar
      btrfs: send: always use the rbtree based inode ref management infrastructure · 0d8869fb
      Filipe Manana authored
      After the patch "btrfs: send: fix sending link commands for existing file
      paths", we now have two infrastructures to detect and eliminate duplicated
      inode references (due to names that got removed and re-added between the
      send and parent snapshots):
      
      1) One that works on a single inode ref/extref item;
      
      2) A new one that works acrosss all ref/extref items for an inode, and
         it's also more efficient because even in the single ref/extref item
         case, it does not do a linear search for all the names encoded in the
         ref/extref item, it uses red black trees to speedup up the search.
      
      There's no good reason to keep both infrastructures, we can use the new
      one everywhere, and it's always more efficient.
      
      So remove the old infrastructure and change all sites that are using it
      to use the new one.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d8869fb
    • BingJing Chang's avatar
      btrfs: send: fix sending link commands for existing file paths · 3aa5bd36
      BingJing Chang authored
      There is a bug sending link commands for existing file paths. When we're
      processing an inode, we go over all references. All the new file paths are
      added to the "new_refs" list. And all the deleted file paths are added to
      the "deleted_refs" list. In the end, when we finish processing the inode,
      we iterate over all the items in the "new_refs" list and send link commands
      for those file paths. After that, we go over all the items in the
      "deleted_refs" list and send unlink commands for them. If there are
      duplicated file paths in both lists, we will try to create them before we
      remove them. Then the receiver gets an -EEXIST error when trying the link
      operations.
      
      Example for having duplicated file paths in both list:
      
        $ btrfs subvolume create vol
      
        # create a file and 2000 hard links to the same inode
        $ touch vol/foo
        $ for i in {1..2000}; do link vol/foo vol/$i ; done
      
        # take a snapshot for a parent snapshot
        $ btrfs subvolume snapshot -r vol snap1
      
        # remove 2000 hard links and re-create the last 1000 links
        $ for i in {1..2000}; do rm vol/$i; done;
        $ for i in {1001..2000}; do link vol/foo vol/$i; done
      
        # take another one for a send snapshot
        $ btrfs subvolume snapshot -r vol snap2
      
        $ mkdir receive_dir
        $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
        At subvol snap2
        link 1238 -> foo
        ERROR: link 1238 -> foo failed: File exists
      
      In this case, we will have the same file paths added to both lists. In the
      parent snapshot, reference paths {1..1237} are stored in inode references,
      but reference paths {1238..2000} are stored in inode extended references.
      In the send snapshot, all reference paths {1001..2000} are stored in inode
      references. During the incremental send, we process their inode references
      first. In record_changed_ref(), we iterate all its inode references in the
      send/parent snapshot. For every inode reference, we also use find_iref() to
      check whether the same file path also appears in the parent/send snapshot
      or not. Inode references {1238..2000} which appear in the send snapshot but
      not in the parent snapshot are added to the "new_refs" list. On the other
      hand, Inode references {1..1000} which appear in the parent snapshot but
      not in the send snapshot are added to the "deleted_refs" list. Next, when
      we process their inode extended references, reference paths {1238..2000}
      are added to the "deleted_refs" list because all of them only appear in the
      parent snapshot. Now two lists contain items as below:
      "new_refs" list: {1238..2000}
      "deleted_refs" list: {1..1000}, {1238..2000}
      
      Reference paths {1238..2000} appear in both lists. And as the processing
      order mentioned about before, the receiver gets an -EEXIST error when trying
      the link operations.
      
      To fix the bug, the idea is to process the "deleted_refs" list before
      the "new_refs" list. However, it's not easy to reshuffle the processing
      order. For one reason, if we do so, we may unlink all the existing paths
      first, there's no valid path anymore for links. And it's inefficient
      because we do a bunch of unlinks followed by links for the same paths.
      Moreover, it makes less sense to have duplications in both lists. A
      reference path cannot not only be regarded as new but also has been seen in
      the past, or we won't call it a new path. However, it's also not a good
      idea to make find_iref() check a reference against all inode references
      and all inode extended references because it may result in large disk
      reads.
      
      So we introduce two rbtrees to make the references easier for lookups.
      And we also introduce record_new_ref_if_needed() and
      record_deleted_ref_if_needed() for changed_ref() to check and remove
      duplicated references early.
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3aa5bd36
    • BingJing Chang's avatar
      btrfs: send: introduce recorded_ref_alloc and recorded_ref_free · 71ecfc13
      BingJing Chang authored
      Introduce wrappers to allocate and free recorded_ref structures.
      Reviewed-by: default avatarRobbie Ko <robbieko@synology.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71ecfc13
    • Naohiro Aota's avatar
      btrfs: zoned: wait until zone is finished when allocation didn't progress · 2ce543f4
      Naohiro Aota authored
      When the allocated position doesn't progress, we cannot submit IOs to
      finish a block group, but there should be ongoing IOs that will finish a
      block group. So, in that case, we wait for a zone to be finished and retry
      the allocation after that.
      
      Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
      indicate we need a zone finish to have proceeded. The flag is set when the
      allocator detected it cannot activate a new block group. And, it is cleared
      once a zone is finished.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2ce543f4
    • Naohiro Aota's avatar
      btrfs: zoned: write out partially allocated region · 898793d9
      Naohiro Aota authored
      cow_file_range() works in an all-or-nothing way: if it fails to allocate an
      extent for a part of the given region, it gives up all the region including
      the successfully allocated parts. On cow_file_range(), run_delalloc_zoned()
      writes data for the region only when it successfully allocate all the
      region.
      
      This all-or-nothing allocation and write-out are problematic when available
      space in all the block groups are get tight with the active zone
      restriction. btrfs_reserve_extent() try hard to utilize the left space in
      the active block groups and gives up finally and fails with
      -ENOSPC. However, if we send IOs for the successfully allocated region, we
      can finish a zone and can continue on the rest of the allocation on a newly
      allocated block group.
      
      This patch implements the partial write-out for run_delalloc_zoned(). With
      this patch applied, cow_file_range() returns -EAGAIN to tell the caller to
      do something to progress the further allocation, and tells the successfully
      allocated region with done_offset. Furthermore, the zoned extent allocator
      returns -EAGAIN to tell cow_file_range() going back to the caller side.
      
      Actually, we still need to wait for an IO to complete to continue the
      allocation. The next patch implements that part.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      898793d9
    • Naohiro Aota's avatar
      btrfs: zoned: activate necessary block group · b6a98021
      Naohiro Aota authored
      There are two places where allocating a chunk is not enough. These two
      places are trying to ensure the space by allocating a chunk. To meet the
      condition for active_total_bytes, we also need to activate a block group
      there.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b6a98021
    • Naohiro Aota's avatar
      btrfs: zoned: activate metadata block group on flush_space · b0931513
      Naohiro Aota authored
      For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
      means we don't have enough space left in the active_total_bytes. Before
      allocating a new chunk, we can try to activate an existing block group
      in this case.
      
      Also, allocating a chunk is not enough to grant a ticket for metadata
      space on zoned filesystem we need to activate the block group to
      increase the active_total_bytes.
      
      btrfs_zoned_activate_one_bg() implements the activation feature. It will
      activate a block group by (maybe) finishing a block group. It will give up
      activating a block group if it cannot finish any block group.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0931513
    • Naohiro Aota's avatar
      btrfs: zoned: disable metadata overcommit for zoned · 79417d04
      Naohiro Aota authored
      The metadata overcommit makes the space reservation flexible but it is also
      harmful to active zone tracking. Since we cannot finish a block group from
      the metadata allocation context, we might not activate a new block group
      and might not be able to actually write out the overcommit reservations.
      
      So, disable metadata overcommit for zoned filesystems. We will ensure
      the reservations are under active_total_bytes in the following patches.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79417d04
    • Naohiro Aota's avatar
      btrfs: zoned: introduce space_info->active_total_bytes · 6a921de5
      Naohiro Aota authored
      The active_total_bytes, like the total_bytes, accounts for the total bytes
      of active block groups in the space_info.
      
      With an introduction of active_total_bytes, we can check if the reserved
      bytes can be written to the block groups without activating a new block
      group. The check is necessary for metadata allocation on zoned
      filesystem. We cannot finish a block group, which may require waiting
      for the current transaction, from the metadata allocation context.
      Instead, we need to ensure the ongoing allocation (reserved bytes) fits
      in active block groups.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6a921de5
    • Naohiro Aota's avatar
      btrfs: zoned: finish least available block group on data bg allocation · 393f646e
      Naohiro Aota authored
      When we run out of active zones and no sufficient space is left in any
      block groups, we need to finish one block group to make room to activate a
      new block group.
      
      However, we cannot do this for metadata block groups because we can cause a
      deadlock by waiting for a running transaction commit. So, do that only for
      a data block group.
      
      Furthermore, the block group to be finished has two requirements. First,
      the block group must not have reserved bytes left. Having reserved bytes
      means we have an allocated region but did not yet send bios for it. If that
      region is allocated by the thread calling btrfs_zone_finish(), it results
      in a deadlock.
      
      Second, the block group to be finished must not be a SYSTEM block
      group. Finishing a SYSTEM block group easily breaks further chunk
      allocation by nullifying the SYSTEM free space.
      
      In a certain case, we cannot find any zone finish candidate or
      btrfs_zone_finish() may fail. In that case, we fall back to split the
      allocation bytes and fill the last spaces left in the block groups.
      
      CC: stable@vger.kernel.org # 5.16+
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      393f646e
    • Naohiro Aota's avatar
      btrfs: let can_allocate_chunk return error · bb9950d3
      Naohiro Aota authored
      For the later patch, convert the return type from bool to int and return
      errors. No functional changes.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb9950d3
    • Naohiro Aota's avatar
      btrfs: use fs_info->max_extent_size in get_extent_max_capacity() · d7601566
      Naohiro Aota authored
      Use fs_info->max_extent_size also in get_extent_max_capacity() for the
      completeness. This is only used for defrag and not really necessary to fix
      the metadata reservation size. But, it still suppresses unnecessary defrag
      operations.
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7601566
    • Naohiro Aota's avatar
      btrfs: convert count_max_extents() to use fs_info->max_extent_size · 7d7672bc
      Naohiro Aota authored
      If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
      of extents needed, btrfs release the metadata reservation too much on its
      way to write out the data.
      
      Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
      convert count_max_extents() to use it instead, and fix the calculation of
      the metadata reservation.
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7d7672bc
    • Naohiro Aota's avatar
      btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size · f7b12a62
      Naohiro Aota authored
      On zoned filesystem, data write out is limited by max_zone_append_size,
      and a large ordered extent is split according the size of a bio. OTOH,
      the number of extents to be written is calculated using
      BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
      metadata bytes to update and/or create the metadata items.
      
      The metadata reservation is done at e.g, btrfs_buffered_write() and then
      released according to the estimation changes. Thus, if the number of extent
      increases massively, the reserved metadata can run out.
      
      The increase of the number of extents easily occurs on zoned filesystem
      if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size. And, it causes the
      following warning on a small RAM environment with disabling metadata
      over-commit (in the following patch).
      
      [75721.498492] ------------[ cut here ]------------
      [75721.505624] BTRFS: block rsv 1 returned -28
      [75721.512230] WARNING: CPU: 24 PID: 2327559 at fs/btrfs/block-rsv.c:537 btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.581854] CPU: 24 PID: 2327559 Comm: kworker/u64:10 Kdump: loaded Tainted: G        W         5.18.0-rc2-BTRFS-ZNS+ #109
      [75721.597200] Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
      [75721.607310] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
      [75721.616209] RIP: 0010:btrfs_use_block_rsv+0x560/0x760 [btrfs]
      [75721.646649] RSP: 0018:ffffc9000fbdf3e0 EFLAGS: 00010286
      [75721.654126] RAX: 0000000000000000 RBX: 0000000000004000 RCX: 0000000000000000
      [75721.663524] RDX: 0000000000000004 RSI: 0000000000000008 RDI: fffff52001f7be6e
      [75721.672921] RBP: ffffc9000fbdf420 R08: 0000000000000001 R09: ffff889f8d1fc6c7
      [75721.682493] R10: ffffed13f1a3f8d8 R11: 0000000000000001 R12: ffff88980a3c0e28
      [75721.692284] R13: ffff889b66590000 R14: ffff88980a3c0e40 R15: ffff88980a3c0e8a
      [75721.701878] FS:  0000000000000000(0000) GS:ffff889f8d000000(0000) knlGS:0000000000000000
      [75721.712601] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [75721.720726] CR2: 000055d12e05c018 CR3: 0000800193594000 CR4: 0000000000350ee0
      [75721.730499] Call Trace:
      [75721.735166]  <TASK>
      [75721.739886]  btrfs_alloc_tree_block+0x1e1/0x1100 [btrfs]
      [75721.747545]  ? btrfs_alloc_logged_file_extent+0x550/0x550 [btrfs]
      [75721.756145]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.762852]  ? btrfs_get_32+0xea/0x2d0 [btrfs]
      [75721.769520]  ? push_leaf_left+0x420/0x620 [btrfs]
      [75721.776431]  ? memcpy+0x4e/0x60
      [75721.781931]  split_leaf+0x433/0x12d0 [btrfs]
      [75721.788392]  ? btrfs_get_token_32+0x580/0x580 [btrfs]
      [75721.795636]  ? push_for_double_split.isra.0+0x420/0x420 [btrfs]
      [75721.803759]  ? leaf_space_used+0x15d/0x1a0 [btrfs]
      [75721.811156]  btrfs_search_slot+0x1bc3/0x2790 [btrfs]
      [75721.818300]  ? lock_downgrade+0x7c0/0x7c0
      [75721.824411]  ? free_extent_buffer.part.0+0x107/0x200 [btrfs]
      [75721.832456]  ? split_leaf+0x12d0/0x12d0 [btrfs]
      [75721.839149]  ? free_extent_buffer.part.0+0x14f/0x200 [btrfs]
      [75721.846945]  ? free_extent_buffer+0x13/0x20 [btrfs]
      [75721.853960]  ? btrfs_release_path+0x4b/0x190 [btrfs]
      [75721.861429]  btrfs_csum_file_blocks+0x85c/0x1500 [btrfs]
      [75721.869313]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.876085]  ? lock_release+0x552/0xf80
      [75721.881957]  ? btrfs_del_csums+0x8c0/0x8c0 [btrfs]
      [75721.888886]  ? __kasan_check_write+0x14/0x20
      [75721.895152]  ? do_raw_read_unlock+0x44/0x80
      [75721.901323]  ? _raw_write_lock_irq+0x60/0x80
      [75721.907983]  ? btrfs_global_root+0xb9/0xe0 [btrfs]
      [75721.915166]  ? btrfs_csum_root+0x12b/0x180 [btrfs]
      [75721.921918]  ? btrfs_get_global_root+0x820/0x820 [btrfs]
      [75721.929166]  ? _raw_write_unlock+0x23/0x40
      [75721.935116]  ? unpin_extent_cache+0x1e3/0x390 [btrfs]
      [75721.942041]  btrfs_finish_ordered_io.isra.0+0xa0c/0x1dc0 [btrfs]
      [75721.949906]  ? try_to_wake_up+0x30/0x14a0
      [75721.955700]  ? btrfs_unlink_subvol+0xda0/0xda0 [btrfs]
      [75721.962661]  ? rcu_read_lock_sched_held+0x16/0x80
      [75721.969111]  ? lock_acquire+0x41b/0x4c0
      [75721.974982]  finish_ordered_fn+0x15/0x20 [btrfs]
      [75721.981639]  btrfs_work_helper+0x1af/0xa80 [btrfs]
      [75721.988184]  ? _raw_spin_unlock_irq+0x28/0x50
      [75721.994643]  process_one_work+0x815/0x1460
      [75722.000444]  ? pwq_dec_nr_in_flight+0x250/0x250
      [75722.006643]  ? do_raw_spin_trylock+0xbb/0x190
      [75722.013086]  worker_thread+0x59a/0xeb0
      [75722.018511]  kthread+0x2ac/0x360
      [75722.023428]  ? process_one_work+0x1460/0x1460
      [75722.029431]  ? kthread_complete_and_exit+0x30/0x30
      [75722.036044]  ret_from_fork+0x22/0x30
      [75722.041255]  </TASK>
      [75722.045047] irq event stamp: 0
      [75722.049703] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [75722.057610] hardirqs last disabled at (0): [<ffffffff8118a94a>] copy_process+0x1c1a/0x66b0
      [75722.067533] softirqs last  enabled at (0): [<ffffffff8118a989>] copy_process+0x1c59/0x66b0
      [75722.077423] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [75722.085335] ---[ end trace 0000000000000000 ]---
      
      To fix the estimation, we need to introduce fs_info->max_extent_size to
      replace BTRFS_MAX_EXTENT_SIZE, which allow setting the different size for
      regular vs zoned filesystem.
      
      Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
      filesystem, it is set to fs_info->max_zone_append_size.
      
      CC: stable@vger.kernel.org # 5.12+
      Fixes: d8e3fb10 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7b12a62
    • Naohiro Aota's avatar
      btrfs: zoned: revive max_zone_append_bytes · c2ae7b77
      Naohiro Aota authored
      This patch is basically a revert of commit 5a80d1c6 ("btrfs: zoned:
      remove max_zone_append_size logic"), but without unnecessary ASSERT and
      check. The max_zone_append_size will be used as a hint to estimate the
      number of extents to cover delalloc/writeback region in the later commits.
      
      The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
      this commit considers it to calculate max_zone_append_size. Technically, a
      bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
      contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
      as an upper limit of an extent size to calculate the number of extents
      needed to write data.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2ae7b77
    • Naohiro Aota's avatar
      block: add bdev_max_segments() helper · 65ea1b66
      Naohiro Aota authored
      Add bdev_max_segments() like other queue parameters.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJens Axboe <axboe@kernel.dk>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      65ea1b66