1. 24 Oct, 2022 1 commit
  2. 14 Oct, 2022 1 commit
  3. 11 Oct, 2022 7 commits
    • Filipe Manana's avatar
      btrfs: ignore fiemap path cache if we have multiple leaves for a data extent · 63c84b46
      Filipe Manana authored
      The path cache used during fiemap used to determine the sharedness of
      extent buffers in a path from a leaf containing a file extent item
      pointing to our data extent up to the root node of the tree, is meant to
      be used for a single path. Having a single path is by far the most common
      case, and therefore worth to optimize for, but it's possible to actually
      have multiple paths because we have 2 or more leaves.
      
      If we have multiple leaves, the 'level' variable keeps getting incremented
      in each iteration of the while loop at btrfs_is_data_extent_shared(),
      which means we will treat the second leaf in the 'tmp' ulist as a level 1
      node, and so forth. In the worst case this can lead to getting a level
      greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
      WARN_ON_ONCE() in the functions to lookup from or store in the path cache
      (lookup_backref_shared_cache() and store_backref_shared_cache()). If the
      current level never goes beyond 8, due to shared nodes in the paths and
      a fs tree height smaller than 8, it can still result in incorrectly
      marking one leaf as shared because some other leaf is shared and is stored
      one level below that other leaf, as when storing a true sharedness value
      in the cache results in updating the sharedness to true of all entries in
      the cache below the current level.
      
      Having multiple leaves happens in a case like the following:
      
        - We have a file extent item point to data extent at bytenr X, for
          a file range [0, 1M[ for example;
      
        - At this moment we have an extent data ref for the extent, with
          an offset of 0 and a count of 1;
      
        - A write into the middle of the extent happens, file range [64K, 128K)
          so the file extent item is split into two (at btrfs_drop_extents()):
      
          1) One for file range [0, 64K), with a length (num_bytes field) of
             64K and an extent offset of 0;
      
          2) Another one for file range [128K, 1M), with a length of 896K
             (1M - 128K) and an extent offset of 128K.
      
        - At this moment the two file extent items are located in the same
          leaf;
      
        - A new file extent item for the range [64K, 128K), pointing to a new
          data extent, is inserted in the leaf. This results in a leaf split
          and now those two file extent items pointing to data extent X end
          up located in different leaves;
      
        - Once delayed refs are run, we still have a single extent data ref
          item for our data extent at bytenr X, for offset 0, but now with a
          count of 2 instead of 1;
      
        - So during fiemap, at btrfs_is_data_extent_shared(), after we call
          find_parent_nodes() for the data extent, we get two leaves, since
          we have two file extent items point to data extent at bytenr X that
          are located in two different leaves.
      
      So skip the use of the path cache when we get more than one leaf.
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63c84b46
    • Filipe Manana's avatar
      btrfs: fix processing of delayed tree block refs during backref walking · 943553ef
      Filipe Manana authored
      During backref walking, when processing a delayed reference with a type of
      BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
      
      1) We are accessing the delayed references extent_op, and its key, without
         the protection of the delayed ref head's lock;
      
      2) If there's no extent op for the delayed ref head, we end up with an
         uninitialized key in the stack, variable 'tmp_op_key', and then pass
         it to add_indirect_ref(), which adds the reference to the indirect
         refs rb tree.
      
         This is wrong, because indirect references should have a NULL key
         when we don't have access to the key, and in that case they should be
         added to the indirect_missing_keys rb tree and not to the indirect rb
         tree.
      
         This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
         from freeing an extent buffer, therefore with a count of -1, it will
         not cancel out the corresponding reference we have in the extent tree
         (with a count of 1), since both references end up in different rb
         trees.
      
         When using fiemap, where we often need to check if extents are shared
         through shared subtrees resulting from snapshots, it means we can
         incorrectly report an extent as shared when it's no longer shared.
         However this is temporary because after the transaction is committed
         the extent is no longer reported as shared, as running the delayed
         reference results in deleting the tree block reference from the extent
         tree.
      
         Outside the fiemap context, the result is unpredictable, as the key was
         not initialized but it's used when navigating the rb trees to insert
         and search for references (prelim_ref_compare()), and we expect all
         references in the indirect rb tree to have valid keys.
      
      The following reproducer triggers the second bug:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
      
         # With a compressed 128M file we get a tree height of 2 (level 1 root).
         xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
      
         btrfs subvolume snapshot $MNT $MNT/snap
      
         # Fiemap should output 0x2008 in the flags column.
         # 0x2000 means shared extent
         # 0x8 means encoded extent (because it's compressed)
         echo
         echo "fiemap after snapshot, range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         # Overwrite one extent and fsync to flush delalloc and COW a new path
         # in the snapshot's tree.
         #
         # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
         # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
         # buffer in the path.
         #
         # In the extent tree we have inline references of type
         # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
         # buffers, so they should cancel each other, and the extent buffers in
         # the fs tree should no longer be considered as shared.
         #
         echo "Overwriting file range [120M, 120M + 128K)..."
         xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
         xfs_io -c "fsync" $MNT/snap/foo
      
         # Fiemap should output 0x8 in the flags column. The extent in the range
         # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
         # tree.
         echo
         echo "fiemap after overwrite range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         umount $MNT
      
      Running it before this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
      The extent in the range [120M, 120M + 128K) is still reported as shared
      (0x2000 bit set) after overwriting that range and flushing delalloc, which
      is not correct - an entire path was COWed in the snapshot's tree and the
      extent is now only referenced by the original fs tree.
      
      Running it after this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256   0x8
      
      Now the extent is not reported as shared anymore.
      
      So fix this by passing a NULL key pointer to add_indirect_ref() when
      processing a delayed reference for a tree block if there's no extent op
      for our delayed ref head with a defined key. Also access the extent op
      only after locking the delayed ref head's lock.
      
      The reproducer will be converted later to a test case for fstests.
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Fixes: a6dbceaf ("btrfs: Remove unused op_key var from add_delayed_refs")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      943553ef
    • Filipe Manana's avatar
      btrfs: fix processing of delayed data refs during backref walking · 4fc7b572
      Filipe Manana authored
      When processing delayed data references during backref walking and we are
      using a share context (we are being called through fiemap), whenever we
      find a delayed data reference for an inode different from the one we are
      interested in, then we immediately exit and consider the data extent as
      shared. This is wrong, because:
      
      1) This might be a DROP reference that will cancel out a reference in the
         extent tree;
      
      2) Even if it's an ADD reference, it may be followed by a DROP reference
         that cancels it out.
      
      In either case we should not exit immediately.
      
      Fix this by never exiting when we find a delayed data reference for
      another inode - instead add the reference and if it does not cancel out
      other delayed reference, we will exit early when we call
      extent_is_shared() after processing all delayed references. If we find
      a drop reference, then signal the code that processes references from
      the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
      immediately if it finds there a reference for another inode, since we
      have delayed drop references that may cancel it out. In this later case
      we exit once we don't have references in the rb trees that cancel out
      each other and have two references for different inodes.
      
      Example reproducer for case 1):
      
         $ cat test-1.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      Example reproducer for case 2):
      
         $ cat test-2.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         # Flush delayed references to the extent tree and commit current
         # transaction.
         sync
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      After this patch, after deleting bar in both tests, the extent is not
      reported with the 0x2000 flag anymore, it gets only the flag 0x1
      (which is FIEMAP_EXTENT_LAST):
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
      These tests will later be converted to a test case for fstests.
      
      Fixes: dc046b10 ("Btrfs: make fiemap not blow when you have lots of snapshots")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4fc7b572
    • David Sterba's avatar
      btrfs: delete stale comments after merge conflict resolution · 295a53cc
      David Sterba authored
      There are two comments in btrfs_cache_block_group that I left when
      resolving conflict between commits ced8ecf0 "btrfs: fix space cache
      corruption and potential double allocations" and 527c490f "btrfs:
      delete btrfs_wait_space_cache_v1_finished".
      
      The former reworked the caching logic to wait until the caching ends in
      btrfs_cache_block_group while the latter only open coded the waiting.
      Both removed btrfs_wait_space_cache_v1_finished, the correct code is
      with the waiting and returning error. Thus the conflict resolution was
      OK.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      295a53cc
    • Josef Bacik's avatar
      btrfs: unlock locked extent area if we have contention · 9e769bd7
      Josef Bacik authored
      In production we hit the following deadlock
      
      task 1			task 2			task 3
      ------			------			------
      fiemap(file)		falloc(file)		fsync(file)
      						  write(0, 1MiB)
      						  btrfs_commit_transaction()
      						    wait_on(!pending_ordered)
      			  lock(512MiB, 1GiB)
      			  start_transaction
      			    wait_on_transaction
      
        lock(0, 1GiB)
          wait_extent_bit(512MiB)
      
      task 4
      ------
      finish_ordered_extent(0, 1MiB)
        lock(0, 1MiB)
        **DEADLOCK**
      
      This occurs because when task 1 does it's lock, it locks everything from
      0-512MiB, and then waits for the 512MiB chunk to unlock.  task 2 will
      never unlock because it's waiting on the transaction commit to happen,
      the transaction commit is waiting for the outstanding ordered extents,
      and then the ordered extent thread is blocked waiting on the 0-1MiB
      range to unlock.
      
      To fix this we have to clear anything we've locked so far, wait for the
      extent_state that we contended on, and then try to re-lock the entire
      range again.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e769bd7
    • David Sterba's avatar
      btrfs: send: update command for protocol version check · c86eab81
      David Sterba authored
      For a protocol and command compatibility we have a helper that hasn't
      been updated for v3 yet. We use it for verity so update where necessary.
      
      Fixes: 38622010 ("btrfs: send: add support for fs-verity")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c86eab81
    • Boris Burkov's avatar
      btrfs: send: allow protocol version 3 with CONFIG_BTRFS_DEBUG · 9971a741
      Boris Burkov authored
      We haven't finalized send stream v3 yet, so gate the send stream version
      behind CONFIG_BTRFS_DEBUG as we want some way to test it.
      
      The original verity send did not check the protocol version, so add that
      actual protection as well.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9971a741
  4. 07 Oct, 2022 1 commit
    • Filipe Manana's avatar
      btrfs: add missing path cache update during fiemap · 96dbcc00
      Filipe Manana authored
      When looking the stored result for a cached path node, if the stored
      result is valid and has a value of true, we must update all the nodes for
      all levels below it with a result of true as well. This is necessary when
      moving from one leaf in the fs tree to the next one, as well as when
      moving from a node at any level to the next node at the same level.
      
      Currently this logic is missing as it was somehow forgotten by a recent
      patch with the subject: "btrfs: speedup checking for extent sharedness
      during fiemap".
      
      This adds the missing logic, which is the counter part to what we do
      when adding a shared node to the cache at store_backref_shared_cache().
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      96dbcc00
  5. 29 Sep, 2022 25 commits
    • Tetsuo Handa's avatar
      btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer · cbddcc4f
      Tetsuo Handa authored
      syzbot is reporting uninit-value in btrfs_clean_tree_block() [1], for
      commit bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      missed that btrfs_set_header_generation() in btrfs_init_new_buffer() must
      not be moved to after clean_tree_block() because clean_tree_block() is
      calling btrfs_header_generation() since commit 55c69072 ("Btrfs:
      Fix extent_buffer usage when nodesize != leafsize").
      
      Since memzero_extent_buffer() will reset "struct btrfs_header" part, we
      can't move btrfs_set_header_generation() to before memzero_extent_buffer().
      Just re-add btrfs_set_header_generation() before btrfs_clean_tree_block().
      
      Link: https://syzkaller.appspot.com/bug?extid=fba8e2116a12609b6c59 [1]
      Reported-by: default avatarsyzbot <syzbot+fba8e2116a12609b6c59@syzkaller.appspotmail.com>
      Fixes: bc877d28 ("btrfs: Deduplicate extent_buffer init code")
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cbddcc4f
    • Filipe Manana's avatar
      btrfs: drop extent map range more efficiently · db21370b
      Filipe Manana authored
      Currently when dropping extent maps for a file range, through
      btrfs_drop_extent_map_range(), we do the following non-optimal things:
      
      1) We lookup for extent maps one by one, always starting the search from
         the root of the extent map tree. This is not efficient if we have
         multiple extent maps in the range;
      
      2) We check on every iteration if we have the 'split' and 'split2' spare
         extent maps in case we need to split an extent map that intersects our
         range but also crosses its boundaries (to the left, to the right or
         both cases). If our target range is for example:
      
             [2M, 8M)
      
         And we have 3 extents maps in the range:
      
             [1M, 3M) [3M, 6M) [6M, 10M[
      
         The on the first iteration we allocate two extent maps for 'split' and
         'split2', and use the 'split' to split the first extent map, so after
         the split we set 'split' to 'split2' and then set 'split2' to NULL.
      
         On the second iteration, we don't need to split the second extent map,
         but because 'split2' is now NULL, we allocate a new extent map for
         'split2'.
      
         On the third iteration we need to split the third extent map, so we
         use the extent map pointed by 'split'.
      
         So we ended up allocating 3 extent maps for splitting, but all we
         needed was 2 extent maps. We never need to allocate more than 2,
         because extent maps that need to be split are always the first one
         and the last one in the target range.
      
      Improve on this by:
      
      1) Using rb_next() to move on to the next extent map. This results in
         iterating over less nodes of the tree and it does not require comparing
         the ranges of nodes to our start/end offset;
      
      2) Allocate the 2 extent maps for splitting before entering the loop and
         never allocate more than 2. In practice it's very rare to have the
         combination of both extent map allocations fail, since we have a
         dedicated slab for extent maps, and also have the need to split two
         extent maps.
      
      This patch is part of a patchset comprised of the following patches:
      
         btrfs: fix missed extent on fsync after dropping extent maps
         btrfs: move btrfs_drop_extent_cache() to extent_map.c
         btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
         btrfs: use cond_resched_rwlock_write() during inode eviction
         btrfs: move open coded extent map tree deletion out of inode eviction
         btrfs: add helper to replace extent map range with a new extent map
         btrfs: remove the refcount warning/check at free_extent_map()
         btrfs: remove unnecessary extent map initializations
         btrfs: assert tree is locked when clearing extent map from logging
         btrfs: remove unnecessary NULL pointer checks when searching extent maps
         btrfs: remove unnecessary next extent map search
         btrfs: avoid pointless extent map tree search when flushing delalloc
         btrfs: drop extent map range more efficiently
      
      And the following fio test was done before and after applying the whole
      patchset, on a non-debug kernel (Debian's default kernel config) on a 12
      cores Intel box with 64G of ram:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [writers]
         rw=randwrite
         fsync=8
         fallocate=none
         group_reporting=1
         direct=0
         bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
         ioengine=psync
         filesize=2G
         runtime=300
         time_based
         directory=$MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         fio /tmp/fio-job.ini
      
         umount $MNT
      
      Result before applying the patchset:
      
         WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec
      
      Result after applying the patchset:
      
         WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      db21370b
    • Filipe Manana's avatar
      btrfs: avoid pointless extent map tree search when flushing delalloc · b54bb865
      Filipe Manana authored
      When flushing delalloc, in COW mode at cow_file_range(), before entering
      the loop that allocates extents and creates ordered extents, we do a call
      to btrfs_drop_extent_map_range() for the whole range. This is pointless
      because in the loop we call create_io_em(), which will also call
      btrfs_drop_extent_map_range() before inserting the new extent map.
      
      So remove that call at cow_file_range() not only because it is not needed,
      but also because it will make the btrfs_drop_extent_map_range() calls made
      from create_io_em() waste time searching the extent map tree, and that
      tree can be large for files with many extents. It also makes us waste time
      at btrfs_drop_extent_map_range() allocating and freeing the split extent
      maps for nothing.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b54bb865
    • Filipe Manana's avatar
      btrfs: remove unnecessary next extent map search · 6c05813e
      Filipe Manana authored
      At __tree_search(), and its single caller __lookup_extent_mapping(), there
      is no point in finding the next extent map that starts after the search
      offset if we were able to find the previous extent map that ends before
      our search offset, because __lookup_extent_mapping() ignores the next
      acceptable extent map if we were able to find the previous one.
      
      So just return immediately if we were able to find the previous extent
      map, therefore avoiding wasting time iterating the tree looking for the
      next extent map which will not be used by __lookup_extent_mapping().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c05813e
    • Filipe Manana's avatar
      btrfs: remove unnecessary NULL pointer checks when searching extent maps · 08f088dd
      Filipe Manana authored
      The previous and next pointer arguments passed to __tree_search() are
      never NULL as the only caller of this function, __lookup_extent_mapping(),
      always passes the address of two on stack pointers. So remove the NULL
      checks and add assertions to verify the pointers.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      08f088dd
    • Filipe Manana's avatar
      btrfs: assert tree is locked when clearing extent map from logging · 74333c7d
      Filipe Manana authored
      When calling clear_em_logging() we should have a write lock on the extent
      map tree, as we will try to merge the extent map with the previous and
      next ones in the tree. So assert that we have a write lock.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74333c7d
    • Filipe Manana's avatar
      btrfs: remove unnecessary extent map initializations · 2e0cdaa0
      Filipe Manana authored
      When allocating an extent map, we use kmem_cache_zalloc() which guarantees
      the returned memory is initialized to zeroes, therefore it's pointless
      to initialize the generation and flags of the extent map to zero again.
      
      Remove those initializations, as they are pointless and slightly increase
      the object text size.
      
      Before removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9241	    274	     24	   9539	   2543	fs/btrfs/extent_map.o
      
      After removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9209	    274	     24	   9507	   2523	fs/btrfs/extent_map.o
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2e0cdaa0
    • Filipe Manana's avatar
      btrfs: remove the refcount warning/check at free_extent_map() · ad5d6e91
      Filipe Manana authored
      At free_extent_map(), it's pointless to have a WARN_ON() to check if the
      refcount of the extent map is zero. Such check is already done by the
      refcount_t module and refcount_dec_and_test(), which loudly complains if
      we try to decrement a reference count that is currently 0.
      
      The WARN_ON() dates back to the time when used a regular atomic_t type
      for the reference counter, before we switched to the refcount_t type.
      The main goal of the refcount_t type/module is precisely to catch such
      types of bugs and loudly complain if they happen.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ad5d6e91
    • Filipe Manana's avatar
      btrfs: add helper to replace extent map range with a new extent map · a1ba4c08
      Filipe Manana authored
      We have several places that need to drop all the extent maps in a given
      file range and then add a new extent map for that range. Currently they
      call btrfs_drop_extent_map_range() to delete all extent maps in the range
      and then keep trying to add the new extent map in a loop that keeps
      retrying while the insertion of the new extent map fails with -EEXIST.
      
      So instead of repeating this logic, add a helper to extent_map.c that
      does these steps and name it btrfs_replace_extent_map_range(). Also add
      a comment about why the retry loop is necessary.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a1ba4c08
    • Filipe Manana's avatar
      btrfs: move open coded extent map tree deletion out of inode eviction · 9c9d1b4f
      Filipe Manana authored
      Move the loop that removes all the extent maps from the inode's extent
      map tree during inode eviction out of inode.c and into extent_map.c, to
      btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
      extent map tree should be in extent_map.c.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9c9d1b4f
    • Filipe Manana's avatar
      btrfs: use cond_resched_rwlock_write() during inode eviction · 99ba0c81
      Filipe Manana authored
      At evict_inode_truncate_pages(), instead of manually checking if
      rescheduling is needed, then unlock the extent map tree, reschedule and
      then write lock again the tree, use the helper cond_resched_rwlock_write()
      which does all that.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      99ba0c81
    • Filipe Manana's avatar
      btrfs: use extent_map_end() at btrfs_drop_extent_map_range() · f3109e33
      Filipe Manana authored
      Instead of open coding the end offset calculation of an extent map, use
      the helper extent_map_end() and cache its result in a local variable,
      since it's used several times.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3109e33
    • Filipe Manana's avatar
      btrfs: move btrfs_drop_extent_cache() to extent_map.c · 4c0c8cfc
      Filipe Manana authored
      The function btrfs_drop_extent_cache() doesn't really belong at file.c
      because what it does is drop a range of extent maps for a file range.
      It directly allocates and manipulates extent maps, by dropping,
      splitting and replacing them in an extent map tree, so it should be
      located at extent_map.c, where all manipulations of an extent map tree
      and its extent maps are supposed to be done.
      
      So move it out of file.c and into extent_map.c. Additionally do the
      following changes:
      
      1) Rename it into btrfs_drop_extent_map_range(), as this makes it more
         clear about what it does. The term "cache" is a bit confusing as it's
         not widely used, "extent maps" or "extent mapping" is much more common;
      
      2) Change its 'skip_pinned' argument from int to bool;
      
      3) Turn several of its local variables from int to bool, since they are
         used as booleans;
      
      4) Move the declaration of some variables out of the function's main
         scope and into the scopes where they are used;
      
      5) Remove pointless assignment of false to 'modified' early in the while
         loop, as later that variable is set and it's not used before that
         second assignment;
      
      6) Remove checks for NULL before calling free_extent_map().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c0c8cfc
    • Filipe Manana's avatar
      btrfs: fix missed extent on fsync after dropping extent maps · cef7820d
      Filipe Manana authored
      When dropping extent maps for a range, through btrfs_drop_extent_cache(),
      if we find an extent map that starts before our target range and/or ends
      before the target range, and we are not able to allocate extent maps for
      splitting that extent map, then we don't fail and simply remove the entire
      extent map from the inode's extent map tree.
      
      This is generally fine, because in case anyone needs to access the extent
      map, it can just load it again later from the respective file extent
      item(s) in the subvolume btree. However, if that extent map is new and is
      in the list of modified extents, then a fast fsync will miss the parts of
      the extent that were outside our range (that needed to be split),
      therefore not logging them. Fix that by marking the inode for a full
      fsync. This issue was introduced after removing BUG_ON()s triggered when
      the split extent map allocations failed, done by commit 7014cdb4
      ("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, and
      the fast fsync path already existed but was very recent.
      
      Also, in the case where we could allocate extent maps for the split
      operations but then fail to add a split extent map to the tree, mark the
      inode for a full fsync as well. This is not supposed to ever fail, and we
      assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
      not set), it's the correct thing to do to make sure a fast fsync will not
      miss a new extent.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cef7820d
    • Jeff Layton's avatar
      btrfs: remove stale prototype of btrfs_write_inode · 3050dfa6
      Jeff Layton authored
      This function no longer exists, was removed in 3c427693 ("Btrfs: fix
      btrfs_write_inode vs delayed iput deadlock").
      Signed-off-by: default avatarJeff Layton <jlayton@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3050dfa6
    • Stefan Roesch's avatar
      btrfs: enable nowait async buffered writes · 926078b2
      Stefan Roesch authored
      Enable nowait async buffered writes in btrfs_do_write_iter() and
      btrfs_file_open().
      
      In this version encoded buffered writes have the optimization not
      enabled. Encoded writes are enabled by using an ioctl. io_uring
      currently does not support ioctls. This might be enabled in the future.
      
      Performance results:
      
        For fio the following results have been obtained with a queue depth of
        1 and 4k block size (runtime 600 secs):
      
                       sequential writes:
                       without patch           with patch      libaio     psync
        iops:              55k                    134k          117K       148K
        bw:               221MB/s                 538MB/s       469MB/s    592MB/s
        clat:           15286ns                    82ns         994ns     6340ns
      
      For an io depth of 1, the new patch improves throughput by over two
      times (compared to the existing behavior, where buffered writes are
      processed by an io-worker process) and also the latency is considerably
      reduced. To achieve the same or better performance with the existing
      code an io depth of 4 is required.  Increasing the iodepth further does
      not lead to improvements.
      
      The tests have been run like this:
      
      ./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      
      Testing:
        This patch has been tested with xfstests, fsx, fio. xfstests shows no new
        diffs compared to running without the patch series.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      926078b2
    • Stefan Roesch's avatar
      btrfs: assert nowait mode is not used for some btree search functions · c922b016
      Stefan Roesch authored
      Adds nowait asserts to btree search functions which are not used by
      buffered IO and direct IO paths.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c922b016
    • Stefan Roesch's avatar
      btrfs: make btrfs_buffered_write nowait compatible · 965f47ae
      Stefan Roesch authored
      We need to avoid unconditionally calling balance_dirty_pages_ratelimited
      as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags
      with the BDP_ASYNC in case the buffered write is nowait, returning
      EAGAIN eventually.
      
      It also moves the function after the again label. This can cause the
      function to be called a bit later, but this should have no impact in the
      real world.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      965f47ae
    • Stefan Roesch's avatar
      btrfs: plumb NOWAIT through the write path · 304e45ac
      Stefan Roesch authored
      We have everywhere setup for nowait, plumb NOWAIT through the write path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      304e45ac
    • Stefan Roesch's avatar
      btrfs: make lock_and_cleanup_extent_if_need nowait compatible · 2fcab928
      Stefan Roesch authored
      Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the
      nowait parameter is specified we try to lock the extent in nowait mode.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2fcab928
    • Stefan Roesch's avatar
      btrfs: make prepare_pages nowait compatible · fc226000
      Stefan Roesch authored
      Add nowait parameter to the prepare_pages function. In case nowait is
      specified for an async buffered write request, do a nowait allocation or
      return -EAGAIN.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc226000
    • Josef Bacik's avatar
      btrfs: make btrfs_check_nocow_lock nowait compatible · 80f9d241
      Josef Bacik authored
      Now all the helpers that btrfs_check_nocow_lock uses handle nowait, add
      a nowait flag to btrfs_check_nocow_lock so it can be used by the write
      path.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      80f9d241
    • Josef Bacik's avatar
      btrfs: add btrfs_try_lock_ordered_range · d2c7a19f
      Josef Bacik authored
      For IOCB_NOWAIT we're going to want to use try lock on the extent lock,
      and simply bail if there's an ordered extent in the range because the
      only choice there is to wait for the ordered extent to complete.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d2c7a19f
    • Josef Bacik's avatar
      btrfs: add the ability to use NO_FLUSH for data reservations · 1daedb1d
      Josef Bacik authored
      In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH
      data reservations, so plumb this through the delalloc reservation
      system.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1daedb1d
    • Josef Bacik's avatar
      btrfs: make can_nocow_extent nowait compatible · 26ce9114
      Josef Bacik authored
      If we have NOWAIT specified on our IOCB and we're writing into a
      PREALLOC or NOCOW extent then we need to be able to tell
      can_nocow_extent that we don't want to wait on any locks or metadata IO.
      Fix can_nocow_extent to allow for NOWAIT.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      26ce9114
  6. 26 Sep, 2022 5 commits
    • Josef Bacik's avatar
      btrfs: implement a nowait option for tree searches · 857bc13f
      Josef Bacik authored
      For NOWAIT IOCBs we'll need a way to tell search to not wait on locks
      or anything.  Accomplish this by adding a path->nowait flag that will
      use trylocks and skip reading of metadata, returning -EAGAIN in either
      of these cases.  For now we only need this for reads, so only the read
      side is handled.  Add an ASSERT() to catch anybody trying to use this
      for writes so they know they'll have to implement the write side.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      857bc13f
    • Stefan Roesch's avatar
      mm: export balance_dirty_pages_ratelimited_flags() · 611df5d6
      Stefan Roesch authored
      Export the function balance_dirty_pages_ratelimited_flags(). It is now
      also called from btrfs.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      611df5d6
    • Qu Wenruo's avatar
      btrfs: relax block-group-tree feature dependency checks · d7f67ac9
      Qu Wenruo authored
      [BUG]
      When one user did a wrong attempt to clear block group tree, which can
      not be done through mount option, by using "-o clear_cache,space_cache=v2",
      it will cause the following error on a fs with block-group-tree feature:
      
        BTRFS info (device dm-1): force clearing of disk cache
        BTRFS info (device dm-1): using free space tree
        BTRFS info (device dm-1): clearing free space tree
        BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
        BTRFS info (device dm-1): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
        BTRFS error (device dm-1): block-group-tree feature requires fres-space-tree and no-holes
        BTRFS error (device dm-1): super block corruption detected before writing it to disk
        BTRFS: error (device dm-1) in write_all_supers:4318: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
        BTRFS warning (device dm-1: state E): Skipping commit of aborted transaction.
      
      [CAUSE]
      Although the dependency for block-group-tree feature is just an
      artificial one (to reduce test matrix), we put the dependency check into
      btrfs_validate_super().
      
      This is too strict, and during space cache clearing, we will have a
      window where free space tree is cleared, and we need to commit the super
      block.
      
      In that window, we had block group tree without v2 cache, and triggered
      the artificial dependency check.
      
      This is not necessary at all, especially for such a soft dependency.
      
      [FIX]
      Introduce a new helper, btrfs_check_features(), to do all the runtime
      limitation checks, including:
      
      - Unsupported incompat flags check
      
      - Unsupported compat RO flags check
      
      - Setting missing incompat flags
      
      - Artificial feature dependency checks
        Currently only block group tree will rely on this.
      
      - Subpage runtime check for v1 cache
      
      With this helper, we can move quite some checks from
      open_ctree()/btrfs_remount() into it, and just call it after
      btrfs_parse_options().
      
      Now "-o clear_cache,space_cache=v2" will not trigger the above error
      anymore.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ edit messages ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7f67ac9
    • Qu Wenruo's avatar
      btrfs: move end_io_func argument to btrfs_bio_ctrl structure · 5467abba
      Qu Wenruo authored
      For function submit_extent_page() and alloc_new_bio(), we have an
      argument @end_io_func to indicate the end io function.
      
      But that function never change inside any call site of them, thus no
      need to pass the pointer around everywhere.
      
      There is a better match for the lifespan of all the call sites, as we
      have btrfs_bio_ctrl structure, thus we can put the endio function
      pointer there, and grab the pointer every time we allocate a new bio.
      
      Also add extra ASSERT()s to make sure every call site of
      submit_extent_page() and alloc_new_bio() has properly set the pointer
      inside btrfs_bio_ctrl.
      
      This removes one argument from the already long argument list of
      submit_extent_page().
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5467abba
    • Qu Wenruo's avatar
      btrfs: switch page and disk_bytenr argument position for submit_extent_page() · 209ecde5
      Qu Wenruo authored
      Normally we put (page, pg_len, pg_offset) arguments together, just like
      what __bio_add_page() does.
      
      But in submit_extent_page(), what we got is, (page, disk_bytenr, pg_len,
      pg_offset), which sometimes can be confusing.
      
      Change the order to (disk_bytenr, page, pg_len, pg_offset) to make it
      to follow the common schema.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      209ecde5