1. 29 Sep, 2022 24 commits
    • btrfs: drop extent map range more efficiently · db21370b
      Filipe Manana authored
      Currently when dropping extent maps for a file range, through
      btrfs_drop_extent_map_range(), we do the following non-optimal things:
      
      1) We look up extent maps one by one, always starting the search from
         the root of the extent map tree. This is not efficient if we have
         multiple extent maps in the range;
      
      2) We check on every iteration if we have the 'split' and 'split2' spare
         extent maps in case we need to split an extent map that intersects our
         range but also crosses its boundaries (to the left, to the right, or
         both). If our target range is for example:
      
             [2M, 8M)
      
         And we have 3 extent maps in the range:
      
             [1M, 3M) [3M, 6M) [6M, 10M)
      
         Then on the first iteration we allocate two extent maps for 'split'
         and 'split2', and use 'split' to split the first extent map, so after
         the split we set 'split' to 'split2' and then set 'split2' to NULL.
      
         On the second iteration, we don't need to split the second extent map,
         but because 'split2' is now NULL, we allocate a new extent map for
         'split2'.
      
         On the third iteration we need to split the third extent map, so we
         use the extent map pointed by 'split'.
      
         So we ended up allocating 3 extent maps for splitting, but all we
         needed was 2 extent maps. We never need to allocate more than 2,
         because extent maps that need to be split are always the first one
         and the last one in the target range.
      
      Improve on this by:
      
      1) Using rb_next() to move on to the next extent map. This results in
         iterating over fewer nodes of the tree and does not require comparing
         the ranges of nodes to our start/end offset;
      
      2) Allocating the 2 extent maps for splitting before entering the loop
         and never allocating more than 2 (see the sketch right after this
         list). In practice it's very rare for both extent map allocations to
         fail, since we have a dedicated slab for extent maps, and to also end
         up needing to split two extent maps.
      
      This patch is part of a patchset comprised of the following patches:
      
         btrfs: fix missed extent on fsync after dropping extent maps
         btrfs: move btrfs_drop_extent_cache() to extent_map.c
         btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
         btrfs: use cond_resched_rwlock_write() during inode eviction
         btrfs: move open coded extent map tree deletion out of inode eviction
         btrfs: add helper to replace extent map range with a new extent map
         btrfs: remove the refcount warning/check at free_extent_map()
         btrfs: remove unnecessary extent map initializations
         btrfs: assert tree is locked when clearing extent map from logging
         btrfs: remove unnecessary NULL pointer checks when searching extent maps
         btrfs: remove unnecessary next extent map search
         btrfs: avoid pointless extent map tree search when flushing delalloc
         btrfs: drop extent map range more efficiently
      
      And the following fio test was run before and after applying the whole
      patchset, on a non-debug kernel (Debian's default kernel config) on a
      12 core Intel box with 64G of RAM:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
         cat <<EOF > /tmp/fio-job.ini
         [writers]
         rw=randwrite
         fsync=8
         fallocate=none
         group_reporting=1
         direct=0
         bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
         ioengine=psync
         filesize=2G
         runtime=300
         time_based
         directory=$MNT
         numjobs=8
         thread
         EOF
      
         echo performance | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         echo
         echo "Using config:"
         echo
         cat /tmp/fio-job.ini
         echo
      
         umount $MNT &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         fio /tmp/fio-job.ini
      
         umount $MNT
      
      Result before applying the patchset:
      
         WRITE: bw=197MiB/s (206MB/s), 197MiB/s-197MiB/s (206MB/s-206MB/s), io=57.7GiB (61.9GB), run=300188-300188msec
      
      Result after applying the patchset:
      
         WRITE: bw=203MiB/s (213MB/s), 203MiB/s-203MiB/s (213MB/s-213MB/s), io=59.5GiB (63.9GB), run=300019-300019msec
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: avoid pointless extent map tree search when flushing delalloc · b54bb865
      Filipe Manana authored
      When flushing delalloc, in COW mode at cow_file_range(), before entering
      the loop that allocates extents and creates ordered extents, we do a call
      to btrfs_drop_extent_map_range() for the whole range. This is pointless
      because in the loop we call create_io_em(), which will also call
      btrfs_drop_extent_map_range() before inserting the new extent map.
      
      So remove that call at cow_file_range() not only because it is not needed,
      but also because it will make the btrfs_drop_extent_map_range() calls made
      from create_io_em() waste time searching the extent map tree, and that
      tree can be large for files with many extents. It also makes us waste time
      at btrfs_drop_extent_map_range() allocating and freeing the split extent
      maps for nothing.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary next extent map search · 6c05813e
      Filipe Manana authored
      At __tree_search(), and its single caller __lookup_extent_mapping(), there
      is no point in finding the next extent map that starts after the search
      offset if we were able to find the previous extent map that ends before
      our search offset, because __lookup_extent_mapping() ignores the next
      acceptable extent map if we were able to find the previous one.
      
      So just return immediately if we were able to find the previous extent
      map, therefore avoiding wasting time iterating the tree looking for the
      next extent map which will not be used by __lookup_extent_mapping().
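
      Roughly, the caller's selection logic is of this shape (simplified
      sketch, the exact variable and parameter names in extent_map.c may
      differ):

         rb_node = __tree_search(&tree->map.rb_root, start, &prev, &next);
         if (!rb_node) {
                 if (prev)
                         rb_node = prev;   /* the previous extent map is preferred... */
                 else if (next)
                         rb_node = next;   /* ...so next only matters when there is no prev */
                 else
                         return NULL;
         }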
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary NULL pointer checks when searching extent maps · 08f088dd
      Filipe Manana authored
      The previous and next pointer arguments passed to __tree_search() are
      never NULL, as its only caller, __lookup_extent_mapping(), always passes
      the addresses of two on-stack pointers. So remove the NULL checks and
      add assertions to verify the pointers.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: assert tree is locked when clearing extent map from logging · 74333c7d
      Filipe Manana authored
      When calling clear_em_logging() we should have a write lock on the extent
      map tree, as we will try to merge the extent map with the previous and
      next ones in the tree. So assert that we have a write lock.
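
      With the assertion added, the helper is roughly of this shape (reduced
      sketch based on the description above):

         void clear_em_logging(struct extent_map_tree *tree, struct extent_map *em)
         {
                 lockdep_assert_held_write(&tree->lock);

                 clear_bit(EXTENT_FLAG_LOGGING, &em->flags);
                 if (extent_map_in_tree(em))
                         try_merge_map(tree, em);
         }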
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary extent map initializations · 2e0cdaa0
      Filipe Manana authored
      When allocating an extent map, we use kmem_cache_zalloc() which guarantees
      the returned memory is initialized to zeroes, therefore it's pointless
      to initialize the generation and flags of the extent map to zero again.
      
      Remove those initializations, as they are redundant and slightly increase
      the object text size.
      
      Before removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9241	    274	     24	   9539	   2543	fs/btrfs/extent_map.o
      
      After removing them:
      
         $ size fs/btrfs/extent_map.o
            text	   data	    bss	    dec	    hex	filename
            9209	    274	     24	   9507	   2523	fs/btrfs/extent_map.o
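
      For illustration, the allocation already returns zeroed memory, so
      assignments like the two below were redundant (sketch, not the exact
      removed hunk):

         em = kmem_cache_zalloc(extent_map_cache, GFP_NOFS);
         if (!em)
                 return NULL;
         /* Redundant: kmem_cache_zalloc() already zeroed the whole object. */
         em->generation = 0;
         em->flags = 0;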
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove the refcount warning/check at free_extent_map() · ad5d6e91
      Filipe Manana authored
      At free_extent_map(), it's pointless to have a WARN_ON() to check if the
      refcount of the extent map is zero. Such a check is already done by the
      refcount_t module and refcount_dec_and_test(), which loudly complains if
      we try to decrement a reference count that is currently 0.
      
      The WARN_ON() dates back to the time when we used a regular atomic_t type
      for the reference counter, before we switched to the refcount_t type.
      The main goal of the refcount_t type/module is precisely to catch such
      types of bugs and loudly complain if they happen.
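
      After the change the function is roughly of this shape (reduced sketch;
      the underflow complaint now comes from refcount_dec_and_test() itself):

         void free_extent_map(struct extent_map *em)
         {
                 if (!em)
                         return;
                 /* refcount_dec_and_test() warns on its own if em->refs is already 0. */
                 if (refcount_dec_and_test(&em->refs)) {
                         WARN_ON(extent_map_in_tree(em));
                         WARN_ON(!list_empty(&em->list));
                         kmem_cache_free(extent_map_cache, em);
                 }
         }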
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add helper to replace extent map range with a new extent map · a1ba4c08
      Filipe Manana authored
      We have several places that need to drop all the extent maps in a given
      file range and then add a new extent map for that range. Currently they
      call btrfs_drop_extent_map_range() to delete all extent maps in the range
      and then try to add the new extent map in a loop that keeps retrying
      while the insertion fails with -EEXIST.
      
      So instead of repeating this logic, add a helper to extent_map.c that
      does these steps and name it btrfs_replace_extent_map_range(). Also add
      a comment about why the retry loop is necessary.
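
      A sketch of what the helper boils down to (the exact signature and
      details in extent_map.c may differ slightly):

         int btrfs_replace_extent_map_range(struct btrfs_inode *inode,
                                            struct extent_map *new_em,
                                            bool modified)
         {
                 struct extent_map_tree *tree = &inode->extent_tree;
                 const u64 end = new_em->start + new_em->len - 1;
                 int ret;

                 do {
                         btrfs_drop_extent_map_range(inode, new_em->start, end, false);
                         write_lock(&tree->lock);
                         ret = add_extent_mapping(tree, new_em, modified);
                         write_unlock(&tree->lock);
                         /*
                          * -EEXIST means another task raced with us and inserted
                          * an extent map in the range (e.g. by loading it from a
                          * file extent item), so drop the range and try again.
                          */
                 } while (ret == -EEXIST);

                 return ret;
         }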
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move open coded extent map tree deletion out of inode eviction · 9c9d1b4f
      Filipe Manana authored
      Move the loop that removes all the extent maps from the inode's extent
      map tree during inode eviction out of inode.c and into extent_map.c, to
      btrfs_drop_extent_map_range(). Anything manipulating extent maps or the
      extent map tree should be in extent_map.c.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use cond_resched_rwlock_write() during inode eviction · 99ba0c81
      Filipe Manana authored
      At evict_inode_truncate_pages(), instead of manually checking if
      rescheduling is needed and then unlocking the extent map tree,
      rescheduling and write locking the tree again, use the helper
      cond_resched_rwlock_write(), which does all of that.
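
      The resulting pattern is roughly (sketch of the loop shape, not the full
      eviction code):

         write_lock(&map_tree->lock);
         while (!RB_EMPTY_ROOT(&map_tree->map.rb_root)) {
                 /* ... remove one extent map from the tree ... */

                 /*
                  * Replaces the open coded:
                  *   if (need_resched()) {
                  *           write_unlock(&map_tree->lock);
                  *           cond_resched();
                  *           write_lock(&map_tree->lock);
                  *   }
                  */
                 cond_resched_rwlock_write(&map_tree->lock);
         }
         write_unlock(&map_tree->lock);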
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use extent_map_end() at btrfs_drop_extent_map_range() · f3109e33
      Filipe Manana authored
      Instead of open coding the end offset calculation of an extent map, use
      the helper extent_map_end() and cache its result in a local variable,
      since it's used several times.
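
      In other words, instead of repeating "em->start + em->len", the function
      now does something along the lines of (illustrative snippet, not the
      exact hunk):

         const u64 em_end = extent_map_end(em);

         /* em_end is then reused for the boundary checks, for example: */
         if (em->start >= end || em_end <= start)
                 break;          /* the extent map does not overlap [start, end) */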
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_drop_extent_cache() to extent_map.c · 4c0c8cfc
      Filipe Manana authored
      The function btrfs_drop_extent_cache() doesn't really belong at file.c
      because what it does is drop a range of extent maps for a file range.
      It directly allocates and manipulates extent maps, by dropping,
      splitting and replacing them in an extent map tree, so it should be
      located at extent_map.c, where all manipulations of an extent map tree
      and its extent maps are supposed to be done.
      
      So move it out of file.c and into extent_map.c. Additionally do the
      following changes:
      
      1) Rename it to btrfs_drop_extent_map_range(), as this makes it clearer
         about what it does. The term "cache" is a bit confusing as it's not
         widely used; "extent maps" or "extent mapping" is much more common;
      
      2) Change its 'skip_pinned' argument from int to bool (the resulting
         prototype is sketched after this list);
      
      3) Turn several of its local variables from int to bool, since they are
         used as booleans;
      
      4) Move the declaration of some variables out of the function's main
         scope and into the scopes where they are used;
      
      5) Remove pointless assignment of false to 'modified' early in the while
         loop, as later that variable is set and it's not used before that
         second assignment;
      
      6) Remove checks for NULL before calling free_extent_map().
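
      The resulting prototype is expected to look like this (sketch; see
      fs/btrfs/extent_map.h for the authoritative declaration):

         void btrfs_drop_extent_map_range(struct btrfs_inode *inode,
                                          u64 start, u64 end, bool skip_pinned);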
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix missed extent on fsync after dropping extent maps · cef7820d
      Filipe Manana authored
      When dropping extent maps for a range, through btrfs_drop_extent_cache(),
      if we find an extent map that starts before our target range and/or ends
      beyond it, and we are not able to allocate extent maps for splitting that
      extent map, then we don't fail and simply remove the entire extent map
      from the inode's extent map tree.
      
      This is generally fine, because in case anyone needs to access the extent
      map, it can just load it again later from the respective file extent
      item(s) in the subvolume btree. However, if that extent map is new and is
      in the list of modified extents, then a fast fsync will miss the parts of
      the extent that were outside our range (that needed to be split),
      therefore not logging them. Fix that by marking the inode for a full
      fsync. This issue was introduced by commit 7014cdb4
      ("Btrfs: btrfs_drop_extent_cache should never fail"), back in 2012, which
      removed the BUG_ON()s triggered when the split extent map allocations
      failed; at that point the fast fsync path already existed but was very
      recent.
      
      Also, in the case where we could allocate extent maps for the split
      operations but then fail to add a split extent map to the tree, mark the
      inode for a full fsync as well. This is not supposed to ever fail, and we
      assert that, but in case assertions are disabled (CONFIG_BTRFS_ASSERT is
      not set), it's the correct thing to do to make sure a fast fsync will not
      miss a new extent.
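
      Illustratively, the failure paths now fall back to a full fsync along
      these lines (reduced sketch; the label name here is made up):

         split = alloc_extent_map();
         split2 = alloc_extent_map();
         if (!split || !split2) {
                 /*
                  * The whole extent map is about to be removed, so make sure a
                  * slower full fsync later picks up the file extent items that
                  * cover the pieces outside the dropped range.
                  */
                 btrfs_set_inode_full_sync(inode);
                 goto remove_em;
         }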
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove stale prototype of btrfs_write_inode · 3050dfa6
      Jeff Layton authored
      This function no longer exists; it was removed by commit 3c427693
      ("Btrfs: fix btrfs_write_inode vs delayed iput deadlock").
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: enable nowait async buffered writes · 926078b2
      Stefan Roesch authored
      Enable nowait async buffered writes in btrfs_do_write_iter() and
      btrfs_file_open().
      
      In this version the optimization is not enabled for encoded buffered
      writes. Encoded writes are issued through an ioctl, and io_uring
      currently does not support ioctls. This might be enabled in the future.
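
      The enabling part essentially amounts to advertising the capability when
      the file is opened, for example (sketch; the exact set of FMODE_* flags
      and the other hunks in btrfs_do_write_iter() are not shown):

         static int btrfs_file_open(struct inode *inode, struct file *filp)
         {
                 /*
                  * FMODE_BUF_WASYNC tells the VFS/io_uring that nowait buffered
                  * writes are supported, so they are no longer punted to an
                  * io-worker.
                  */
                 filp->f_mode |= FMODE_NOWAIT | FMODE_BUF_WASYNC;
                 return generic_file_open(inode, filp);
         }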
      
      Performance results:
      
        For fio the following results have been obtained with a queue depth of
        1 and 4k block size (runtime 600 secs):
      
                       sequential writes:
                       without patch           with patch      libaio     psync
        iops:              55k                    134k          117K       148K
        bw:               221MB/s                 538MB/s       469MB/s    592MB/s
        clat:           15286ns                    82ns         994ns     6340ns
      
      For an io depth of 1, the new patch improves throughput by more than two
      times (compared to the existing behavior, where buffered writes are
      processed by an io-worker process) and also considerably reduces the
      latency. To achieve the same or better performance with the existing
      code an io depth of 4 is required. Increasing the iodepth further does
      not lead to improvements.
      
      The tests have been run like this:
      
      ./fio --name=seq-writers --ioengine=psync --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=io_uring --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      ./fio --name=seq-writers --ioengine=libaio --iodepth=1 --rw=write \
        --bs=4k --direct=0 --size=100000m --time_based --runtime=600   \
        --numjobs=1 --filename=...
      
      Testing:
        This patch has been tested with xfstests, fsx, fio. xfstests shows no new
        diffs compared to running without the patch series.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: assert nowait mode is not used for some btree search functions · c922b016
      Stefan Roesch authored
      Adds nowait asserts to btree search functions which are not used by
      buffered IO and direct IO paths.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_buffered_write nowait compatible · 965f47ae
      Stefan Roesch authored
      We need to avoid unconditionally calling balance_dirty_pages_ratelimited()
      as it could wait for some reason. Use balance_dirty_pages_ratelimited_flags()
      with the BDP_ASYNC flag in case the buffered write is nowait, eventually
      returning -EAGAIN.

      Also move the call after the 'again' label. This can cause the function
      to be called a bit later, but this should have no impact in the real
      world.
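
      The call then takes roughly this form (sketch; the 'out' label is a
      hypothetical cleanup label):

         unsigned int bdp_flags = (iocb->ki_flags & IOCB_NOWAIT) ? BDP_ASYNC : 0;
         int ret;

         /* BDP_ASYNC asks the function not to sleep and to report -EAGAIN instead. */
         ret = balance_dirty_pages_ratelimited_flags(inode->i_mapping, bdp_flags);
         if (ret)
                 goto out;       /* propagated back to the nowait caller */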
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: plumb NOWAIT through the write path · 304e45ac
      Stefan Roesch authored
      Now that everything is set up for nowait, plumb NOWAIT through the write
      path.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make lock_and_cleanup_extent_if_need nowait compatible · 2fcab928
      Stefan Roesch authored
      Add the nowait parameter to lock_and_cleanup_extent_if_need(). If the
      nowait parameter is specified we try to lock the extent in nowait mode.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make prepare_pages nowait compatible · fc226000
      Stefan Roesch authored
      Add nowait parameter to the prepare_pages function. In case nowait is
      specified for an async buffered write request, do a nowait allocation or
      return -EAGAIN.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_check_nocow_lock nowait compatible · 80f9d241
      Josef Bacik authored
      Now that all the helpers used by btrfs_check_nocow_lock() handle nowait,
      add a nowait flag to btrfs_check_nocow_lock() so it can be used by the
      write path.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add btrfs_try_lock_ordered_range · d2c7a19f
      Josef Bacik authored
      For IOCB_NOWAIT we're going to want to use try lock on the extent lock,
      and simply bail if there's an ordered extent in the range because the
      only choice there is to wait for the ordered extent to complete.
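
      A sketch of the helper's logic (names of the locking helpers follow the
      btrfs code of that time; the real function may differ in details):

         bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end)
         {
                 struct btrfs_ordered_extent *ordered;

                 if (!try_lock_extent(&inode->io_tree, start, end))
                         return false;

                 ordered = btrfs_lookup_ordered_range(inode, start, end - start + 1);
                 if (!ordered)
                         return true;

                 /* Waiting for the ordered extent is the only option, which nowait cannot do. */
                 btrfs_put_ordered_extent(ordered);
                 unlock_extent(&inode->io_tree, start, end);
                 return false;
         }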
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add the ability to use NO_FLUSH for data reservations · 1daedb1d
      Josef Bacik authored
      In order to accommodate NOWAIT IOCB's we need to be able to do NO_FLUSH
      data reservations, so plumb this through the delalloc reservation
      system.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make can_nocow_extent nowait compatible · 26ce9114
      Josef Bacik authored
      If we have NOWAIT specified on our IOCB and we're writing into a
      PREALLOC or NOCOW extent then we need to be able to tell
      can_nocow_extent that we don't want to wait on any locks or metadata IO.
      Fix can_nocow_extent to allow for NOWAIT.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 26 Sep, 2022 16 commits