1. 05 Dec, 2022 40 commits
    • Josef Bacik's avatar
      btrfs: remove fs_info::pending_changes and related code · 55e5cfd3
      Josef Bacik authored
      Now that we're not using this code anywhere we can remove it as well as
      the member from fs_info.
      
      We don't have any mount options or on/off features that would utilize
      the pending infrastructure, the last one was inode_cache.
      There was a patchset [1] to enable some features from sysfs that would
      break things if it would be set immediately. In case we'll need that
      kind of logic again the patch can be reverted, but for the current use
      it can be replaced by the single state bit to do the commit.
      
      [1] https://lore.kernel.org/linux-btrfs/1422609654-19519-1-git-send-email-quwenruo@cn.fujitsu.com/Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add note ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      55e5cfd3
    • Josef Bacik's avatar
      btrfs: add a BTRFS_FS_NEED_TRANS_COMMIT flag · c52cc7b7
      Josef Bacik authored
      Currently we are only using fs_info->pending_changes to indicate that we
      need a transaction commit.  The original users for this were removed
      years ago and we don't have more usage in sight, so this is the only
      remaining reason to have this field.  Add a flag so we can remove this
      code.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c52cc7b7
    • Josef Bacik's avatar
      btrfs: move fs_info::flags enum to fs.h · 7966a6b5
      Josef Bacik authored
      These definitions are fs wide, take them out of ctree.h and put them in
      fs.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7966a6b5
    • Josef Bacik's avatar
      btrfs: move mount option definitions to fs.h · fc97a410
      Josef Bacik authored
      These are fs wide definitions and helpers, move them out of ctree.h and
      into fs.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fc97a410
    • Josef Bacik's avatar
      btrfs: convert incompat and compat flag test helpers to macros · 0d3a9cf8
      Josef Bacik authored
      These helpers use functions not defined in fs.h, they're simply
      accessors of the super block in fs_info, convert them to macros so
      that we don't have a weird dependency between fs.h and accessors.h.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d3a9cf8
    • Josef Bacik's avatar
      btrfs: move BTRFS_FS_STATE* definitions and helpers to fs.h · ec8eb376
      Josef Bacik authored
      We're going to use fs.h to hold fs wide related helpers and definitions,
      move the FS_STATE enum and related helpers to fs.h, and then update all
      files that need these definitions to include fs.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ec8eb376
    • Josef Bacik's avatar
      btrfs: push printk index code into their respective helpers · bbde07a4
      Josef Bacik authored
      The printk index work can be pushed into the printk helpers themselves,
      this allows us to further sanitize messages.h, removing the last
      include in the header itself.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bbde07a4
    • Josef Bacik's avatar
      btrfs: move the printk helpers out of ctree.h · 9b569ea0
      Josef Bacik authored
      We have a bunch of printk helpers that are in ctree.h.  These have
      nothing to do with ctree.c, so move them into their own header.
      Subsequent patches will cleanup the printk helpers.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b569ea0
    • Josef Bacik's avatar
      btrfs: move assert helpers out of ctree.h · e118578a
      Josef Bacik authored
      These call functions that aren't defined in, or will be moved out of,
      ctree.h  Move them to super.c where the other assert/error message code
      is defined. Drop the __noreturn attribute for btrfs_assertfail as
      objtool does not like it and fails with warnings like
      
        fs/btrfs/dir-item.o: warning: objtool: .text.unlikely: unexpected end of section
        fs/btrfs/xattr.o: warning: objtool: btrfs_setxattr() falls through to next function btrfs_setxattr_trans.cold()
        fs/btrfs/xattr.o: warning: objtool: .text.unlikely: unexpected end of section
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e118578a
    • Josef Bacik's avatar
      btrfs: move fs wide helpers out of ctree.h · c7f13d42
      Josef Bacik authored
      We have several fs wide related helpers in ctree.h.  The bulk of these
      are the incompat flag test helpers, but there are things such as
      btrfs_fs_closing() and the read only helpers that also aren't directly
      related to the ctree code.  Move these into a fs.h header, which will
      serve as the location for file system wide related helpers.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7f13d42
    • Wang Yugui's avatar
      btrfs: send add define for v2 buffer size · 875c627c
      Wang Yugui authored
      Add a define for the data buffer size (though the maximum size is not
      limited by it) BTRFS_SEND_BUF_SIZE_V2 so it's more visible.
      Signed-off-by: default avatarWang Yugui <wangyugui@e16-tech.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      875c627c
    • David Sterba's avatar
      btrfs: simplify generation check in btrfs_get_dentry · b307f06d
      David Sterba authored
      Callers that pass non-zero generation always want to perform the
      generation check, we can simply encode that in one parameter and drop
      check_generation. Add function documentation.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b307f06d
    • David Sterba's avatar
      btrfs: auto enable discard=async when possible · 63a7cb13
      David Sterba authored
      There's a request to automatically enable async discard for capable
      devices. We can do that, the async mode is designed to wait for larger
      freed extents and is not intrusive, with limits to iops, kbps or latency.
      
      The status and tunables will be exported in /sys/fs/btrfs/FSID/discard .
      
      The automatic selection is done if there's at least one discard capable
      device in the filesystem (not capable devices are skipped). Mounting
      with any other discard option will honor that option, notably mounting
      with nodiscard will keep it disabled.
      
      Link: https://lore.kernel.org/linux-btrfs/CAEg-Je_b1YtdsCR0zS5XZ_SbvJgN70ezwvRwLiCZgDGLbeMB=w@mail.gmail.com/Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63a7cb13
    • David Sterba's avatar
      btrfs: sysfs: convert remaining scnprintf to sysfs_emit · 467761f9
      David Sterba authored
      The sysfs_emit is the safe API for writing to the sysfs files,
      previously converted from scnprintf, there's one left to do in
      btrfs_read_policy_show.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      467761f9
    • Josef Bacik's avatar
      btrfs: do not panic if we can't allocate a prealloc extent state · 5a75034e
      Josef Bacik authored
      We sometimes have to allocate new extent states when clearing or setting
      new bits in an extent io tree.  Generally we preallocate this before
      taking the tree spin lock, but we can use this preallocated extent state
      sometimes and then need to try to do a GFP_ATOMIC allocation under the
      lock.
      
      Unfortunately sometimes this fails, and then we hit the BUG_ON() and
      bring the box down.  This happens roughly 20 times a week in our fleet.
      
      However the vast majority of callers use GFP_NOFS, which means that if
      this GFP_ATOMIC allocation fails, we could simply drop the spin lock, go
      back and allocate a new extent state with our given gfp mask, and begin
      again from where we left off.
      
      For the remaining callers that do not use GFP_NOFS, they are generally
      using GFP_NOWAIT, which still allows for some reclaim.  So allow these
      allocations to attempt to happen outside of the spin lock so we don't
      need to rely on GFP_ATOMIC allocations.
      
      This in essence creates an infinite loop for anything that isn't
      GFP_NOFS.  To address this we may want to migrate to using mempools for
      extent states so that we will always have emergency reserves in order to
      make our allocations.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5a75034e
    • Josef Bacik's avatar
      btrfs: remove unused unlock_extent_atomic · da2a071b
      Josef Bacik authored
      As of "btrfs: do not use GFP_ATOMIC in the read endio" we no longer have
      any users of unlock_extent_atomic, remove it.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da2a071b
    • Josef Bacik's avatar
      btrfs: do not use GFP_ATOMIC in the read endio · 48acc47d
      Josef Bacik authored
      We have done read endio in an async thread for a very, very long time,
      which makes the use of GFP_ATOMIC and unlock_extent_atomic() unneeded in
      our read endio path.  We've noticed under heavy memory pressure in our
      fleet that we can fail these allocations, and then often trip a
      BUG_ON(!allocation), which isn't an ideal outcome.  Begin to address
      this by simply not using GFP_ATOMIC, which will allow us to do things
      like actually allocate a extent state when doing
      set_extent_bits(UPTODATE) in the endio handler.
      
      End io handlers are not called in atomic context, besides we have been
      allocating failrec with GFP_NOFS so we'd notice there's a problem.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      48acc47d
    • Qu Wenruo's avatar
      btrfs: skip update of block group item if used bytes are the same · 7248e0ce
      Qu Wenruo authored
      [BACKGROUND]
      
      When committing a transaction, we will update block group items for all
      dirty block groups.
      
      But in fact, dirty block groups don't always need to update their block
      group items.
      It's pretty common to have a metadata block group which experienced
      several COW operations, but still have the same amount of used bytes.
      
      In that case, we may unnecessarily COW a tree block doing nothing.
      
      [ENHANCEMENT]
      
      This patch will introduce btrfs_block_group::commit_used member to
      remember the last used bytes, and use that new member to skip
      unnecessary block group item update.
      
      This would be more common for large filesystems, where metadata block
      group can be as large as 1GiB, containing at most 64K metadata items.
      
      In that case, if COW added and then deleted one metadata item near the
      end of the block group, then it's completely possible we don't need to
      touch the block group item at all.
      
      [BENCHMARK]
      
      The change itself can have quite a high chance (20~80%) to skip block
      group item updates in lot of workloads.
      
      As a result, it would result shorter time spent on
      btrfs_write_dirty_block_groups(), and overall reduce the execution time
      of the critical section of btrfs_commit_transaction().
      
      Here comes a fio command, which will do random writes in 4K block size,
      causing a very heavy metadata updates.
      
      fio --filename=$mnt/file --size=512M --rw=randwrite --direct=1 --bs=4k \
          --ioengine=libaio --iodepth=64 --runtime=300 --numjobs=4 \
          --name=random_write --fallocate=none --time_based --fsync_on_close=1
      
      The file size (512M) and number of threads (4) means 2GiB file size in
      total, but during the full 300s run time, my dedicated SATA SSD is able
      to write around 20~25GiB, which is over 10 times the file size.
      
      Thus after we fill the initial 2G, we should not cause much block group
      item updates.
      
      Please note, the fio numbers by themselves don't have much change, but
      if we look deeper, there is some reduced execution time, especially for
      the critical section of btrfs_commit_transaction().
      
      I added extra trace_printk() to measure the following per-transaction
      execution time:
      
      - Critical section of btrfs_commit_transaction()
        By re-using the existing update_commit_stats() function, which
        has already calculated the interval correctly.
      
      - The while() loop for btrfs_write_dirty_block_groups()
        Although this includes the execution time of btrfs_run_delayed_refs(),
        it should still be representative overall.
      
      Both result involves transid 7~30, the same amount of transaction
      committed.
      
      The result looks like this:
      
                            |      Before       |     After      |  Diff
      ----------------------+-------------------+----------------+--------
      Transaction interval  | 229247198.5       | 215016933.6    | -6.2%
      Block group interval  | 23133.33333       | 18970.83333    | -18.0%
      
      The change in block group item updates is more obvious, as skipped block
      group item updates also mean less delayed refs.
      
      And the overall execution time for that block group update loop is
      pretty small, thus we can assume the extent tree is already mostly
      cached.  If we can skip an uncached tree block, it would cause more
      obvious change.
      
      Unfortunately the overall reduction in commit transaction critical
      section is much smaller, as the block group item updates loop is not
      really the major part, at least not for the above fio script.
      
      But still we have a observable reduction in the critical section.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7248e0ce
    • David Sterba's avatar
      btrfs: convert __TRANS_* defines to enum bits · cc37ea61
      David Sterba authored
      The base transaction bits can be defined as bits in a contiguous
      sequence, although right now there's a hole from bit 1 to 8.
      
      The bits are used for btrfs_trans_handle::type, and there's another set
      of TRANS_STATE_* defines that are for btrfs_transaction::state. They are
      mutually exclusive though the hole in the sequence looks like was made
      for the states.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc37ea61
    • David Sterba's avatar
      btrfs: convert QGROUP_* defines to enum bits · e0a8b9a7
      David Sterba authored
      The defines/enums are used only for tracepoints and are not part of the
      on-disk format.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e0a8b9a7
    • David Sterba's avatar
      btrfs: convert EXTENT_* bits to enums · d3b4d0fd
      David Sterba authored
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d3b4d0fd
    • David Sterba's avatar
    • David Sterba's avatar
      c7321b76
    • David Sterba's avatar
      btrfs: add helper for bit enumeration · d549ff7b
      David Sterba authored
      Define helper macro that can be used in enum {} to utilize the automatic
      increment to define all bits without directly defining the values or
      using additional linear bits.
      
      1. capture the sequence value, N
      2. use the value to define the given enum with N-th bit set
      3. reset the sequence back to N
      
      Use for enums that do not require fixed values for symbolic names (like
      for on-disk structures):
      
      enum {
      	ENUM_BIT(FIRST),
      	ENUM_BIT(SECOND),
      	ENUM_BIT(THIRD)
      };
      
      Where the values would be 0x1, 0x2 and 0x4.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d549ff7b
    • Qu Wenruo's avatar
      btrfs: make module init/exit match their sequence · 5565b8e0
      Qu Wenruo authored
      [BACKGROUND]
      In theory init_btrfs_fs() and exit_btrfs_fs() should match their
      sequence, thus normally they should look like this:
      
          init_btrfs_fs()   |   exit_btrfs_fs()
      ----------------------+------------------------
          init_A();         |
          init_B();         |
          init_C();         |
                            |   exit_C();
                            |   exit_B();
                            |   exit_A();
      
      So is for the error path of init_btrfs_fs().
      
      But it's not the case, some exit functions don't match their init
      functions sequence in init_btrfs_fs().
      
      Furthermore in init_btrfs_fs(), we need to have a new error label for
      each new init function we added.  This is not really expandable,
      especially recently we may add several new functions to init_btrfs_fs().
      
      [ENHANCEMENT]
      The patch will introduce the following things to enhance the situation:
      
      - struct init_sequence
        Just a wrapper of init and exit function pointers.
      
        The init function must use int type as return value, thus some init
        functions need to be updated to return 0.
      
        The exit function can be NULL, as there are some init sequence just
        outputting a message.
      
      - struct mod_init_seq[] array
        This is a const array, recording all the initialization we need to do
        in init_btrfs_fs(), and the order follows the old init_btrfs_fs().
      
      - bool mod_init_result[] array
        This is a bool array, recording if we have initialized one entry in
        mod_init_seq[].
      
        The reason to split mod_init_seq[] and mod_init_result[] is to avoid
        section mismatch in reference.
      
        All init function are in .init.text, but if mod_init_seq[] records
        the @initialized member it can no longer be const, thus will be put
        into .data section, and cause modpost warning.
      
      For init_btrfs_fs() we just call all init functions in their order in
      mod_init_seq[] array, and after each call, setting corresponding
      mod_init_result[] to true.
      
      For exit_btrfs_fs() and error handling path of init_btrfs_fs(), we just
      iterate mod_init_seq[] in reverse order, and skip all uninitialized
      entry.
      
      With this patch, init_btrfs_fs()/exit_btrfs_fs() will be much easier to
      expand and will always follow the strict order.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5565b8e0
    • Filipe Manana's avatar
      btrfs: remove gfp_t flag from btrfs_tree_mod_log_insert_key() · 33cff222
      Filipe Manana authored
      All callers of btrfs_tree_mod_log_insert_key() are now passing a GFP_NOFS
      flag to it, so remove the flag from it and from alloc_tree_mod_elem() and
      use it directly within alloc_tree_mod_elem().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      33cff222
    • Filipe Manana's avatar
      btrfs: switch GFP_ATOMIC to GFP_NOFS when fixing up low keys · 879b2221
      Filipe Manana authored
      When fixing up the first key of each node above the current level, at
      fixup_low_keys(), we are doing a GFP_ATOMIC allocation for inserting an
      operation record for the tree mod log. However we can do just fine with
      GFP_NOFS nowadays. The need for GFP_ATOMIC was for the old days when we
      had custom locks with spinning behaviour for extent buffers and we were
      in spinning mode while at fixup_low_keys(). Now we use rw semaphores for
      extent buffer locks, so we can safely use GFP_NOFS.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      879b2221
    • Boris Burkov's avatar
      btrfs: re-check reclaim condition in reclaim worker · 81531225
      Boris Burkov authored
      I have observed the following case play out and lead to unnecessary
      relocations:
      
      1. write a file across multiple block groups
      2. delete the file
      3. several block groups fall below the reclaim threshold
      4. reclaim the first, moving extents into the others
      5. reclaim the others which are now actually very full, leading to poor
         reclaim behavior with lots of writing, allocating new block groups,
         etc.
      
      I believe the risk of missing some reasonable reclaims is worth it
      when traded off against the savings of avoiding overfull reclaims.
      
      Going forward, it could be interesting to make the check more advanced
      (zoned aware, fragmentation aware, etc...) so that it can be a really
      strong signal both at extent delete and reclaim time.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      81531225
    • Boris Burkov's avatar
      btrfs: skip reclaim if block_group is empty · cc4804bf
      Boris Burkov authored
      As we delete extents from a block group, at some deletion we cross below
      the reclaim threshold. It is possible we are still in the middle of
      deleting more extents and might soon hit 0. If the block group is empty
      by the time the reclaim worker runs, we will still relocate it.
      
      This works just fine, as relocating an empty block group ultimately
      results in properly deleting it. However, we have more direct ways of
      removing empty block groups in the cleaner thread. Those are either
      async discard or the unused_bgs list. In fact, when we decide whether to
      relocate a block group during extent deletion, we do check for emptiness
      and prefer the discard/unused_bgs mechanisms when possible.
      
      Not using relocation for this case reduces some modest overhead from
      empty bg relocation:
      
      - extra transactions
      - extra metadata use/churn for creating relocation metadata
      - trying to read the extent tree to look for extents (and in this case
        finding none)
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cc4804bf
    • Filipe Manana's avatar
      btrfs: avoid unnecessary resolution of indirect backrefs during fiemap · 6976201f
      Filipe Manana authored
      During fiemap, when determining if a data extent is shared or not, if we
      don't find the extent is directly shared, then we need to determine if
      it's shared through subtrees. For that we need to resolve the indirect
      reference we found in order to figure out the path in the inode's fs tree,
      which is a path starting at the fs tree's root node and going down to the
      leaf that contains the file extent item that points to the data extent.
      We then proceed to determine if any extent buffer in that path is shared
      with other trees or not.
      
      However when the generation of the data extent is more recent than the
      last generation used to snapshot the root, we don't need to determine
      the path, since the data extent can not be shared through snapshots.
      For this case we currently still determine the leaf of that path (at
      find_parent_nodes(), but then stop determining the other nodes in the
      path (at btrfs_is_data_extent_shared()) as it's pointless.
      
      So do the check of the data extent's generation earlier, at
      find_parent_nodes(), before trying to resolve the indirect reference to
      determine the leaf in the path. This saves us from doing one expensive
      b+tree search in the fs tree of our target inode, as well as other minor
      work.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test-fiemap.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      Before applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1285 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 742 milliseconds (metadata cached)
      
      After applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 689 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 393 milliseconds (metadata cached)
      
      That's a -46.4% total reduction for the metadata not cached case, and
      a -47.0% reduction for the cached metadata case.
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6976201f
    • Filipe Manana's avatar
      btrfs: avoid duplicated resolution of indirect backrefs during fiemap · 877c1476
      Filipe Manana authored
      During fiemap, when determining if a data extent is shared or not, if we
      don't find the extent is directly shared, then we need to determine if
      it's shared through subtrees. For that we need to resolve the indirect
      reference we found in order to figure out the path in the inode's fs tree,
      which is a path starting at the fs tree's root node and going down to the
      leaf that contains the file extent item that points to the data extent.
      We then proceed to determine if any extent buffer in that path is shared
      with other trees or not.
      
      Currently whenever we find the data extent that a file extent item points
      to is not directly shared, we always resolve the path in the fs tree, and
      then check if any extent buffer in the path is shared. This is a lot of
      work and when we have file extent items that belong to the same leaf, we
      have the same path, so we only need to calculate it once.
      
      This change does that, it keeps track of the current and previous leaf,
      and when we find that a data extent is not directly shared, we try to
      compute the fs tree path only once and then use it for every other file
      extent item in the same leaf, using the existing cached path result for
      the leaf as long as the cache results are valid.
      
      This saves us from doing expensive b+tree searches in the fs tree of our
      target inode, as well as other minor work.
      
      The following test was run on a non-debug kernel (Debian's default kernel
      config):
      
         $ cat test-with-snapshots.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         # Use compression to quickly create files with a lot of extents
         # (each with a size of 128K).
         mount -o compress=lzo $DEV $MNT
      
         # 40G gives 327680 extents, each with a size of 128K.
         xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
      
         # Add some more files to increase the size of the fs and extent
         # trees (in the real world there's a lot of files and extents
         # from other files).
         xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
         xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
         xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3
      
         # Create a snapshot so all the extents become indirectly shared
         # through subtrees, with a generation less than or equals to the
         # generation used to create the snapshot.
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         umount $MNT
         mount -o compress=lzo $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata not cached)"
         echo
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds (metadata cached)"
      
         umount $MNT
      
      Result before applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 1204 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 729 milliseconds (metadata cached)
      
      Result after applying this patch:
      
         (...)
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 732 milliseconds (metadata not cached)
      
         /mnt/sdi/foobar: 327680 extents found
         fiemap took 421 milliseconds (metadata cached)
      
      That's a -46.1% total reduction for the metadata not cached case, and
      a -42.2% reduction for the cached metadata case.
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      877c1476
    • Filipe Manana's avatar
      btrfs: move up backref sharedness cache store and lookup functions · 583f4ac5
      Filipe Manana authored
      Move the static functions to lookup and store sharedness check of an
      extent buffer to a location above find_all_parents(), because in the
      next patch the lookup function will be used by find_all_parents().
      The store function is also moved just because it's the counter part
      to the lookup function and it's best to have their definitions close
      together.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      583f4ac5
    • Filipe Manana's avatar
      btrfs: cache sharedness of the last few data extents during fiemap · 73e339e6
      Filipe Manana authored
      During fiemap we process all the file extent items of an inode, by their
      file offset order (left to right b+tree order), and then check if the data
      extent they point at is shared or not. Until now we didn't cache those
      results, we only did it for b+tree nodes/leaves since for each unique
      b+tree path we have access to hundreds of file extent items. However, it
      is also common to repeat checking the sharedness of a particular data
      extent in a very short time window, and the cases that lead to that are
      the following:
      
      1) COW writes.
      
         If have a file extent item like this:
      
                        [ bytenr X, offset = 0, num_bytes = 512K ]
         file offset    0                                        512K
      
         Then a 4K write into file offset 64K happens, we end up with the
         following file extent item layout:
      
                        [ bytenr X, offset = 0, num_bytes = 64K ]
         file offset    0                                       64K
      
                        [ bytenr Y, offset = 0, num_bytes = 4K ]
         file offset   64K                                     68K
      
                        [ bytenr X, offset = 68K, num_bytes = 444K ]
         file offset   68K                                         512K
      
         So during fiemap we well check for the sharedness of the data extent
         with bytenr X twice. Typically for COW writes and for at least
         moderately updated files, we end up with many file extent items that
         point to different sections of the same data extent.
      
      2) Writing into a NOCOW file after a snapshot is taken.
      
         This happens if the target extent was created in a generation older
         than the generation where the last snapshot for the root (the tree the
         inode belongs to) was made.
      
         This leads to a scenario like the previous one.
      
      3) Writing into sections of a preallocated extent.
      
         For example if a file has the following layout:
      
         [ bytenr X, offset = 0, num_bytes = 1M, type = prealloc ]
         0                                                       1M
      
         After doing a 4K write into file offset 0 and another 4K write into
         offset 512K, we get the following layout:
      
            [ bytenr X, offset = 0, num_bytes = 4K, type = regular ]
            0                                                      4K
      
            [ bytenr X, offset = 4K, num_bytes = 508K, type = prealloc ]
           4K                                                          512K
      
            [ bytenr X, offset = 512K, num_bytes = 4K, type = regular ]
         512K                                                         516K
      
            [ bytenr X, offset = 516K, num_bytes = 508K, type = prealloc ]
         516K                                                            1M
      
         So we end up with 4 consecutive file extent items pointing to the data
         extent at bytenr X.
      
      4) Hole punching in the middle of an extent.
      
         For example if a file has the following file extent item:
      
         [ bytenr X, offset = 0, num_bytes = 8M ]
         0                                      8M
      
         And then hole is punched for the file range [4M, 6M[, we our file
         extent item split into two:
      
         [ bytenr X, offset = 0, num_bytes = 4M  ]
         0                                       4M
      
         [ 2M hole, implicit or explicit depending on NO_HOLES feature ]
         4M                                                            6M
      
         [ bytenr X, offset = 6M, num_bytes = 2M  ]
         6M                                       8M
      
         Again, we end up with two file extent items pointing to the same
         data extent.
      
      5) When reflinking (clone and deduplication) within the same file.
         This is probably the least common case of all.
      
      In cases 1, 2, 4 and 4, when we have multiple file extent items that point
      to the same data extent, their distance is usually short, typically
      separated by a few slots in a b+tree leaf (or across sibling leaves). For
      case 5, the distance can vary a lot, but it's typically the less common
      case.
      
      This change caches the result of the sharedness checks for data extents,
      but only for the last 8 extents that we notice that our inode refers to
      with multiple file extent items. Whenever we want to check if a data
      extent is shared, we lookup the cache which consists of doing a linear
      scan of an 8 elements array, and if we find the data extent there, we
      return the result and don't check the extent tree and delayed refs.
      
      The array/cache is small so that doing the search has no noticeable
      negative impact on the performance in case we don't have file extent items
      within a distance of 8 slots that point to the same data extent.
      
      Slots in the cache/array are overwritten in a simple round robin fashion,
      as that approach fits very well.
      
      Using this simple approach with only the last 8 data extents seen is
      effective as usually when multiple file extents items point to the same
      data extent, their distance is within 8 slots. It also uses very little
      memory and the time to cache a result or lookup the cache is negligible.
      
      The following test was run on non-debug kernel (Debian's default kernel
      config) to measure the impact in the case of COW writes (first example
      given above), where we run fiemap after overwriting 33% of the blocks of
      a file:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdi
         MNT=/mnt/sdi
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         FILE_SIZE=$((1 * 1024 * 1024  * 1024))
      
         # Create the file full of 1M extents.
         xfs_io -f -s -c "pwrite -b 1M -S 0xab 0 $FILE_SIZE" $MNT/foobar
      
         block_count=$((FILE_SIZE / 4096))
         # Overwrite about 33% of the file blocks.
         overwrite_count=$((block_count / 3))
      
         echo -e "\nOverwriting $overwrite_count 4K blocks (out of $block_count)..."
         RANDOM=123
         for ((i = 1; i <= $overwrite_count; i++)); do
             off=$(((RANDOM % block_count) * 4096))
             xfs_io -c "pwrite -S 0xcd $off 4K" $MNT/foobar > /dev/null
             echo -ne "\r$i blocks overwritten..."
         done
         echo -e "\n"
      
         # Unmount and mount to clear all cached metadata.
         umount $MNT
         mount $DEV $MNT
      
         start=$(date +%s%N)
         filefrag $MNT/foobar
         end=$(date +%s%N)
         dur=$(( (end - start) / 1000000 ))
         echo "fiemap took $dur milliseconds"
      
         umount $MNT
      
      Result before applying this patch:
      
         fiemap took 128 milliseconds
      
      Result after applying this patch:
      
         fiemap took 92 milliseconds   (-28.1%)
      
      The test is somewhat limited in the sense the gains may be higher in
      practice, because in the test the filesystem is small, so we have small
      fs and extent trees, plus there's no concurrent access to the trees as
      well, therefore no lock contention there.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      73e339e6
    • Filipe Manana's avatar
      btrfs: remove useless logic when finding parent nodes · 56f5c199
      Filipe Manana authored
      At find_parent_nodes(), at its last step, when iterating over all direct
      references, we are checking if we have a share context and if we have
      a reference with a different root from the one in the share context.
      However that logic is pointless because of two reasons:
      
      1) After the previous patch in the series (subject "btrfs: remove roots
         ulist when checking data extent sharedness"), the roots argument is
         always NULL when using a share check context (struct share_check), so
         this code is never triggered;
      
      2) Even before that previous patch, we could not hit this code because
         if we had a reference with a root different from the one in our share
         context, then we would have exited earlier when doing either of the
         following:
      
            - Adding a second direct ref to the direct refs red black tree
              resulted in extent_is_shared() returning true when called from
              add_direct_ref() -> add_prelim_ref(), after processing delayed
              references or while processing references in the extent tree;
      
            - When adding a second reference to the indirect refs red black
              tree (same as above, extent_is_shared() returns true);
      
            - If we only have one indirect reference and no direct references,
              then when resolving it at resolve_indirect_refs() we immediately
              return that the target extent is shared, therefore never reaching
              that loop that iterates over all direct references at
              find_parent_nodes();
      
            - If we have 1 indirect reference and 1 direct reference, then we
              also exit early because extent_is_shared() ends up returning true
              when called through add_prelim_ref() (by add_direct_ref() or
              add_indirect_ref()) or add_delayed_refs(). Same applies as when
              having a combination of direct, indirect and indirect with missing
              key references.
      
         This logic had been obsoleted since commit 3ec4d323 ("btrfs:
         allow backref search checks for shared extents"), which introduced the
         early exits in case an extent is shared.
      
      So just remove that logic, and assert at find_parent_nodes() that when we
      have a share context we don't have a roots ulist and that we haven't found
      the extent to be directly shared after processing delayed references and
      all references from the extent tree.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      56f5c199
    • Filipe Manana's avatar
      btrfs: remove roots ulist when checking data extent sharedness · b6296858
      Filipe Manana authored
      Currently btrfs_is_data_extent_shared() is passing a ulist for the roots
      argument of find_parent_nodes(), however it does not use that ulist for
      anything and for this context that list always ends up with at most one
      element.
      
      Since find_parent_nodes() is able to deal with a NULL ulist for its roots
      argument, make btrfs_is_data_extent_shared() pass it NULL and avoid the
      burden of allocating memory for the unnused roots ulist, initializing it,
      releasing it and allocating one struct ulist_node for it during the call
      to find_parent_nodes().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b6296858
    • Filipe Manana's avatar
      btrfs: move ulists to data extent sharedness check context · 84a7949d
      Filipe Manana authored
      When calling btrfs_is_data_extent_shared() we pass two ulists that were
      allocated by the caller. This is because the single caller, fiemap, calls
      btrfs_is_data_extent_shared() multiple times and the ulists can be reused,
      instead of allocating new ones before each call and freeing them after
      each call.
      
      Now that we have a context structure/object that we pass to
      btrfs_is_data_extent_shared(), we can move those ulists to it, and hide
      their allocation and the context's allocation in a helper function, as
      well as the freeing of the ulists and the context object. This allows to
      reduce the number of parameters passed to btrfs_is_data_extent_shared(),
      the need to pass the ulists from extent_fiemap() to fiemap_process_hole()
      and having the caller deal with allocating and releasing the ulists.
      
      Also rename one of the ulists from 'tmp' / 'tmp_ulist' to 'refs', since
      that's a much better name as it reflects what the list is used for (and
      matching the argument name for find_parent_nodes()).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      84a7949d
    • Filipe Manana's avatar
      btrfs: turn the backref sharedness check cache into a context object · 61dbb952
      Filipe Manana authored
      Right now we are using a struct btrfs_backref_shared_cache to pass state
      across multiple btrfs_is_data_extent_shared() calls. The structure's name
      closely follows its current purpose, which is to cache previous checks
      for the sharedness of metadata extents. However we will start using the
      structure for more things other than caching sharedness checks, so rename
      it to struct btrfs_backref_share_check_ctx.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      61dbb952
    • Filipe Manana's avatar
      btrfs: directly pass the inode to btrfs_is_data_extent_shared() · ceb707da
      Filipe Manana authored
      Currently we pass a root and an inode number as arguments for
      btrfs_is_data_extent_shared() and the inode number is always from an
      inode that belongs to that root (it wouldn't make sense otherwise).
      In every context that we call btrfs_is_data_extent_shared() (fiemap only),
      we have an inode available, so directly pass the inode to the function
      instead of a root and inode number. This reduces the number of parameters
      and it makes the function's signature conform to most other functions we
      have.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ceb707da
    • Filipe Manana's avatar
      btrfs: remove checks for a 0 inode number during backref walking · a0a5472a
      Filipe Manana authored
      When doing backref walking to determine if an extent is shared, we are
      testing if the inode number, stored in the 'inum' field of struct
      share_check, is 0. However that can never be case, since the all instances
      of the structure are created at btrfs_is_data_extent_shared(), which
      always initializes it with the inode number from a fs tree (and the number
      for any inode from any tree can never be 0). So remove the checks.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a0a5472a
    • Filipe Manana's avatar
      btrfs: remove checks for a root with id 0 during backref walking · c9024219
      Filipe Manana authored
      When doing backref walking to determine if an extent is shared, we are
      testing the root_objectid of the given share_check struct is 0, but that
      is an impossible case, since btrfs_is_data_extent_shared() always
      initializes the root_objectid field with the id of the given root, and
      no root can have an objectid of 0. So remove those checks.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9024219