1. 15 Feb, 2023 5 commits
    • Filipe Manana's avatar
      btrfs: assert commit root semaphore is held when accessing backref cache · 4e4488d4
      Filipe Manana authored
      During fiemap, when accessing the cache that stores the sharedness of an
      extent, we need to either be holding a transaction handle or the commit
      root semaphore. I left comments about this in the comment that precedes
      store_backref_shared_cache() and lookup_backref_shared_cache(), but have
      actually not enforced it through assertions. So assert that the commit
      root semaphore is held if we are not holding a transaction handle.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4e4488d4
    • Boris Burkov's avatar
      btrfs: hold block group refcount during async discard · 2b5463fc
      Boris Burkov authored
      Async discard does not acquire the block group reference count while it
      holds a reference on the discard list. This is generally OK, as the
      paths which destroy block groups tend to try to synchronize on
      cancelling async discard work. However, relying on cancelling work
      requires careful analysis to be sure it is safe from races with
      unpinning scheduling more work.
      
      While I am unable to find a race with unpinning in the current code for
      either the unused bgs or relocation paths, I believe we have one in an
      older version of auto relocation in a Meta internal build. This suggests
      that this is in fact an error prone model, and could be fragile to
      future changes to these bg deletion paths.
      
      To make this ownership more clear, add a refcount for async discard. If
      work is queued for a block group, its refcount should be incremented,
      and when work is completed or canceled, it should be decremented.
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b5463fc
    • Filipe Manana's avatar
      btrfs: send: cache utimes operations for directories if possible · 3e49363b
      Filipe Manana authored
      Whenever we add or remove an entry to a directory, we issue an utimes
      command for the directory. If we add 1000 entries to a directory (create
      1000 files under it or move 1000 files to it), then we issue the same
      utimes command 1000 times, which increases the send stream size, results
      in more pipe IO, one search in the send b+tree, allocating one path for
      the search, etc, as well as making the receiver do a system call for each
      duplicated utimes command.
      
      We also issue an utimes command when we create a new directory, but later
      we might add entries to it corresponding to inodes with an higher inode
      number, so it's pointless to issue the utimes command before we create
      the last inode under the directory.
      
      So use a lru cache to track directories for which we must send a utimes
      command. When we need to remove an entry from the cache, we issue the
      utimes command for the respective directory. When finishing the send
      operation, we go over each cache element and issue the respective utimes
      command. Finally the caching is entirely optional, just a performance
      optimization, meaning that if we fail to cache (due to memory allocation
      failure), we issue the utimes command right away, that is, we fallback
      to the previous, unoptimized, behaviour.
      
      This patch belongs to a patchset comprised of the following patches:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      
      The following test was run before and after applying the whole patchset,
      and on a non-debug kernel (Debian's default kernel config):
      
         #!/bin/bash
      
         MNT=/mnt/sdi
         DEV=/dev/sdi
      
         mkfs.btrfs -f $DEV > /dev/null
         mount $DEV $MNT
      
         mkdir $MNT/A
         for ((i = 1; i <= 20000; i++)); do
             echo -n > $MNT/A/file_$i
         done
      
         btrfs subvolume snapshot -r $MNT $MNT/snap1
      
         mkdir $MNT/B
         for ((i = 20000; i <= 40000; i++)); do
             echo -n > $MNT/B/file_$i
         done
      
         mv $MNT/A/file_* $MNT/B/
      
         btrfs subvolume snapshot -r $MNT $MNT/snap2
      
         start=$(date +%s%N)
         btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
         end=$(date +%s%N)
      
         dur=$(( (end - start) / 1000000 ))
         echo "Incremental send took $dur milliseconds"
      
         umount $MNT
      
      Before the whole patchset: 18408 milliseconds
      After the whole patchset:   1942 milliseconds  (9.5x speedup)
      
      Using 60000 files instead of 40000:
      
      Before the whole patchset: 39764 milliseconds
      After the whole patchset:   3076 milliseconds  (12.9x speedup)
      
      Using 20000 files instead of 40000:
      
      Before the whole patchset:  5072 milliseconds
      After the whole patchset:    916 milliseconds  (5.5x speedup)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3e49363b
    • Filipe Manana's avatar
      btrfs: send: update size of roots array for backref cache entries · ace79df8
      Filipe Manana authored
      Currently we limit the size of the roots array, for backref cache entries,
      to 12 elements. This is because that number is enough for most cases and
      to make the backref cache entry size to be exactly 128 bytes, so that
      memory is allocated from the kmalloc-128 slab and no space is wasted.
      
      However recent changes in the series refactored the backref cache to be
      more generic and allow it to be reused for other purposes, which resulted
      in increasing the size of the embedded structure btrfs_lru_cache_entry in
      order to allow for supporting inode numbers as keys on 32 bits system and
      allow multiple generations per key. This resulted in increasing the size
      of struct backref_cache_entry from 128 bytes to 152 bytes. Since the cache
      entries are allocated with kmalloc(), it means we end up using the slab
      kmalloc-192, so we end up wasting 40 bytes of memory. So bump the size of
      the roots array from 12 elements to 17 elements, so we end up using 192
      bytes for each backref cache entry.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ace79df8
    • Filipe Manana's avatar
      btrfs: send: use the lru cache to implement the name cache · c48545de
      Filipe Manana authored
      The name cache in send is basically a lru cache implemented with a radix
      tree and linked lists, very similar to the lru cache module which is used
      for the send backref cache and the cache of previously created directories
      during a send operation. So remove all the custom caching code for the
      name cache and make it use the lru cache instead.
      
      One particular detail to note is that the current cache behaves a bit
      differently when it comes to eviction of entries. Namely when after
      inserting a new name in the cache, if the cache now has 256 entries, we
      evict the last 128 LRU entries. The lru_cache.{c,h} module behaves a bit
      differently in that once we reach the cache limit, we evict a single LRU
      entry. In practice this doesn't make much difference, but it's actually
      better to evict just one entry instead of half of the entries, as there's
      always a chance we will need a name stored in one of that last 128 removed
      entries.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c48545de
  2. 13 Feb, 2023 35 commits
    • Filipe Manana's avatar
      btrfs: add an api to delete a specific entry from the lru cache · d588adae
      Filipe Manana authored
      In order to replace the open coded name cache in send with the lru cache,
      we need an API for the lru cache to delete a specific entry for which we
      did a previous lookup. This adds the API for it, and a next patch in the
      series will use it.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d588adae
    • Filipe Manana's avatar
      btrfs: allow a generation number to be associated with lru cache entries · 0da0c560
      Filipe Manana authored
      This allows an optional generation number to be associated to each entry
      of the lru cache. Entries with the same key but different generations, are
      stored in the linked list to which the maple tree points to. This is meant
      to be used when there's a small number of different generations, so the
      impact of searching a linked list is negligible. The goal is to get rid of
      the open coded name cache in the send code (which uses a radix tree and
      a similar linked list of values/entries) and use instead the lru cache
      module. For that particular use case we have at most 2 generations that
      are associated to each key (inode number): one generation for the send
      root and another generation for the parent root. The actual migration of
      the send name cache is done in the next patch in the series.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0da0c560
    • Filipe Manana's avatar
      btrfs: send: cache information about created directories · e8a7f49d
      Filipe Manana authored
      During an incremental send, when processing the reference for an inode
      we need to check if the directory where the new reference is located was
      already created before creating the new reference. This check, which is
      done by the helper did_create_dir(), can be expensive if the directory
      has many entries, since it consists in searching the send root's b+tree
      and visiting every single dir index key until we either find one which
      points to an inode with a number smaller than the current inode's number
      or until we visited all index keys. So it doesn't scale well for very
      large directories.
      
      So improve on this by caching created directories using a lru cache, and
      limiting its size to 64 entries, which results in using at most 4096
      bytes of memory. The caching is optional, if we fail to allocate memory,
      we just proceed as before and use the existing slower path.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e8a7f49d
    • Filipe Manana's avatar
      btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems · 6273ee62
      Filipe Manana authored
      The lru cache is backed by a maple tree, which uses the unsigned long
      type for keys, and that type has a width of 32 bits on 32 bits systems
      and a width of 64 bits on 64 bits systems.
      
      Currently there is only one user of the lru cache, the send backref cache,
      which uses a sector number as a key, a logical address right shifted by
      fs_info->sectorsize_bits, so a 32 bits width is not yet a problem (the
      same happens with the radix tree we use to track extent buffers,
      fs_info->buffer_radix).
      
      However the next patches in the series will start using the lru cache for
      cases where inode numbers are the keys, and the inode numbers are always
      64 bits, even if we are running on a 32 bits system.
      
      So adapt the lru cache to allow multiple values under the same key, by
      having the maple tree store a head entry that points to a list of entries
      instead of pointing to a single entry. This is a similar approach to what
      we currently do for the name cache in send (which uses a radix tree that
      has indexes with an unsigned long type as well), and will allow later to
      use the lru cache for the send name cache as well.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6273ee62
    • Filipe Manana's avatar
      btrfs: send: genericize the backref cache to allow it to be reused · 90b90d4a
      Filipe Manana authored
      The backref cache is a cache backed by a maple tree and a linked list to
      keep track of temporal access to cached entries (the LRU entry always at
      the head of the list). This type of caching method is going to be useful
      in other scenarios, so make the cache implementation more generic and
      move it into its own header and source files.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      90b90d4a
    • Filipe Manana's avatar
      btrfs: send: initialize all the red black trees earlier · d307d2f3
      Filipe Manana authored
      After we allocate the send context object and before we initialize all
      the red black trees, we can jump to the 'out' label if some errors happen,
      and then under the 'out' label we use RB_EMPTY_ROOT() against some of the
      those trees, which we have not yet initialized. This happens to work out
      ok because the send context object was initialized to zeroes with kzalloc
      and the RB_ROOT initializer just happens to have the following definition:
      
          #define RB_ROOT (struct rb_root) { NULL, }
      
      But it's really neither clean nor a good practice as RB_ROOT is supposed
      to be opaque and in case it changes or we change those red black trees to
      some other data structure, it leaves us in a precarious situation.
      
      So initialize all the red black trees immediately after allocating the
      send context and before any jump into the 'out' label.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d307d2f3
    • Filipe Manana's avatar
      btrfs: send: iterate waiting dir move rbtree only once when processing refs · 8c139e1d
      Filipe Manana authored
      When processing the new references for an inode, we unnecessarily iterate
      twice the waiting dir moves rbtree, once with is_waiting_for_move() and
      if we found an entry in the rbtree, we iterate it again with a call to
      get_waiting_dir_move(). This is pointless, we can make this simpler and
      more efficient by calling only get_waiting_dir_move(), so just do that.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8c139e1d
    • Filipe Manana's avatar
      btrfs: send: reduce searches on parent root when checking if dir can be removed · 474e4761
      Filipe Manana authored
      During an incremental send, every time we remove a reference (dentry) for
      an inode and the parent directory does not exists anymore in the send
      root, we go check if we can remove the directory by making a call to
      can_rmdir(). This helper can only return true (value 1) if all dentries
      were already removed, and for that it always does a search on the parent
      root for dir index keys - if it finds any dentry referring to an inode
      with a number higher then the inode currently being processed, then the
      directory can not be removed and it must return false (value 0).
      
      However that means if a directory that was deleted had 1000 dentries, and
      each one pointed to an inode with a number higher then the number of the
      directory's inode, we end up doing 1000 searches on the parent root.
      Typically files are created in a directory after the directory was created
      and therefore they get an higher inode number than the directory. It's
      also common to have the each dentry pointing to an inode with a higher
      number then the inodes the previous dentries point to, for example when
      creating a series of files inside a directory, a very common pattern.
      
      So improve on that by having the first call to can_rmdir() for a directory
      to check the number of the inode that the last dentry points to and cache
      that inode number in the orphan dir structure. Then every subsequent call
      to can_rmdir() can avoid doing a search on the parent root if the number
      of the inode currently being processed is smaller than cached inode number
      at the directory's orphan dir structure.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      474e4761
    • Filipe Manana's avatar
      btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir() · 78cf1a95
      Filipe Manana authored
      At can_rmdir() we start by searching the orphan dirs rbtree for an orphan
      dir object for the target directory. Later when iterating over the dir
      index keys, if we find that any dir entry points to inode for which there
      is a pending dir move or the inode was not yet processed, we exit because
      we can't remove the directory yet. However we end up always calling
      add_orphan_dir_info(), which will iterate again the rbtree and if there is
      already an orphan dir object (created by the first call to can_rmdir()),
      it returns the existing object. This is unnecessary work because in case
      there is already an existing orphan dir object, we got a reference to it
      at the start of can_rmdir(). So skip the call to add_orphan_dir_info()
      if we already have a reference for an orphan dir object.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      78cf1a95
    • Filipe Manana's avatar
      btrfs: send: avoid duplicated orphan dir allocation and initialization · d921b9cf
      Filipe Manana authored
      At can_rmdir() we are allocating and initializing an orphan dir object
      twice. This can be deduplicated outside of the loop that iterates over
      the dir index keys. So deduplicate that code, even because other patch
      in the series will need to add more initialization code and another one
      will add one more condition.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d921b9cf
    • Filipe Manana's avatar
      btrfs: send: remove send_progress argument from can_rmdir() · 24970ccb
      Filipe Manana authored
      All callers of can_rmdir() pass sctx->cur_ino as the value for the
      send_progress argument, so remove the argument and directly use
      sctx->cur_ino.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      24970ccb
    • Filipe Manana's avatar
      btrfs: send: avoid extra b+tree searches when checking reference overrides · 498581f3
      Filipe Manana authored
      During an incremental send, when processing the new references of an inode
      (either it's a new inode or an existing one renamed/moved), he will search
      the b+tree of the send or parent roots in order to find out the inode item
      of the parent directory and extract its generation. However we are doing
      that search twice, once with is_inode_existent() -> get_cur_inode_state()
      and then again at did_overwrite_ref() or will_overwrite_ref().
      
      So avoid that and get the generation at get_cur_inode_state() and then
      propagate it up to did_overwrite_ref() and will_overwrite_ref().
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      498581f3
    • Filipe Manana's avatar
      btrfs: send: directly return from will_overwrite_ref() and simplify it · b3047a42
      Filipe Manana authored
      There are no resources to release before will_overwrite_ref() returns, so
      we don't really need the 'out' label and jumping to it when conditions are
      met - we can directly return and get rid of the label and jumps. Also we
      can deal with -ENOENT and other errors in a single if-else logic, as it's
      more straightforward.
      
      This helps the next patch in the series to be more simple as well.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b3047a42
    • Filipe Manana's avatar
      btrfs: send: avoid unnecessary generation search at did_overwrite_ref() · cb689481
      Filipe Manana authored
      At did_overwrite_ref() we always call get_inode_gen() to find out the
      generation of the inode 'ow_inode'. However we don't always need to use
      that generation, and in fact it's very common to not use it, so we end
      up doing a b+tree search on the send root, allocating a path, etc, for
      nothing. So improve on this by getting the generation only if we need
      to use it.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb689481
    • Filipe Manana's avatar
      btrfs: send: directly return from did_overwrite_ref() and simplify it · e739ba30
      Filipe Manana authored
      There are no resources to release before did_overwrite_ref() returns, so
      we don't really need the 'out' label and jumping to it when conditions are
      met - we can directly return and get rid of the label and jumps. Also we
      can deal with -ENOENT and other errors in a single if-else logic, as it's
      more straightforward.
      
      This helps the next patch in the series to be more simple as well.
      
      This patch is part of a larger patchset and the changelog of the last
      patch in the series contains a sample performance test and results.
      The patches that comprise the patchset are the following:
      
        btrfs: send: directly return from did_overwrite_ref() and simplify it
        btrfs: send: avoid unnecessary generation search at did_overwrite_ref()
        btrfs: send: directly return from will_overwrite_ref() and simplify it
        btrfs: send: avoid extra b+tree searches when checking reference overrides
        btrfs: send: remove send_progress argument from can_rmdir()
        btrfs: send: avoid duplicated orphan dir allocation and initialization
        btrfs: send: avoid unnecessary orphan dir rbtree search at can_rmdir()
        btrfs: send: reduce searches on parent root when checking if dir can be removed
        btrfs: send: iterate waiting dir move rbtree only once when processing refs
        btrfs: send: initialize all the red black trees earlier
        btrfs: send: genericize the backref cache to allow it to be reused
        btrfs: adapt lru cache to allow for 64 bits keys on 32 bits systems
        btrfs: send: cache information about created directories
        btrfs: allow a generation number to be associated with lru cache entries
        btrfs: add an api to delete a specific entry from the lru cache
        btrfs: send: use the lru cache to implement the name cache
        btrfs: send: update size of roots array for backref cache entries
        btrfs: send: cache utimes operations for directories if possible
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e739ba30
    • Qu Wenruo's avatar
      btrfs: sysfs: update fs features directory asynchronously · b7625f46
      Qu Wenruo authored
      [BUG]
      Since the introduction of per-fs feature sysfs interface
      (/sys/fs/btrfs/<UUID>/features/), the content of that directory is never
      updated.
      
      Thus for the following case, that directory will not show the new
      features like RAID56:
      
        # mkfs.btrfs -f $dev1 $dev2 $dev3
        # mount $dev1 $mnt
        # btrfs balance start -f -mconvert=raid5 $mnt
        # ls /sys/fs/btrfs/$uuid/features/
        extended_iref  free_space_tree  no_holes  skinny_metadata
      
      While after unmount and mount, we got the correct features:
      
        # umount $mnt
        # mount $dev1 $mnt
        # ls /sys/fs/btrfs/$uuid/features/
        extended_iref  free_space_tree  no_holes  raid56 skinny_metadata
      
      [CAUSE]
      Because we never really try to update the content of per-fs features/
      directory.
      
      We had an attempt to update the features directory dynamically in commit
      14e46e04 ("btrfs: synchronize incompat feature bits with sysfs
      files"), but unfortunately it get reverted in commit e410e34f
      ("Revert "btrfs: synchronize incompat feature bits with sysfs files"").
      The problem in the original patch is, in the context of
      btrfs_create_chunk(), we can not afford to update the sysfs group.
      
      The exported but never utilized function, btrfs_sysfs_feature_update()
      is the leftover of such attempt.  As even if we go sysfs_update_group(),
      new files will need extra memory allocation, and we have no way to
      specify the sysfs update to go GFP_NOFS.
      
      [FIX]
      This patch will address the old problem by doing asynchronous sysfs
      update in the cleaner thread.
      
      This involves the following changes:
      
      - Make __btrfs_(set|clear)_fs_(incompat|compat_ro) helpers to set
        BTRFS_FS_FEATURE_CHANGED flag when needed
      
      - Update btrfs_sysfs_feature_update() to use sysfs_update_group()
        And drop unnecessary arguments.
      
      - Call btrfs_sysfs_feature_update() in cleaner_kthread
        If we have the BTRFS_FS_FEATURE_CHANGED flag set.
      
      - Wake up cleaner_kthread in btrfs_commit_transaction if we have
        BTRFS_FS_FEATURE_CHANGED flag
      
      By this, all the previously dangerous call sites like
      btrfs_create_chunk() need no new changes, as above helpers would
      have already set the BTRFS_FS_FEATURE_CHANGED flag.
      
      The real work happens at cleaner_kthread, thus we pay the cost of
      delaying the update to sysfs directory, but the delayed time should be
      small enough that end user can not distinguish though it might get
      delayed if the cleaner thread is busy with removing subvolumes or
      defrag.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7625f46
    • ye xingchen's avatar
      btrfs: remove duplicate include header in extent-tree.c · 58e36c2a
      ye xingchen authored
      extent-tree.h is included more than once, added in a0231804 ("btrfs:
      move extent-tree helpers into their own header file").
      Signed-off-by: default avatarye xingchen <ye.xingchen@zte.com.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      58e36c2a
    • Qu Wenruo's avatar
      btrfs: scrub: improve tree block error reporting · 28232909
      Qu Wenruo authored
      [BUG]
      When debugging a scrub related metadata error, it turns out that our
      metadata error reporting is not ideal.
      
      The only 3 error messages are:
      
      - BTRFS error (device dm-2): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 0, gen 1
        Showing we have metadata generation mismatch errors.
      
      - BTRFS error (device dm-2): unable to fixup (regular) error at logical 7110656 on dev /dev/mapper/test-scratch1
        Showing which tree blocks are corrupted.
      
      - BTRFS warning (device dm-2): checksum/header error at logical 24772608 on dev /dev/mapper/test-scratch2, physical 3801088: metadata node (level 1) in tree 5
        Showing which physical range the corrupted metadata is at.
      
      We have to combine the above 3 to know we have a corrupted metadata with
      generation mismatch.
      
      And this is already the better case, if we have other problems, like
      fsid mismatch, we can not even know the cause.
      
      [CAUSE]
      The problem is caused by the fact that, scrub_checksum_tree_block()
      never outputs any error message.
      
      It just return two bits for scrub: sblock->header_error, and
      sblock->generation_error.
      
      And later we report error in scrub_print_warning(), but unfortunately we
      only have two bits, there is not really much thing we can done to print
      any detailed errors.
      
      [FIX]
      This patch will do the following to enhance the error reporting of
      metadata scrub:
      
      - Add extra warning (ratelimited) for every error we hit
        This can help us to distinguish the different types of errors.
        Some errors can help us to know what's going wrong immediately,
        like bytenr mismatch.
      
      - Re-order the checks
        Currently we check bytenr first, then immediately generation.
        This can lead to false generation mismatch reports, while the fsid
        mismatches.
      
      Here is the new output for the bug I'm debugging (we forgot to
      writeback tree blocks for commit roots):
      
       BTRFS warning (device dm-2): tree block 24117248 mirror 1 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
       BTRFS warning (device dm-2): tree block 24117248 mirror 0 has bad fsid, has b77cd862-f150-4c71-90ec-7baf0544d83f want 17df6abf-23cd-445f-b350-5b3e40bfd2fc
      
      Now we can immediately know it's some tree blocks didn't even get written
      back, other than the original confusing generation mismatch.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      28232909
    • Boris Burkov's avatar
      btrfs: don't use size classes for zoned file systems · cb0922f2
      Boris Burkov authored
      When a file system has ZNS devices which are constrained by a maximum
      number of active block groups, then not being able to use all the block
      groups for every allocation is not ideal, and could cause us to loop a
      ton with mixed size allocations.
      
      In general, since zoned doesn't write into gaps behind where block
      groups are writing, it is not susceptible to the same sort of
      fragmentation that size classes are designed to solve, so we can skip
      size classes for zoned file systems in general, even though there would
      probably be no harm for SMR devices.
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb0922f2
    • Boris Burkov's avatar
      btrfs: load block group size class when caching · c7eec3d9
      Boris Burkov authored
      Since the size class is an artifact of an arbitrary anti fragmentation
      strategy, it doesn't really make sense to persist it. Furthermore, most
      of the size class logic assumes fresh block groups. That is of course
      not a reasonable assumption -- we will be upgrading kernels with
      existing filesystems whose block groups are not classified.
      
      To work around those issues, implement logic to compute the size class
      of the block groups as we cache them in. To perfectly assess the state
      of a block group, we would have to read the entire extent tree (since
      the free space cache mashes together contiguous extent items) which
      would be prohibitively expensive for larger file systems with more
      extents.
      
      We can do it relatively cheaply by implementing a simple heuristic of
      sampling a handful of extents and picking the smallest one we see. In
      the happy case where the block group was classified, we will only see
      extents of the correct size. In the unhappy case, we will hopefully find
      one of the smaller extents, but there is no perfect answer anyway.
      Autorelocation will eventually churn up the block group if there is
      significant freeing anyway.
      
      There was no regression in mount performance at end state of the fsperf
      test suite, and the delay until the block group is marked cached is
      minimized by the constant number of extent samples.
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c7eec3d9
    • Boris Burkov's avatar
      btrfs: introduce size class to block group allocator · 52bb7a21
      Boris Burkov authored
      The aim of this patch is to reduce the fragmentation of block groups
      under certain unhappy workloads. It is particularly effective when the
      size of extents correlates with their lifetime, which is something we
      have observed causing fragmentation in the fleet at Meta.
      
      This patch categorizes extents into size classes:
      
      - x < 128KiB: "small"
      - 128KiB < x < 8MiB: "medium"
      - x > 8MiB: "large"
      
      and as much as possible reduces allocations of extents into block groups
      that don't match the size class. This takes advantage of any (possible)
      correlation between size and lifetime and also leaves behind predictable
      re-usable gaps when extents are freed; small writes don't gum up bigger
      holes.
      
      Size classes are implemented in the following way:
      
      - Mark each new block group with a size class of the first allocation
        that goes into it.
      
      - Add two new passes to ffe: "unset size class" and "wrong size class".
        First, try only matching block groups, then try unset ones, then allow
        allocation of new ones, and finally allow mismatched block groups.
      
      - Filtering is done just by skipping inappropriate ones, there is no
        special size class indexing.
      
      Other solutions I considered were:
      
      - A best fit allocator with an rb-tree. This worked well, as small
        writes didn't leak big holes from large freed extents, but led to
        regressions in ffe and write performance due to lock contention on
        the rb-tree with every allocation possibly updating it in parallel.
        Perhaps something clever could be done to do the updates in the
        background while being "right enough".
      
      - A fixed size "working set". This prevents freeing an extent
        drastically changing where writes currently land, and seems like a
        good option too. Doesn't take advantage of size in any way.
      
      - The same size class idea, but implemented with xarray marks. This
        turned out to be slower than looping the linked list and skipping
        wrong block groups, and is also less flexible since we must have only
        3 size classes (max #marks). With the current approach we can have as
        many as we like.
      
      Performance testing was done via: https://github.com/josefbacik/fsperf
      Of particular relevance are the new fragmentation specific tests.
      
      A brief summary of the testing results:
      
      - Neutral results on existing tests. There are some minor regressions
        and improvements here and there, but nothing that truly stands out as
        notable.
      - Improvement on new tests where size class and extent lifetime are
        correlated. Fragmentation in these cases is completely eliminated
        and write performance is generally a little better. There is also
        significant improvement where extent sizes are just a bit larger than
        the size class boundaries.
      - Regression on one new tests: where the allocations are sized
        intentionally a hair under the borders of the size classes. Results
        are neutral on the test that intentionally attacks this new scheme by
        mixing extent size and lifetime.
      
      The full dump of the performance results can be found here:
      https://bur.io/fsperf/size-class-2022-11-15.txt
      (there are ANSI escape codes, so best to curl and view in terminal)
      
      Here is a snippet from the full results for a new test which mixes
      buffered writes appending to a long lived set of files and large short
      lived fallocates:
      
      bufferedappendvsfallocate results
               metric             baseline       current        stdev            diff
      ======================================================================================
      avg_commit_ms                    31.13         29.20          2.67     -6.22%
      bg_count                            14         15.60             0     11.43%
      commits                          11.10         12.20          0.32      9.91%
      elapsed                          27.30         26.40          2.98     -3.30%
      end_state_mount_ns         11122551.90   10635118.90     851143.04     -4.38%
      end_state_umount_ns           1.36e+09      1.35e+09   12248056.65     -1.07%
      find_free_extent_calls       116244.30     114354.30        964.56     -1.63%
      find_free_extent_ns_max      599507.20    1047168.20     103337.08     74.67%
      find_free_extent_ns_mean       3607.19       3672.11        101.20      1.80%
      find_free_extent_ns_min            500           512          6.67      2.40%
      find_free_extent_ns_p50           2848          2876         37.65      0.98%
      find_free_extent_ns_p95           4916          5000         75.45      1.71%
      find_free_extent_ns_p99       20734.49      20920.48       1670.93      0.90%
      frag_pct_max                     61.67             0          8.05   -100.00%
      frag_pct_mean                    43.59             0          6.10   -100.00%
      frag_pct_min                     25.91             0         16.60   -100.00%
      frag_pct_p50                     42.53             0          7.25   -100.00%
      frag_pct_p95                     61.67             0          8.05   -100.00%
      frag_pct_p99                     61.67             0          8.05   -100.00%
      fragmented_bg_count               6.10             0          1.45   -100.00%
      max_commit_ms                    49.80            46          5.37     -7.63%
      sys_cpu                           2.59          2.62          0.29      1.39%
      write_bw_bytes                1.62e+08      1.68e+08   17975843.50      3.23%
      write_clat_ns_mean            57426.39      54475.95       2292.72     -5.14%
      write_clat_ns_p50             46950.40      42905.60       2101.35     -8.62%
      write_clat_ns_p99            148070.40     143769.60       2115.17     -2.90%
      write_io_kbytes                4194304       4194304             0      0.00%
      write_iops                     2476.15       2556.10        274.29      3.23%
      write_lat_ns_max            2101667.60    2251129.50     370556.59      7.11%
      write_lat_ns_mean             59374.91      55682.00       2523.09     -6.22%
      write_lat_ns_min              17353.10         16250       1646.08     -6.36%
      
      There are some mixed improvements/regressions in most metrics along with
      an elimination of fragmentation in this workload.
      
      On the balance, the drastic 1->0 improvement in the happy cases seems
      worth the mix of regressions and improvements we do observe.
      
      Some considerations for future work:
      
      - Experimenting with more size classes
      - More hinting/search ordering work to approximate a best-fit allocator
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      52bb7a21
    • Boris Burkov's avatar
      btrfs: add more find_free_extent tracepoints · 854c2f36
      Boris Burkov authored
      find_free_extent is a complicated function. It consists (at least) of:
      
      - a hint that jumps into the middle of a for loop macro
      - a middle loop trying every raid level
      - an outer loop ascending through ffe loop levels
      - complicated logic for skipping some of those ffe loop levels
      - multiple underlying in-bg allocators (zoned, cluster, no cluster)
      
      Which is all to say that more tracing is helpful for debugging its
      behavior. Add two new tracepoints: at the entrance to the block_groups
      loop (hit for every raid level and every ffe_ctl loop) and at the point
      we seriously consider a block_group for allocation. This way we can see
      the whole path through the algorithm, including hints, multiple loops,
      etc.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      854c2f36
    • Boris Burkov's avatar
      btrfs: pass find_free_extent_ctl to allocator tracepoints · cfc2de0f
      Boris Burkov authored
      The allocator tracepoints currently have a pile of values from ffe_ctl.
      In modifying the allocator and adding more tracepoints, I found myself
      adding to the already long argument list of the tracepoints. It makes it
      a lot simpler to just send in the ffe_ctl itself.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cfc2de0f
    • Christoph Hellwig's avatar
      btrfs: remove the wait argument to btrfs_start_ordered_extent · 36d45567
      Christoph Hellwig authored
      Given that wait is always set to 1, so remove the argument.
      Last use of wait with 0 was in 0c304304 ("Btrfs: remove
      csum_bytes_left").
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36d45567
    • Filipe Manana's avatar
      btrfs: use a single variable to track return value for log_dir_items() · 235e1c7b
      Filipe Manana authored
      We currently use 'ret' and 'err' to track the return value for
      log_dir_items(), which is confusing and likely the cause for previous
      bugs where log_dir_items() did not return an error when it should, fixed
      in previous patches.
      
      So change this and use only a single variable, 'ret', to track the return
      value. This is simpler and makes it similar to most of the existing code.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      235e1c7b
    • Filipe Manana's avatar
      btrfs: use a negative value for BTRFS_LOG_FORCE_COMMIT · 5cce1780
      Filipe Manana authored
      Currently we use the value 1 for BTRFS_LOG_FORCE_COMMIT, but that value
      has a few inconveniences:
      
      1) If it's ever used by btrfs_log_inode(), or any function down the call
         chain, we have to remember to btrfs_set_log_full_commit(), which is
         repetitive and has a chance to be forgotten in future use cases.
         btrfs_log_inode_parent() only calls btrfs_set_log_full_commit() when
         it gets a negative value from btrfs_log_inode();
      
      2) Down the call chain of btrfs_log_inode(), we may have functions that
         need to force a log commit, but can return either an error (negative
         value), false (0) or true (1). So they are forced to return some
         random negative to force a log commit - using BTRFS_LOG_FORCE_COMMIT
         would make the intention more clear. Currently the only example is
         flush_dir_items_batch().
      
      So turn BTRFS_LOG_FORCE_COMMIT into a negative value. The chosen value
      is -(MAX_ERRNO + 1), so that it does not overlap any errno value and makes
      it easier to debug.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5cce1780
    • Yushan Zhou's avatar
      btrfs: use PAGE_{ALIGN, ALIGNED, ALIGN_DOWN} macro · ce394a7f
      Yushan Zhou authored
      The header file linux/mm.h provides PAGE_ALIGN, PAGE_ALIGNED,
      PAGE_ALIGN_DOWN macros. Use these macros to make code more
      concise.
      Signed-off-by: default avatarYushan Zhou <katrinzhou@tencent.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ce394a7f
    • Peng Hao's avatar
      btrfs: go to matching label when cleaning em in btrfs_submit_direct · d31de378
      Peng Hao authored
      When btrfs_get_chunk_map fails to allocate a new em the cleanup does not
      need to be done so the goto target is out_err, which is consistent with
      current coding style.
      Signed-off-by: default avatarPeng Hao <flyingpeng@tencent.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d31de378
    • Josef Bacik's avatar
      btrfs: turn on -Wmaybe-uninitialized · 1ec49744
      Josef Bacik authored
      We had a recent bug that would have been caught by a newer compiler with
      -Wmaybe-uninitialized and would have saved us a month of failing tests
      that I didn't have time to investigate.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1ec49744
    • Josef Bacik's avatar
      btrfs: fix uninitialized variable warning in run_one_async_start · a6ca692e
      Josef Bacik authored
      With -Wmaybe-uninitialized compiler complains about ret being possibly
      uninitialized, which isn't possible as the WQ_ constants are set only
      from our code, however we can handle the default case and get rid of the
      warning.
      
      The value is set to BLK_STS_IOERR so it does not issue any IO and could
      be potentially detected, but this is basically a "cannot happen" error.
      To catch any problems during development use the assert.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ set the error in default: ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a6ca692e
    • Naohiro Aota's avatar
      btrfs: zoned: fix uninitialized variable warning in btrfs_get_dev_zones · cd30d3bc
      Naohiro Aota authored
      Fix an uninitialized warning we get with -Wmaybe-uninitialized where it
      thought zno may have been uninitialized, in both cases it depends on
      zinfo->zone_cache but we know the value won't change between checks.
      Reported-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Link: https://lore.kernel.org/linux-btrfs/af6c527cbd8bdc782e50bd33996ee83acc3a16fb.1671221596.git.josef@toxicpanda.com/Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd30d3bc
    • Josef Bacik's avatar
      btrfs: fix uninitialized variable warning in btrfs_sb_log_location · 12adffe6
      Josef Bacik authored
      We only have 3 possible mirrors, and we have ASSERT()'s to make sure
      we're not passing in an invalid super mirror into this function, so
      technically this value isn't uninitialized.  However
      -Wmaybe-uninitialized will complain, so set it to U64_MAX so if we don't
      have ASSERT()'s turned on it'll error out later on when it see's the
      zone is beyond our maximum zones.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      12adffe6
    • Josef Bacik's avatar
      btrfs: fix uninitialized variable warnings in __set_extent_bit and convert_extent_bit · 59864325
      Josef Bacik authored
      We will pass in the parent and p pointer into our tree_search function
      to avoid doing a second search when inserting a new extent state into
      the tree.  However because this is conditional upon passing in these
      pointers the compiler seems to think these values can be uninitialized
      if we're using -Wmaybe-uninitialized.  Fix this by initializing these
      values.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59864325
    • Josef Bacik's avatar
      btrfs: fix uninitialized variable warning in btrfs_update_block_group · efbf35a1
      Josef Bacik authored
      reclaim isn't set in the alloc case, however we only care about
      reclaim in the !alloc case.  This isn't an actual problem, however
      -Wmaybe-uninitialized will complain, so initialize reclaim to quiet the
      compiler.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      efbf35a1
    • Josef Bacik's avatar
      btrfs: fix uninitialized variable warning in get_inode_gen · ab199013
      Josef Bacik authored
      Anybody that calls get_inode_gen() can have an uninitialized gen if
      there's an error.  This isn't a big deal because all the users just exit
      if they get an error, however it makes -Wmaybe-uninitialized complain,
      so fix this up to always initialize the passed in gen, this quiets all
      of the uninitialized warnings in send.c.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab199013