1. 26 Sep, 2022 40 commits
    • btrfs: decide bio cloning inside submit_stripe_bio · 28793b19
      Christoph Hellwig authored
      Remove the orig_bio argument as it can be derived from the bioc, and
      the clone argument as it can be calculated from bioc and dev_nr.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: factor out low-level bio setup from submit_stripe_bio · 32747c44
      Christoph Hellwig authored
      Split out a low-level btrfs_submit_dev_bio helper that just submits
      the bio without any cloning decisions or setting up the end I/O handler
      for future reuse by a different caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: give struct btrfs_bio a real end_io handler · 917f32a2
      Christoph Hellwig authored
      Currently btrfs_bio end I/O handling is a bit of a mess.  The bi_end_io
      handler and bi_private pointer of the embedded struct bio are both used
      to handle the completion of the high-level btrfs_bio and for the I/O
      completion for the low-level device that the embedded bio ends up being
      sent to.
      
      To support this bi_end_io and bi_private are saved into the
      btrfs_io_context structure and then restored after the bio sent to the
      underlying device has completed the actual I/O.
      
      Untangle this by adding an end I/O handler and private data to struct
      btrfs_bio for the high-level btrfs_bio based completions, and leave the
      actual bio bi_end_io handler and bi_private pointer entirely to the
      low-level device I/O.
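
      The separation described above can be sketched in userspace C. This is
      only a simplified model, not the kernel code: struct bio is reduced to
      three fields, and every model_* name and the private_data field are
      hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the block layer's struct bio (assumption). */
struct bio {
	int status;                       /* 0 on success, negative errno on error */
	void (*bi_end_io)(struct bio *);  /* low-level (device) completion */
	void *bi_private;                 /* low-level private data */
};

struct model_btrfs_bio;
typedef void (*model_end_io_t)(struct model_btrfs_bio *);

/* Hypothetical model of a btrfs_bio carrying its own completion handler. */
struct model_btrfs_bio {
	model_end_io_t end_io;  /* high-level completion */
	void *private_data;     /* high-level private data */
	struct bio bio;         /* embedded bio, left entirely to device I/O */
};

/* Recover the containing model_btrfs_bio from its embedded bio. */
#define model_bbio(b) \
	((struct model_btrfs_bio *)((char *)(b) - \
				    offsetof(struct model_btrfs_bio, bio)))

/* Device-level completion: hands over to the high-level handler. */
void model_dev_end_io(struct bio *bio)
{
	struct model_btrfs_bio *bbio = model_bbio(bio);

	bbio->end_io(bbio);
}

/*
 * Submit to a pretend device that completes immediately with dev_status.
 * bi_end_io and bi_private are set freely because nothing high-level is
 * stored in them, so nothing has to be saved and restored.
 */
void model_submit(struct model_btrfs_bio *bbio, int dev_status)
{
	bbio->bio.bi_end_io = model_dev_end_io;
	bbio->bio.bi_private = NULL;
	bbio->bio.status = dev_status;
	bbio->bio.bi_end_io(&bbio->bio);
}

/* Example high-level handler: store the status where private_data points. */
void model_record_status(struct model_btrfs_bio *bbio)
{
	*(int *)bbio->private_data = bbio->bio.status;
}
```

      In this model the high-level handler and its private data survive the
      submission untouched, which is the property the commit is after.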
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: properly abstract the parity raid bio handling · f1c29379
      Christoph Hellwig authored
      The parity raid write/recover functionality is currently not very well
      abstracted from the bio submission and completion handling in volumes.c:
      
       - the raid56 code directly completes the original btrfs_bio fed into
         btrfs_submit_bio instead of dispatching back to volumes.c
       - the raid56 code consumes the bioc and bio_counter references taken
         by volumes.c, which also leads to special casing of the calls from
         the scrub code into the raid56 code
      
      To fix this up, supply a bi_end_io handler that calls back into the
      volumes.c machinery, which then puts the bioc, decrements the bio_counter
      and completes the original bio, and update the scrub code to also
      take ownership of the bioc and bio_counter in all cases.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use chained bios when cloning · c3a62baf
      Christoph Hellwig authored
      The stripes_pending in the btrfs_io_context counts the number of
      in-flight low-level bios for an upper btrfs_bio.  For reads this is
      generally one as reads are never cloned, while for writes we can
      trivially use the bio remaining mechanism that is used for chained bios.
      
      To be able to make use of that mechanism, split out a separate trivial
      end_io handler for the cloned bios that does a minimal amount of error
      tracking and which then calls bio_endio on the original bio to transfer
      control to that, with the remaining counter making sure it is completed
      last.  This then allows merging btrfs_end_bioc into the original bio
      bi_end_io handler.
      
      To make this all work all error handling needs to happen through the
      bi_end_io handler, which requires a small amount of reshuffling in
      submit_stripe_bio so that the bio is cloned already by the time the
      suitability of the device is checked.
      
      This reduces the size of the btrfs_io_context and prepares splitting
      the btrfs_bio at the stripe boundary.
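
      A minimal userspace sketch of the remaining-counter idea follows. It is
      a model under stated assumptions rather than the kernel implementation:
      struct model_bio and all model_* functions are hypothetical, and the
      completion order is driven by hand instead of by real devices.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a bio. */
struct model_bio {
	atomic_int remaining;     /* completions still outstanding */
	atomic_int error;         /* first error recorded, 0 if none */
	int completed;            /* set once the final handler has run */
	struct model_bio *orig;   /* for clones: the original bio */
	void (*end_io)(struct model_bio *);
};

/* Loosely modeled on bio_endio(): only the last completion runs end_io. */
void model_bio_endio(struct model_bio *bio)
{
	if (atomic_fetch_sub(&bio->remaining, 1) == 1)
		bio->end_io(bio);
}

/* Handler of the original bio: runs once everything has completed. */
void model_orig_end_io(struct model_bio *bio)
{
	bio->completed = 1;
}

/* Clone handler: minimal error tracking, then pass control to the original. */
void model_clone_end_io(struct model_bio *clone)
{
	int err = atomic_load(&clone->error);
	int expected = 0;

	if (err)
		atomic_compare_exchange_strong(&clone->orig->error,
					       &expected, err);
	model_bio_endio(clone->orig);
}

void model_init_orig(struct model_bio *orig)
{
	atomic_init(&orig->remaining, 1);
	atomic_init(&orig->error, 0);
	orig->completed = 0;
	orig->orig = NULL;
	orig->end_io = model_orig_end_io;
}

/* Chain a clone: the original now waits for one more completion. */
void model_chain(struct model_bio *clone, struct model_bio *orig)
{
	atomic_init(&clone->remaining, 1);
	atomic_init(&clone->error, 0);
	clone->completed = 0;
	clone->orig = orig;
	clone->end_io = model_clone_end_io;
	atomic_fetch_add(&orig->remaining, 1);
}
```

      With two clones chained, the original's handler runs only after the
      third call to model_bio_endio(), whichever of the three bios finishes
      last, so no separate pending counter is needed in this model.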
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't take a bio_counter reference for cloned bios · 2bbc72f1
      Christoph Hellwig authored
      Stop grabbing an extra bio_counter reference for each clone bio in a
      mirrored write and instead just release the one original reference in
      btrfs_end_bioc once all the bios for a single btrfs_bio have completed
      instead of at the end of btrfs_submit_bio once all bios have been
      submitted.
      
      This means the reference is now carried by the "upper" btrfs_bio only
      instead of each lower bio.
      
      Also remove the now unused btrfs_bio_counter_inc_noblocked helper.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass the operation to btrfs_bio_alloc · 6b42f5e3
      Christoph Hellwig authored
      Pass the operation to btrfs_bio_alloc, matching what bio_alloc_bioset
      does.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_bio allocation to volumes.c · d45cfb88
      Christoph Hellwig authored
      volumes.c is the place that implements the storage layer using the
      btrfs_bio structure, so move the bio_set and allocation helpers there
      as well.
      
      To make up for the new initialization boilerplate, merge the two
      init/exit helpers in extent_io.c into a single one.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't create integrity bioset for btrfs_bioset · 1e408af3
      Christoph Hellwig authored
      btrfs never uses bio integrity data itself, so don't allocate
      the integrity pools for btrfs_bioset.
      
      This patch is a revert of the commit b208c2f7 ("btrfs: Fix crash due
      to not allocating integrity data for a set").  The integrity data pool
      is not needed, the bio-integrity code now handles allocating the
      integrity payload without that.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Tested-by: Nikolay Borisov <nborisov@suse.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove use btrfs_remove_free_space_cache instead of variant · fc80f7ac
      Josef Bacik authored
      We are calling __btrfs_remove_free_space_cache everywhere to cleanup the
      block group free space, however we can just use
      btrfs_remove_free_space_cache and pass in the block group in all of
      these places.  Then we can remove __btrfs_remove_free_space_cache and
      rename __btrfs_remove_free_space_cache_locked to
      __btrfs_remove_free_space_cache.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: call __btrfs_remove_free_space_cache_locked on cache load failure · 8a1ae278
      Josef Bacik authored
      Now that lockdep is staying enabled through our entire CI runs I started
      seeing the following stack in generic/475
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 2171864 at fs/btrfs/discard.c:604 btrfs_discard_update_discardable+0x98/0xb0
      CPU: 1 PID: 2171864 Comm: kworker/u4:0 Not tainted 5.19.0-rc8+ #789
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Workqueue: btrfs-cache btrfs_work_helper
      RIP: 0010:btrfs_discard_update_discardable+0x98/0xb0
      RSP: 0018:ffffb857c2f7bad0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: ffff8c85c605c200 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: ffffffff86807c5b RDI: ffffffff868a831e
      RBP: ffff8c85c4c54000 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8c85c66932f0 R11: 0000000000000001 R12: ffff8c85c3899010
      R13: ffff8c85d5be4f40 R14: ffff8c85c4c54000 R15: ffff8c86114bfa80
      FS:  0000000000000000(0000) GS:ffff8c863bd00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f2e7f168160 CR3: 000000010289a004 CR4: 0000000000370ee0
      Call Trace:
      
       __btrfs_remove_free_space_cache+0x27/0x30
       load_free_space_cache+0xad2/0xaf0
       caching_thread+0x40b/0x650
       ? lock_release+0x137/0x2d0
       btrfs_work_helper+0xf2/0x3e0
       ? lock_is_held_type+0xe2/0x140
       process_one_work+0x271/0x590
       ? process_one_work+0x590/0x590
       worker_thread+0x52/0x3b0
       ? process_one_work+0x590/0x590
       kthread+0xf0/0x120
       ? kthread_complete_and_exit+0x20/0x20
       ret_from_fork+0x1f/0x30
      
      This is the code
      
              ctl = block_group->free_space_ctl;
              discard_ctl = &block_group->fs_info->discard_ctl;
      
              lockdep_assert_held(&ctl->tree_lock);
      
      We have a temporary free space ctl for loading the free space cache in
      order to avoid having allocations happening while we're loading the
      cache.  When we hit an error we free it all up, however this also calls
      btrfs_discard_update_discardable, which requires
      block_group->free_space_ctl->tree_lock to be held.  However this is our
      temporary ctl so this lock isn't held.  Fix this by calling
      __btrfs_remove_free_space_cache_locked instead so that we only clean up
      the entries and do not mess with the discardable stats.
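
      The locking mismatch can be illustrated with a small userspace model.
      It is a sketch under assumptions: the lock is reduced to a boolean, the
      lockdep assertion to a return value, and all model_* names are
      hypothetical stand-ins for the kernel functions named above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a free space ctl; the lock is just a flag here. */
struct model_ctl {
	bool tree_lock_held;
	int nr_entries;
};

struct model_block_group {
	struct model_ctl *free_space_ctl;  /* the block group's real ctl */
	int discardable_updates;
};

/*
 * Stand-in for btrfs_discard_update_discardable(): it looks at the block
 * group's own ctl lock. Returning false models the point where
 * lockdep_assert_held() would warn in the kernel.
 */
bool model_update_discardable(struct model_block_group *bg)
{
	if (!bg->free_space_ctl->tree_lock_held)
		return false;
	bg->discardable_updates++;
	return true;
}

/* Stand-in for the _locked variant: only drops the entries. */
void model_remove_cache_locked(struct model_ctl *ctl)
{
	ctl->nr_entries = 0;
}

/* Stand-in for the full variant: drops entries, then updates the stats. */
bool model_remove_cache(struct model_block_group *bg, struct model_ctl *ctl)
{
	model_remove_cache_locked(ctl);
	return model_update_discardable(bg);
}
```

      Calling model_remove_cache() on a temporary ctl trips the check because
      the block group's own lock is not held, while
      model_remove_cache_locked() cleans the temporary ctl without touching
      the stats.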
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix race between quota enable and quota rescan ioctl · 331cd946
      Filipe Manana authored
      When enabling quotas, at btrfs_quota_enable(), after committing the
      transaction, we change fs_info->quota_root to point to the quota root we
      created and set BTRFS_FS_QUOTA_ENABLED at fs_info->flags. Then we try
      to start the qgroup rescan worker, first by initializing it with a call
      to qgroup_rescan_init() - however if that fails we end up freeing the
      quota root but we leave fs_info->quota_root still pointing to it. This
      can later result in a use-after-free somewhere else.
      
      We have previously set the flags BTRFS_FS_QUOTA_ENABLED and
      BTRFS_QGROUP_STATUS_FLAG_ON, so we can only fail with -EINPROGRESS at
      btrfs_quota_enable(), which is possible if someone already called the
      quota rescan ioctl, and therefore started the rescan worker.
      
      So fix this by ignoring an -EINPROGRESS and asserting we can't get any
      other error.
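
      The error policy described above might be sketched as follows. This is
      an illustrative userspace model, not the patch itself:
      model_handle_rescan_init() and its start_worker output are hypothetical
      names, and the plain assert() stands in for a kernel assertion.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Decide what quota enable should do with the result of initializing the
 * rescan worker (init_ret is the hypothetical return value of that step).
 *
 * - 0:            start the rescan worker ourselves.
 * - -EINPROGRESS: someone already started it through the rescan ioctl,
 *                 which is not a failure of quota enable.
 * - anything else is not expected at this point.
 */
int model_handle_rescan_init(int init_ret, bool *start_worker)
{
	if (init_ret == 0) {
		*start_worker = true;
		return 0;
	}

	assert(init_ret == -EINPROGRESS);
	*start_worker = false;
	return 0;  /* quota enable still succeeds, quota root stays valid */
}
```

      In both accepted cases the function returns 0, so the caller never
      reaches an error path that frees the quota root.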
      Reported-by: Ye Bin <yebin10@huawei.com>
      Link: https://lore.kernel.org/linux-btrfs/20220823015931.421355-1-yebin10@huawei.com/
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't print information about space cache or tree every remount · dbecac26
      Maciej S. Szmigiero authored
      btrfs currently prints information about space cache or free space tree
      being in use on every remount, regardless of whether such a remount
      actually enabled or disabled one of these features.
      
      This is actually unnecessary since providing remount options changing the
      state of these features will explicitly print the appropriate notice.
      
      Let's instead print such unconditional information just on an initial mount
      to avoid filling the kernel log when, for example, laptop-mode-tools
      remounts the fs on some events.
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify error handling at btrfs_del_root_ref() · 1fdbd03d
      Filipe Manana authored
      At btrfs_del_root_ref() we are using two return variables, named 'ret'
      and 'err'. This makes it harder to follow and easier to return the wrong
      value in case an error happens - the previous patch in the series, which
      has the subject "btrfs: fix silent failure when deleting root
      reference", fixed a bug due to confusion created by these two variables.
      
      So change the function to use a single variable for tracking the return
      value of the function, using only 'ret', which is consistent with most
      of the codebase.
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get rid of block group caching progress logic · 48ff7083
      Omar Sandoval authored
      struct btrfs_caching_ctl::progress and struct
      btrfs_block_group::last_byte_to_unpin were previously needed to ensure
      that unpin_extent_range() didn't return a range to the free space cache
      before the caching thread had a chance to cache that range. However, the
      commit "btrfs: fix space cache corruption and potential double
      allocations" made it so that we always synchronously cache the block
      group at the time that we pin the extent, so this machinery is no longer
      necessary.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix failures when processing inodes with no links · 9ed0a72e
      BingJing Chang authored
      There is a bug causing send failures when processing an orphan directory
      with no links. Commit 46b2f459 ("Btrfs: fix send failure when root has
      deleted files still open") addressed the orphan inode issue: the send
      operation failed with an ENOENT error on any attempt to generate a path
      for an inode with a link count of zero. That patch therefore introduced
      sctx->ignore_cur_inode, which is set when the current inode has a link
      count of zero so that some unnecessary steps are bypassed, and a helper
      function btrfs_unlink_all_paths(), which is called to clean up old paths
      found in the parent snapshot. However, not only regular files but also
      directories can be orphan inodes. So if the send operation meets an
      orphan directory, it issues a wrong unlink command for that directory,
      and the receive operation then fails with an EISDIR error. Besides, the
      send operation also fails with an ENOENT error later when it tries to
      generate a path for the directory.
      
      Similar example but making an orphan dir for an incremental send:
      
        $ btrfs subvolume create vol
        $ mkdir vol/dir
        $ touch vol/dir/foo
      
        $ btrfs subvolume snapshot -r vol snap1
        $ btrfs subvolume snapshot -r vol snap2
      
        # Turn the second snapshot to RW mode and delete the whole dir while
        # holding an open file descriptor on it.
        $ btrfs property set snap2 ro false
        $ exec 73<snap2/dir
        $ rm -rf snap2/dir
      
        # Set the second snapshot back to RO mode and do an incremental send.
        $ btrfs property set snap2 ro true
        $ mkdir receive_dir
        $ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
        At subvol snap2
        At snapshot snap2
        ERROR: send ioctl failed with -2: No such file or directory
        ERROR: unlink dir failed. Is a directory
      
      Orphan inodes are actually more common in cascading backups
      (see the illustration below). In a cascading backup, a user wants
      to replicate a couple of snapshots from Machine A to Machine B and from
      Machine B to Machine C. Machine B doesn't take any RO snapshots for
      sending. All a receiver does is create an RW snapshot of its parent
      snapshot, apply the send stream and turn it into RO mode at the end.
      Even if all paths of some inodes are deleted in applying the send
      stream, these inodes would not be deleted and become orphans after
      changing the subvolume from RW to RO. Moreover, orphan inodes can occur
      not only in send snapshots but also in parent snapshots because Machine
      B may do a batch replication of a couple of snapshots.
      
      An illustration for cascading backups:
      
        Machine A (snapshot {1..n}) --> Machine B --> Machine C
      
      The idea to solve the problem is to delete all the items of orphan
      inodes before using these snapshots for sending. I used to think that
      the reasonable timing for doing that is during the ioctl of changing the
      subvolume from RW to RO because it sounds good that we will not modify
      the fs tree of a RO snapshot anymore. However, attempting to do the
      orphan cleanup in the ioctl would be pointless, because if someone is
      holding an open file descriptor on the inode, the reference count of the
      inode will never drop to 0. Then iput() cannot trigger eviction, which
      is what finally deletes all the items of the inode. So we try to extend
      the original
      patch to handle orphans in send/parent snapshots. Here are several cases
      that need to be considered:
      
      Case 1: BTRFS_COMPARE_TREE_NEW
             |  send snapshot  | action
       --------------------------------
       nlink |        0        | ignore
      
      In case 1, when we get a BTRFS_COMPARE_TREE_NEW tree comparison result,
      it means that a new inode is found in the send snapshot and it doesn't
      appear in the parent snapshot. Since this inode has a link count of zero
      (It's an orphan and there're no paths for it.), we can leverage
      sctx->ignore_cur_inode in the original patch to prevent it from being
      created.
      
      Case 2: BTRFS_COMPARE_TREE_DELETED
             | parent snapshot | action
       ----------------------------------
       nlink |        0        | as usual
      
      In case 2, when we get a BTRFS_COMPARE_TREE_DELETED tree comparison
      result, it means that the inode only appears in the parent snapshot.
      As usual, the send operation will try to delete all its paths. However,
      this inode has a link count of zero, so no paths of it will be found. No
      deletion operations will be issued. We don't need to change any logic.
      
      Case 3: BTRFS_COMPARE_TREE_CHANGED
                 |       | parent snapshot | send snapshot | action
       -----------------------------------------------------------------------
       subcase 1 | nlink |        0        |       0       | ignore
       subcase 2 | nlink |       >0        |       0       | new_gen(deletion)
       subcase 3 | nlink |        0        |      >0       | new_gen(creation)
      
      In case 3, when we get a BTRFS_COMPARE_TREE_CHANGED tree comparison result,
      it means that the inode appears in both snapshots. Here are 3 subcases.
      
      In the first subcase, the inode has a link count of zero in both
      snapshots. Since there are no paths for this inode in the
      (source/destination) parent snapshots and we don't care whether there is
      also an orphan inode in the destination, we can set
      sctx->ignore_cur_inode to prevent it from being created.
      
      In the second and third subcases, there are paths in one snapshot and no
      paths in the other snapshot for this inode, so we can treat this inode
      as a new generation. We can also leverage the logic handling a new
      generation of an inode with small adjustments. Then it will delete all
      old paths and create a new inode with new attributes and paths only when
      there's a positive link count in the send snapshot.
      
      In subcase 2, the send operation only needs to delete all old paths as
      in the parent snapshot. But it may require more operations for a
      directory to remove its old paths. If a non-empty directory is going to
      be deleted (because it has a link count of zero in the send snapshot)
      but there are files/directories with bigger inode numbers under it, the
      send operation will need to rename it to its orphan name first. After
      processing and deleting the last item under this directory, the send
      operation will check this directory, aka the parent directory of the
      last item, again and issue a rmdir operation to remove it finally.
      
      Therefore, we also need to treat inodes with a link count of zero as if
      they didn't exist in get_cur_inode_state(), which is used in
      process_recorded_refs(). By doing this, when checking a directory with
      orphan names after the last item under it has been deleted, the send
      operation now can properly issue a rmdir operation. Otherwise, without
      doing this, the orphan directory with an orphan name would be kept here
      at the end due to the existing inode with a link count of zero being
      found.
      
      In subcase 3, as in case 2, no old paths would be found, so no deletion
      operations will be issued. The send operation will only create a new one
      for that inode.
      
      Note that subcase 3 is not common. That's because it's easy to reduce
      the hard links of an inode, but once all valid paths are removed,
      there are no valid paths for creating other hard links. The only way to
      do that is trying to send an older snapshot after a newer snapshot has
      been sent.
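
      The three case tables above can be condensed into one decision
      function. The sketch below is illustrative only: the enum and function
      names are hypothetical, and returning MODEL_AS_USUAL for inodes with a
      positive link count on both sides is an assumption, since the tables
      only cover inodes with a zero link count somewhere.

```c
#include <assert.h>

/* Hypothetical mirrors of the tree comparison results named above. */
enum model_cmp {
	MODEL_TREE_NEW,
	MODEL_TREE_DELETED,
	MODEL_TREE_CHANGED,
};

enum model_action {
	MODEL_IGNORE,             /* set ignore_cur_inode, emit nothing */
	MODEL_AS_USUAL,           /* existing logic, no change needed */
	MODEL_NEW_GEN_DELETION,   /* treat as new generation, only delete */
	MODEL_NEW_GEN_CREATION,   /* treat as new generation, only create */
};

/*
 * Map (comparison result, nlink in parent, nlink in send snapshot) to the
 * action listed in the tables. For MODEL_TREE_NEW only send_nlink is
 * meaningful, for MODEL_TREE_DELETED only parent_nlink.
 */
enum model_action model_orphan_action(enum model_cmp cmp,
				      unsigned int parent_nlink,
				      unsigned int send_nlink)
{
	switch (cmp) {
	case MODEL_TREE_NEW:
		/* Case 1: a new orphan inode is never created. */
		return send_nlink == 0 ? MODEL_IGNORE : MODEL_AS_USUAL;
	case MODEL_TREE_DELETED:
		/* Case 2: no paths are found, so nothing is deleted. */
		return MODEL_AS_USUAL;
	case MODEL_TREE_CHANGED:
		/* Case 3, subcases 1 to 3. */
		if (parent_nlink == 0 && send_nlink == 0)
			return MODEL_IGNORE;
		if (parent_nlink > 0 && send_nlink == 0)
			return MODEL_NEW_GEN_DELETION;
		if (parent_nlink == 0 && send_nlink > 0)
			return MODEL_NEW_GEN_CREATION;
		return MODEL_AS_USUAL;  /* assumption: both link counts > 0 */
	}
	return MODEL_AS_USUAL;
}
```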
      Reviewed-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: refactor arguments of get_inode_info() · 7e93f6dc
      BingJing Chang authored
      Refactor get_inode_info() to populate all wanted fields on an output
      structure. Besides, also introduce a helper function called
      get_inode_gen(), which is commonly used.
      Reviewed-by: Robbie Ko <robbieko@synology.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: BingJing Chang <bingjingc@synology.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary EXTENT_UPTODATE state in buffered I/O path · 52b029f4
      Ethan Lien authored
      After we copy data to the page cache in buffered I/O, we
      1. Insert an EXTENT_UPTODATE state into the inode's io_tree, by
         endio_readpage_release_extent(), set_extent_delalloc() or
         set_extent_defrag().
      2. Set page uptodate before we unlock the page.
      
      But the only place we check io_tree's EXTENT_UPTODATE state is in
      btrfs_do_readpage(). We know we enter btrfs_do_readpage() only when we
      have a non-uptodate page, so it is unnecessary to set EXTENT_UPTODATE.
      
      For example, when performing a buffered random read:
      
      	fio --rw=randread --ioengine=libaio --direct=0 --numjobs=4 \
      		--filesize=32G --size=4G --bs=4k --name=job \
      		--filename=/mnt/file --name=job
      
      Then check how many extent_state entries are in io_tree:
      
      	cat /proc/slabinfo | grep btrfs_extent_state | awk '{print $2}'
      
      w/o this patch, we got 640567 btrfs_extent_state.
      w/  this patch, we got    204 btrfs_extent_state.
      
      Maintaining such a big tree brings overhead since every I/O needs to insert
      EXTENT_LOCKED, insert EXTENT_UPTODATE, then remove EXTENT_LOCKED. And in
      every insert or remove, we need to lock io_tree, do tree search, alloc or
      dealloc extent states. By removing unnecessary EXTENT_UPTODATE, we keep
      io_tree in a minimal size and reduce overhead when performing buffered I/O.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Ethan Lien <ethanlien@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify adding and replacing references during log replay · 7059c658
      Filipe Manana authored
      During log replay, when adding/replacing inode references, there are two
      special cases that have special code for them:
      
      1) When we have an inode with two or more hardlinks in the same directory,
         therefore two or more names encoded in the same inode reference item,
         and one of the hard links gets renamed to the old name of another hard
         link - that is, the index number for a name changes. This was added in
         commit 0d836392 ("Btrfs: fix mount failure after fsync due to
         hard link recreation"), and is covered by test case generic/502 from
         fstests;
      
      2) When we have several inodes that got renamed to an old name of some
         other inode, in a cascading style. The code to deal with this special
         case was added in commit 6b5fc433 ("Btrfs: fix fsync after
         succession of renames of different files"), and is covered by test
         cases generic/526 and generic/527 from fstests.
      
      Both cases can be dealt with by making sure __add_inode_ref() is always
      called by add_inode_ref() for every name encoded in the inode reference
      item, and not just for the first name that has a conflict. With such
      change we no longer need that special casing for the two cases mentioned
      before. So do those changes.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: show discard stats and tunables in non-debug build · fb731430
      David Sterba authored
      When discard=async was introduced there were also sysfs knobs and stats
      for debugging and tuning, hidden under CONFIG_BTRFS_DEBUG. The defaults
      have been set and so far seem to satisfy all users on a range of
      workloads. Besides the tunables (like iops or kbps) there are also
      stats tracking the amount of discardable bytes, and those should be
      available whenever async discard is on (and not otherwise).
      
      The stats are moved from the per-fs debug directory, so they are now under
        /sys/fs/btrfs/FSID/discard
      
      - discard_bitmap_bytes - amount of discarded bytes from data tracked as
                               bitmaps
      - discard_extent_bytes - ditto, but for data tracked as extents
      - discard_bytes_saved -
      - discardable_bytes - amount of bytes that can be discarded
      - discardable_extents - number of extents to be discarded
      - iops_limit - tunable limit of number of discard IOs to be issued
      - kbps_limit - tunable limit of kilobytes per second issued as discard IO
      - max_discard_size - tunable limit for size of one IO discard request
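
      As a rough illustration of how a userspace tool might read one of these
      files, here is a minimal C sketch. Both helper names are hypothetical,
      and treating the file content as a single signed decimal number is an
      assumption about the attribute format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/fs/btrfs/<fsid>/discard/<attr>" into buf.
 * Returns 0 on success, -1 if the path does not fit.
 */
int model_discard_attr_path(char *buf, size_t len,
			    const char *fsid, const char *attr)
{
	int n = snprintf(buf, len, "/sys/fs/btrfs/%s/discard/%s", fsid, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read one decimal number from an attribute file.
 * Returns 0 on success, -1 if the file is missing or unparsable
 * (for example when async discard is not enabled).
 */
int model_read_s64(const char *path, long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%lld", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

      A caller would substitute the filesystem's real FSID and one of the
      attribute names listed above, such as discardable_bytes.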
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use delayed items when logging a directory · 30b80f3c
      Filipe Manana authored
      When logging a directory we start by flushing all its delayed items.
      That results in adding dir index items to the subvolume btree, for new
      dentries, and removing dir index items from the subvolume btree for any
      dentries that were deleted.
      
      This makes it straightforward to log a directory simply by iterating over
      all the modified subvolume btree leaves, especially when we used to log
      both dir index keys and dir item keys (before commit 339d0354
      ("btrfs: only copy dir index keys when logging a directory")) and when we
      used to copy old dir index entries for leaves modified in the current
      transaction (before commit 732d591a ("btrfs: stop copying old dir
      items when logging a directory")).
      
      From an efficiency point of view this has a couple of drawbacks:
      
      1) Adds extra latency, due to copying delayed items to the subvolume btree
         and deleting dir index items from the btree.
      
         Further if there are other tasks accessing the btree, which is common
         (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks,
         etc, finishing an ordered extent, etc), lock contention can cause
         further delays, both to the task logging a directory and to the other
         tasks accessing the btree;
      
      2) More time spent overall flushing delayed items, if after logging the
         directory further changes are done to the directory in the same
         transaction.
      
         For example, if we add 10 dentries to a directory, fsync it, add 10
         more dentries, fsync it again, then add 10 more dentries and fsync it
         again, then we end up inserting 3 batches of 10 items to the subvolume
         btree. With the changes from this patch, we flush all the delayed items
         to the btree only once - a single batch of 30 items, and outside the
         logging code (transaction commit or when delayed items are flushed
         asynchronously).
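
      The effect described in point 2 can be illustrated by counting
      insertion batches for the example above. This is a toy userspace
      model, not kernel code: the event encoding, struct flush_stats and
      count_flush_batches() are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names exist in btrfs. */
struct flush_stats {
	unsigned int batches;	/* insertion batches into the subvolume btree */
	unsigned int items;	/* total dir index items inserted */
};

/*
 * events[i] > 0 means "add that many dentries", events[i] == 0 means
 * "fsync the directory". If flush_on_fsync is non-zero, every fsync
 * flushes the pending delayed items (old behaviour). Otherwise they stay
 * pending until the final flush, which stands in for the transaction
 * commit or the asynchronous flushing.
 */
struct flush_stats count_flush_batches(const int *events, size_t nr_events,
				       int flush_on_fsync)
{
	struct flush_stats st = { 0, 0 };
	unsigned int pending = 0;

	assert(events != NULL || nr_events == 0);

	for (size_t i = 0; i < nr_events; i++) {
		if (events[i] > 0) {
			pending += (unsigned int)events[i];
		} else if (flush_on_fsync && pending > 0) {
			st.batches++;
			st.items += pending;
			pending = 0;
		}
	}

	/* Whatever is still delayed is flushed once at "commit" time. */
	if (pending > 0) {
		st.batches++;
		st.items += pending;
	}
	return st;
}
```

      Feeding it the sequence { 10, 0, 10, 0, 10, 0 } with flush_on_fsync
      set models the old behaviour, and with it cleared models the
      behaviour after this patch.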
      
      This change simply skips the flushing of delayed items every time we log a
      directory. Instead we copy the delayed insertion items directly to the log
      tree and delete delayed deletion items directly from the log tree.
      This avoids first changing the subvolume btree and then scanning it
      for new items to copy to the log tree, and it avoids detecting deletions
      by observing gaps in consecutive dir index keys in subvolume btree leaves.
      
      Running the following tests on a non-debug kernel (Debian's default kernel
      config), on a box with an NVMe device, a 12-core Intel CPU and 64G of RAM,
      produced the results below.
      
      The results compare a branch without this patch and all the other patches
      it depends on versus the same branch with the patchset applied.
      
      The patchset is comprised of the following patches:
      
        btrfs: don't drop dir index range items when logging a directory
        btrfs: remove the root argument from log_new_dir_dentries()
        btrfs: update stale comment for log_new_dir_dentries()
        btrfs: free list element sooner at log_new_dir_dentries()
        btrfs: avoid memory allocation at log_new_dir_dentries() for common case
        btrfs: remove root argument from btrfs_delayed_item_reserve_metadata()
        btrfs: store index number instead of key in struct btrfs_delayed_item
        btrfs: remove unused logic when looking up delayed items
        btrfs: shrink the size of struct btrfs_delayed_item
        btrfs: search for last logged dir index if it's not cached in the inode
        btrfs: move need_log_inode() to above log_conflicting_inodes()
        btrfs: move log_new_dir_dentries() above btrfs_log_inode()
        btrfs: log conflicting inodes without holding log mutex of the initial inode
        btrfs: skip logging parent dir when conflicting inode is not a dir
        btrfs: use delayed items when logging a directory
      
      Custom test script for testing time spent at btrfs_log_inode():
      
         #!/bin/bash
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
      
         # Total number of files to create in the test directory.
         NUM_FILES=10000
         # Fsync after creating or renaming N files.
         FSYNC_AFTER=100
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount -o ssd $DEV $MNT
      
         TEST_DIR=$MNT/testdir
         mkdir $TEST_DIR
      
         echo "Creating files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 echo -n > $TEST_DIR/file_$i
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         sync
      
         echo "Renaming files..."
         for ((i = 1; i <= $NUM_FILES; i++)); do
                 mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed
                 if (( ($i % $FSYNC_AFTER) == 0 )); then
                         xfs_io -c "fsync" $TEST_DIR
                 fi
         done
      
         umount $MNT
      
      And using the following bpftrace script to capture the total time that is
      spent at btrfs_log_inode():
      
         #!/usr/bin/bpftrace
      
         k:btrfs_log_inode
         {
                 @start_log_inode[tid] = nsecs;
         }
      
         kr:btrfs_log_inode
         /@start_log_inode[tid]/
         {
                 $dur = (nsecs - @start_log_inode[tid]) / 1000;
                 @btrfs_log_inode_total_time = sum($dur);
                 delete(@start_log_inode[tid]);
         }
      
         END
         {
                 clear(@start_log_inode);
         }
      
      Result before applying patchset:
      
         @btrfs_log_inode_total_time: 622642
      
      Result after applying patchset:
      
         @btrfs_log_inode_total_time: 354134    (-43.1% time spent)
      
      The following dbench script was also used for testing:
      
         #!/bin/bash
      
         NUM_JOBS=$(nproc --all)
      
         DEV=/dev/nvme0n1
         MNT=/mnt/nvme0n1
         MOUNT_OPTIONS="-o ssd"
         MKFS_OPTIONS="-O no-holes -R free-space-tree"
      
         echo "performance" | \
             tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $MKFS_OPTIONS $DEV
         mount $MOUNT_OPTIONS $DEV $MNT
      
         dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS
      
         umount $MNT
      
      Before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    3322265     0.034    21.032
       Close        2440562     0.002     0.994
       Rename        140664     1.150   269.633
       Unlink        670796     1.093   269.678
       Deltree           96     5.481    15.510
       Mkdir             48     0.004     0.052
       Qpathinfo    3010924     0.014     8.127
       Qfileinfo     528055     0.001     0.518
       Qfsinfo       552113     0.003     0.372
       Sfileinfo     270575     0.005     0.688
       Find         1164176     0.052    13.931
       WriteX       1658537     0.019     5.918
       ReadX        5207412     0.003     1.034
       LockX          10818     0.003     0.079
       UnlockX        10818     0.002     0.313
       Flush         232811     1.027   269.735
      
      Throughput 869.867 MB/sec (sync dirs)  12 clients  12 procs  max_latency=269.741 ms
      
      After patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4152738     0.029    20.863
       Close        3050770     0.002     1.119
       Rename        175829     0.871   211.741
       Unlink        838447     0.845   211.724
       Deltree          120     4.798    14.162
       Mkdir             60     0.003     0.005
       Qpathinfo    3763807     0.011     4.673
       Qfileinfo     660111     0.001     0.400
       Qfsinfo       690141     0.003     0.429
       Sfileinfo     338260     0.005     0.725
       Find         1455273     0.046     6.787
       WriteX       2073307     0.017     5.690
       ReadX        6509193     0.003     1.171
       LockX          13522     0.003     0.077
       UnlockX        13522     0.002     0.125
       Flush         291044     0.811   211.631
      
      Throughput 1089.27 MB/sec (sync dirs)  12 clients  12 procs  max_latency=211.750 ms
      
      (+25.2% throughput, -21.5% max latency)
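
      For reference, the percentages quoted in this message are plain
      relative changes between the "before" and "after" figures. A minimal
      sketch of that arithmetic follows; pct_change() is a helper name
      invented here, not something taken from the test scripts.

```c
#include <assert.h>

/* Relative change from 'before' to 'after', in percent (negative = decrease). */
double pct_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

      pct_change(869.867, 1089.27) gives the throughput change,
      pct_change(269.741, 211.750) the max latency change, and
      pct_change(622642, 354134) the change in time spent at
      btrfs_log_inode().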
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      30b80f3c
    • Filipe Manana's avatar
      btrfs: skip logging parent dir when conflicting inode is not a dir · 5557a069
      Filipe Manana authored
      When we find a conflicting inode (an inode that had the same name and
      parent directory as the inode we are logging now) that was deleted in the
      current transaction, we always end up logging its parent directory.
      
      This is to deal with the case where the conflicting inode corresponds to
      a deleted subvolume/snapshot or a directory that had subvolumes/snapshots
      (or some subdirectory inside it had subvolumes/snapshots, etc), because
      we can't deal with dropping subvolumes/snapshots during log replay. So
      if we log the parent directory, and if we are dealing with these special
      cases, then we fall back to a transaction commit when logging the parent,
      because its last_unlink_trans will match the current transaction (which
      gets set and propagated when a subvolume/snapshot is deleted).
      
      This change skips the logging of the parent directory when the conflicting
      inode is not a directory (or a subvolume/snapshot). This is ok because in
      this case logging the current inode is enough to trigger an unlink of the
      conflicting inode during log replay.
      
      So for a case like this:
      
        $ mkdir /mnt/dir
        $ echo -n "first foo data" > /mnt/dir/foo
      
        $ sync
      
        $ rm -f /mnt/dir/foo
        $ echo -n "second foo data" > /mnt/dir/foo
        $ xfs_io -c "fsync" /mnt/dir/foo
      
      We avoid logging parent directory "dir" when logging the new file "foo".
      In other cases it avoids falling back to a transaction commit, when the
      parent directory has a last_unlink_trans value that matches the current
      transaction, due to moving a file from it to some other directory.
      
      This is a case that happens frequently with dbench for example, where a
      new file that has the name/parent of another file that was deleted in the
      current transaction, is fsynced.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5557a069
    • Filipe Manana's avatar
      btrfs: log conflicting inodes without holding log mutex of the initial inode · e09d94c9
      Filipe Manana authored
      When logging an inode, if we detect the inode has a reference that
      conflicts with some other inode that got renamed, we log that other inode
      while holding the log mutex of the current inode. We then find out if
      there are other inodes that conflict with the first conflicting inode,
      and log them while under the log mutex of the original inode. This is
      fine because the recursion can only happen once.
      
      For the upcoming work where we directly log delayed items without flushing
      them first to the subvolume tree, this recursion adds a lot of complexity
      and it's hard to keep lockdep happy about it.
      
      So collect a list of conflicting inodes and then log the inodes after
      unlocking the log mutex of the inode we started with.
      
      Also limit the maximum number of conflict inodes we log to 10, to avoid
      spending too much time logging (and maybe allocating too many list
      elements too), as typically we don't have more than 1 or 2 conflicting
      inodes - if we go over the limit, simply fall back to a transaction commit.
      
      A user could intentionally create a very long list of conflicting
      inodes with a very long succession of renames like this:
      
        (...)
        rename E to F
        rename D to E
        rename C to D
        rename B to C
        rename A to B
        touch A (create a new file named A)
        fsync A
      
      If that happened for a sequence of hundreds or thousands of renames, it
      could massively slow down the logging and cause other secondary effects,
      for example blocking other fsync operations and transaction commits
      for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM
      first). Such cases are very uncommon in practice, but it's better to be
      prepared for them and avoid chaos. Such a long sequence of conflicting
      inodes could be created before this change.
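
      The collect-then-log approach with a cap can be sketched in userspace
      as below. This is only an illustration under stated assumptions: the
      structures, add_conflict_inode() and the LOG_FORCE_COMMIT return code
      are hypothetical stand-ins, and only the limit of 10 comes from the
      text above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CONFLICT_INODES	10	/* limit taken from the text above */
#define LOG_FORCE_COMMIT	1	/* hypothetical "fall back to commit" code */

struct conflict_inode {
	uint64_t ino;
	struct conflict_inode *next;
};

struct log_ctx {
	struct conflict_inode *conflicts;	/* collected, not yet logged */
	unsigned int nr_conflicts;
};

/*
 * Remember a conflicting inode so it can be logged later, after the log
 * mutex of the initial inode has been released. Returns 0 on success,
 * LOG_FORCE_COMMIT if the cap is reached, -1 on allocation failure.
 */
int add_conflict_inode(struct log_ctx *ctx, uint64_t ino)
{
	struct conflict_inode *ci;

	assert(ctx != NULL);
	if (ctx->nr_conflicts >= MAX_CONFLICT_INODES)
		return LOG_FORCE_COMMIT;

	ci = malloc(sizeof(*ci));
	if (!ci)
		return -1;
	ci->ino = ino;
	ci->next = ctx->conflicts;
	ctx->conflicts = ci;
	ctx->nr_conflicts++;
	return 0;
}

/* Release every collected element, e.g. after logging or on fallback. */
void free_conflict_inodes(struct log_ctx *ctx)
{
	assert(ctx != NULL);
	while (ctx->conflicts) {
		struct conflict_inode *ci = ctx->conflicts;

		ctx->conflicts = ci->next;
		free(ci);
	}
	ctx->nr_conflicts = 0;
}
```

      The caller would walk the collected list only after dropping the log
      mutex of the inode it started with, and treat LOG_FORCE_COMMIT as a
      request to commit the transaction instead.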
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e09d94c9
    • Filipe Manana's avatar
      btrfs: move log_new_dir_dentries() above btrfs_log_inode() · f6d86dbe
      Filipe Manana authored
      The static function log_new_dir_dentries() is currently defined below
      btrfs_log_inode(), but in an upcoming patch a new function is introduced
      that is called by btrfs_log_inode() and this new function needs to call
      log_new_dir_dentries(). So move log_new_dir_dentries() to a location
      between btrfs_log_inode() and need_log_inode() (the latter is called by
      log_new_dir_dentries()).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f6d86dbe
    • Filipe Manana's avatar
      btrfs: move need_log_inode() to above log_conflicting_inodes() · a3751024
      Filipe Manana authored
      The static function need_log_inode() is defined below btrfs_log_inode()
      and log_conflicting_inodes(), but in the next patches in the series we
      will need to call need_log_inode() in a couple of new functions that will be
      used by btrfs_log_inode(). So move its definition to a location above
      log_conflicting_inodes().
      
      Also make its arguments 'const', since they are not supposed to be
      modified.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a3751024
    • Filipe Manana's avatar
      btrfs: search for last logged dir index if it's not cached in the inode · 193df624
      Filipe Manana authored
      The key offset of the last dir index item that was logged is stored in
      the inode's last_dir_index_offset field. However that field is not
      persisted in the inode item or elsewhere, so if the inode gets evicted
      and reloaded, the field gets a value of (u64)-1. In that case, when
      logging dir index items we check if each one was logged before, to avoid
      attempts to insert duplicated keys and falling back to a transaction
      commit.
      
      Improve on this by searching for the last dir index that was logged when
      we start logging a directory if the inode's last_dir_index_offset is not
      set (has a value of (u64)-1) and it was logged before. This avoids
      checking if each dir index item we find was already logged before, and
      simplifies the logging of dir index items (process_dir_items_leaf()).
      
      This will also be needed for an incoming change where we start logging
      delayed items directly, without flushing them first.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      193df624
    • Filipe Manana's avatar
      btrfs: shrink the size of struct btrfs_delayed_item · 4c469798
      Filipe Manana authored
      Currently struct btrfs_delayed_item has a base size of 96 bytes, but its
      size can be decreased by doing the following 2 tweaks:
      
      1) Change data_len from u32 to u16. Our maximum possible leaf size is 64K,
         so the data_len can never be larger than that, and in fact it is always
         much smaller than that. The max length for a dentry's name is ensured
         at the VFS level (PATH_MAX, 4096 bytes) and in struct btrfs_inode_ref
         and btrfs_dir_item we use a u16 to store the name's length;
      
      2) Change 'ins_or_del' to a 1 bit enum, which is all we need since it
         can only have 2 values. After this there's also no longer the need to
         BUG_ON() before using 'ins_or_del' in several places. Also rename the
         field from 'ins_or_del' to 'type', which is clearer.
      
      These two tweaks decrease the size of struct btrfs_delayed_item from 96
      bytes down to 88 bytes. A previous patch already reduced the size of this
      structure by 16 bytes, but an upcoming change will increase its size by
      16 bytes (adding a struct list_head element).
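
      The effect of the two tweaks on structure padding can be reproduced
      with a userspace approximation. The member list and order below are
      assumptions made for illustration, not the kernel definition, and the
      stub types only mimic the 64-bit sizes of rb_node and list_head. The
      resulting sizes depend on the ABI; the layout is meant for a typical
      64-bit Linux build with GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins sized like the kernel's rb_node and list_head on 64-bit. */
struct rb_node_stub { uintptr_t parent_color; void *right; void *left; };
struct list_head_stub { void *next; void *prev; };

/* Approximate layout before the change (member order is an assumption). */
struct delayed_item_before {
	struct rb_node_stub rb_node;
	uint64_t index;
	struct list_head_stub tree_list;
	struct list_head_stub readdir_list;
	uint64_t bytes_reserved;
	void *delayed_node;
	int refs;
	int ins_or_del;		/* a full int for a two-valued flag */
	uint32_t data_len;
	char data[];
};

/* Same members with the two tweaks applied. */
struct delayed_item_after {
	struct rb_node_stub rb_node;
	uint64_t index;
	struct list_head_stub tree_list;
	struct list_head_stub readdir_list;
	uint64_t bytes_reserved;
	void *delayed_node;
	int refs;
	uint16_t data_len;	/* u32 -> u16 */
	unsigned int type : 1;	/* two possible values fit in one bit */
	char data[];
};

/* Bytes saved per item by the narrower members, as seen by the compiler. */
size_t delayed_item_bytes_saved(void)
{
	assert(sizeof(struct delayed_item_after) <=
	       sizeof(struct delayed_item_before));
	return sizeof(struct delayed_item_before) -
	       sizeof(struct delayed_item_after);
}
```

      The saving comes from data_len and the one-bit type sharing the 4
      bytes after refs, which removes one 8-byte aligned slot. Running
      pahole on the real structure is the authoritative way to check the
      kernel layout.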
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4c469798
    • Filipe Manana's avatar
      btrfs: remove unused logic when looking up delayed items · 4cbf37f5
      Filipe Manana authored
      All callers pass NULL to the 'prev' and 'next' arguments of the function
      __btrfs_lookup_delayed_item(), so remove these arguments. Also, remove
      the unnecessary wrapper __btrfs_lookup_delayed_insertion_item(), making
      btrfs_delete_delayed_insertion_item() directly call
      __btrfs_lookup_delayed_item().
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4cbf37f5
    • Filipe Manana's avatar
      btrfs: store index number instead of key in struct btrfs_delayed_item · 96d89923
      Filipe Manana authored
      All delayed items are for dir index keys, so there's really no point in
      having an embedded struct btrfs_key in struct btrfs_delayed_item, which
      makes the structure use more space than necessary (and adds a hole of 7
      bytes).
      
      So replace the key field with an index number (u64), which reduces the
      size of struct btrfs_delayed_item from 112 bytes down to 96 bytes.
      
      Some upcoming work will increase the structure size by 16 bytes, so this
      change compensates for that future size increase.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      96d89923
    • Filipe Manana's avatar
      btrfs: remove root argument from btrfs_delayed_item_reserve_metadata() · df492881
      Filipe Manana authored
      The root argument of btrfs_delayed_item_reserve_metadata() is used only
      to get the fs_info object, but we already have a transaction handle, which
      we can use to get the fs_info. So remove the root argument.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      df492881
    • Filipe Manana's avatar
      btrfs: avoid memory allocation at log_new_dir_dentries() for common case · 009d9bea
      Filipe Manana authored
      At log_new_dir_dentries() we always start by allocating a list element
      for the starting inode and then do a while loop with the condition being
      a list emptiness check.
      
      This however is not needed: we can avoid allocating this initial list
      element and then just check for the list emptiness at the end of the
      loop's body. So just do that to save one memory allocation from the
      kmalloc-32 slab.
      
      This allows for not doing any memory allocation when we don't have any
      subdirectory to log, which is a very common case.
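
      The loop shape described above can be sketched in userspace as
      follows. This is a simplified model, not the kernel function: struct
      dir, struct dir_list_elem and walk_dirs() are hypothetical, and
      "logging" a directory is reduced to counting it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical in-memory directory, standing in for an inode. */
struct dir {
	int nr_subdirs;
	struct dir **subdirs;
};

struct dir_list_elem {
	struct dir *dir;
	struct dir_list_elem *next;
};

/*
 * Visit 'start' and all directories below it. Returns the number of
 * directories visited, or -1 on allocation failure. *nr_allocs receives
 * the number of list elements that had to be allocated.
 */
int walk_dirs(struct dir *start, int *nr_allocs)
{
	struct dir_list_elem *head = NULL;
	struct dir *cur = start;
	int visited = 0;

	assert(start != NULL && nr_allocs != NULL);
	*nr_allocs = 0;

	while (1) {
		struct dir_list_elem *elem;

		visited++;	/* "log" the current directory */

		/* Only discovered subdirectories need a list element. */
		for (int i = 0; i < cur->nr_subdirs; i++) {
			elem = malloc(sizeof(*elem));
			if (!elem) {
				/* Release everything still queued. */
				while (head) {
					elem = head;
					head = elem->next;
					free(elem);
				}
				return -1;
			}
			(*nr_allocs)++;
			elem->dir = cur->subdirs[i];
			elem->next = head;
			head = elem;
		}

		/* Emptiness check at the end of the loop body. */
		if (!head)
			break;

		elem = head;
		head = elem->next;
		cur = elem->dir;
		free(elem);
	}
	return visited;
}
```

      For a directory without subdirectories the function returns after one
      iteration without calling malloc() at all, which is the common case
      the message refers to.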
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      009d9bea
    • Filipe Manana's avatar
      btrfs: free list element sooner at log_new_dir_dentries() · 40084813
      Filipe Manana authored
      At log_new_dir_dentries(), there's no need to keep the current list
      element allocated while processing the leaves with directory items for
      the current directory, and while logging other inodes. Plus in case we
      find a subdirectory, we also end up allocating a new list element while
      the current one is still allocated, temporarily using more memory than
      necessary.
      
      So free the current list element early on, before processing leaves.
      Also simplify the removal and release of all list elements on error by
      eliminating the label and goto and adding an explicit loop that
      releases all list elements when an error happens.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      40084813
    • Filipe Manana's avatar
      btrfs: update stale comment for log_new_dir_dentries() · b96c552b
      Filipe Manana authored
      The comment refers to the function log_dir_items() in order to check why
      the inodes of new directory entries need to be logged, but the relevant
      comments are no longer at log_dir_items(), they were moved to the function
      process_dir_items_leaf() in commit eb10d85e ("btrfs: factor out the
      copying loop of dir items from log_dir_items()"). So update it with the
      current function name.
      
      Also replace references to i_mutex with "VFS lock", since the inode lock
      has not been a mutex since 2016 (it's now a rw semaphore).
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b96c552b
    • Filipe Manana's avatar
      btrfs: remove the root argument from log_new_dir_dentries() · 8786a6d7
      Filipe Manana authored
      There's no point in passing a root argument to log_new_dir_dentries()
      because it always corresponds to the root of the given inode. So remove
      it and extract the root from the given inode.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8786a6d7
    • Filipe Manana's avatar
      btrfs: don't drop dir index range items when logging a directory · 04fc7d51
      Filipe Manana authored
      When logging a directory that was previously logged in the current
      transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key
      type). This is because we will process all leaves in the subvolume's tree
      that were changed in the current transaction and then add range items for
      covering new dir index items and deleted dir index items, which could
      now cover a larger range than before.
      
      We used to fail if we tried to insert a range item key that already
      exists, so we dropped all range items to avoid failing. However nowadays,
      since commit 750ee454 ("btrfs: fix assertion failure when logging
      directory key range item"), we simply update any range item that already
      exists, increasing its range's last dir index if needed. Since the range
      covered by a range item can never decrease, due to the fact that dir index
      values come from a monotonically increasing counter and are never reused,
      we can stop dropping all range items before we start logging a directory.
      By not dropping the items we can avoid having occasional tree rebalance
      operations.
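
      The "update instead of drop and re-insert" behaviour relied on here
      can be modelled with a small array-based sketch. It is not the btree
      code: struct dir_log_range and log_dir_range() are hypothetical names,
      and a flat array stands in for the log tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a dir log range item covering [first, last]. */
struct dir_log_range {
	uint64_t first;	/* plays the role of the key offset */
	uint64_t last;	/* last dir index covered by the range */
};

/*
 * Insert a range item, or update the existing one that starts at the same
 * index. Returns 0 on success, -1 if the array is full.
 */
int log_dir_range(struct dir_log_range *items, size_t *nr_items,
		  size_t capacity, uint64_t first, uint64_t last)
{
	assert(items != NULL && nr_items != NULL && first <= last);

	for (size_t i = 0; i < *nr_items; i++) {
		if (items[i].first == first) {
			/* A range may grow but is never shrunk. */
			if (last > items[i].last)
				items[i].last = last;
			return 0;
		}
	}

	if (*nr_items >= capacity)
		return -1;
	items[*nr_items].first = first;
	items[*nr_items].last = last;
	(*nr_items)++;
	return 0;
}
```

      Because an existing item is updated in place, logging the same
      directory again does not need a delete followed by an insert, which
      is what avoids the occasional rebalance mentioned above.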
      
      This will also be needed for an incoming change where we start logging
      delayed items directly, without flushing them first.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      04fc7d51
    • Qu Wenruo's avatar
      btrfs: scrub: use larger block size for data extent scrub · 786672e9
      Qu Wenruo authored
      [PROBLEM]
      The existing scrub code for data extents always limits the block size to
      sectorsize.
      
      This causes quite a few extra scrub_block structures to be allocated
      (there is a data extent at logical bytenr 298844160, length 64KiB):
      
        alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
        alloc_scrub_block: new block: logical=298848256 physical=298848256 mirror=1
        alloc_scrub_block: new block: logical=298852352 physical=298852352 mirror=1
        alloc_scrub_block: new block: logical=298856448 physical=298856448 mirror=1
        alloc_scrub_block: new block: logical=298860544 physical=298860544 mirror=1
        alloc_scrub_block: new block: logical=298864640 physical=298864640 mirror=1
        alloc_scrub_block: new block: logical=298868736 physical=298868736 mirror=1
        alloc_scrub_block: new block: logical=298872832 physical=298872832 mirror=1
        alloc_scrub_block: new block: logical=298876928 physical=298876928 mirror=1
        alloc_scrub_block: new block: logical=298881024 physical=298881024 mirror=1
        alloc_scrub_block: new block: logical=298885120 physical=298885120 mirror=1
        alloc_scrub_block: new block: logical=298889216 physical=298889216 mirror=1
        alloc_scrub_block: new block: logical=298893312 physical=298893312 mirror=1
        alloc_scrub_block: new block: logical=298897408 physical=298897408 mirror=1
        alloc_scrub_block: new block: logical=298901504 physical=298901504 mirror=1
        alloc_scrub_block: new block: logical=298905600 physical=298905600 mirror=1
        ...
        scrub_block_put: free block: logical=298844160 physical=298844160 len=4096 mirror=1
        scrub_block_put: free block: logical=298848256 physical=298848256 len=4096 mirror=1
        scrub_block_put: free block: logical=298852352 physical=298852352 len=4096 mirror=1
        scrub_block_put: free block: logical=298856448 physical=298856448 len=4096 mirror=1
        scrub_block_put: free block: logical=298860544 physical=298860544 len=4096 mirror=1
        scrub_block_put: free block: logical=298864640 physical=298864640 len=4096 mirror=1
        scrub_block_put: free block: logical=298868736 physical=298868736 len=4096 mirror=1
        scrub_block_put: free block: logical=298872832 physical=298872832 len=4096 mirror=1
        scrub_block_put: free block: logical=298876928 physical=298876928 len=4096 mirror=1
        scrub_block_put: free block: logical=298881024 physical=298881024 len=4096 mirror=1
        scrub_block_put: free block: logical=298885120 physical=298885120 len=4096 mirror=1
        scrub_block_put: free block: logical=298889216 physical=298889216 len=4096 mirror=1
        scrub_block_put: free block: logical=298893312 physical=298893312 len=4096 mirror=1
        scrub_block_put: free block: logical=298897408 physical=298897408 len=4096 mirror=1
        scrub_block_put: free block: logical=298901504 physical=298901504 len=4096 mirror=1
        scrub_block_put: free block: logical=298905600 physical=298905600 len=4096 mirror=1
      
      This behavior wastes a lot of memory, especially after we have moved
      quite a few members from scrub_sector to scrub_block.
      
      [FIX]
      To reduce the allocation of scrub_block, and to reduce memory usage, use
      BTRFS_STRIPE_LEN instead of sectorsize as the block size to scrub data
      extents.
      
      This results in only one scrub_block being allocated for the above data extent:
      
        alloc_scrub_block: new block: logical=298844160 physical=298844160 mirror=1
        scrub_block_put: free block: logical=298844160 physical=298844160 len=65536 mirror=1
      
      This would greatly reduce the memory usage (even if it's just transient)
      when scrubbing larger data extents.
      
      For the above example, the memory usage would be:
      
      Old: num_sectors * (sizeof(scrub_block) + sizeof(scrub_sector))
           16          * (408                 + 96) = 8064
      
      New: sizeof(scrub_block) + num_sectors * sizeof(scrub_sector)
           408                 + 16          * 96 = 1944
      
      A good reduction of 75.9%.
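
      The arithmetic above can be checked with a few lines of C. The two
      structure sizes are the ones quoted in this message; the function and
      macro names are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Structure sizes quoted in the text above, in bytes. */
#define SCRUB_BLOCK_SIZE	408
#define SCRUB_SECTOR_SIZE	96

/* Old scheme: one scrub_block plus one scrub_sector per sector. */
size_t scrub_mem_per_sector_blocks(size_t num_sectors)
{
	return num_sectors * (SCRUB_BLOCK_SIZE + SCRUB_SECTOR_SIZE);
}

/* New scheme: a single scrub_block shared by all scrub_sectors. */
size_t scrub_mem_single_block(size_t num_sectors)
{
	return SCRUB_BLOCK_SIZE + num_sectors * SCRUB_SECTOR_SIZE;
}

/* Reduction in percent when going from the old to the new scheme. */
double scrub_mem_reduction_pct(size_t num_sectors)
{
	size_t old_bytes = scrub_mem_per_sector_blocks(num_sectors);
	size_t new_bytes = scrub_mem_single_block(num_sectors);

	assert(num_sectors > 0);
	return 100.0 * (double)(old_bytes - new_bytes) / (double)old_bytes;
}
```

      Calling the three functions with num_sectors = 16 covers the 64KiB
      extent with 4KiB sectors used in the example.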
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      786672e9
    • Qu Wenruo's avatar
      btrfs: scrub: move logical/physical/dev/mirror_num from scrub_sector to scrub_block · 8686c40e
      Qu Wenruo authored
      Currently we store the following members in scrub_sector:
      
      - logical
      - physical
      - physical_for_dev_replace
      - dev
      - mirror_num
      
      However the current scrub code ensures that a scrub_block never crosses
      a stripe boundary.
      This is guaranteed by the entry functions (scrub_simple_mirror,
      scrub_simple_stripe).
      
      This makes it possible to move those members into scrub_block rather
      than keeping them in scrub_sector.
      
      This should save quite some memory, as a scrub_block can be as large as 64
      sectors, even for metadata it's 16 sectors by default.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      8686c40e
    • Qu Wenruo's avatar
      btrfs: scrub: remove scrub_sector::page and use scrub_block::pages instead · eb2fad30
      Qu Wenruo authored
      Although scrub currently works for subpage (PAGE_SIZE > sectorsize) cases,
      it will allocate one page for each scrub_sector, which can cause extra
      unnecessary memory usage.
      
      Utilize scrub_block::pages[] instead of allocating a page for each
      scrub_sector. This allows us to integrate larger extents while using
      less memory.
      
      For example, if our page size is 64K, sectorsize is 4K, and we have a
      32K sized extent.
      We will only allocate one page for scrub_block, and all 8 scrub sectors
      will point to that page.
      
      To do that properly, here we introduce several small helpers:
      
      - scrub_page_get_logical()
        Get the logical bytenr of a page.
        We store the logical bytenr of the page range into page::private.
        But for 32bit systems, their (void *) is not large enough to contain
        a u64, so in that case we will need to allocate extra memory for it.
      
        For 64bit systems, we can use page::private directly.
      
      - scrub_block_get_logical()
        Just get the logical bytenr of the first page.
      
      - scrub_sector_get_page()
        Return the page which the scrub_sector points to.
      
      - scrub_sector_get_page_offset()
        Return the offset inside the page which the scrub_sector points to.
      
      - scrub_sector_get_kaddr()
        Return the address which the scrub_sector points to.
        Just a wrapper using scrub_sector_get_page() and
        scrub_sector_get_page_offset()
      
      - bio_add_scrub_sector()
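
      The page lookup done by helpers of this kind boils down to a division
      and a remainder. The sketch below shows that arithmetic in userspace;
      it is not the kernel implementation, and struct scrub_geometry and the
      three function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical geometry; in the kernel these come from the fs and arch. */
struct scrub_geometry {
	size_t page_size;	/* e.g. 4096 or 65536 */
	size_t sectorsize;	/* e.g. 4096 */
};

/* Index into the block's page array for the given sector of the block. */
size_t sector_page_index(const struct scrub_geometry *g, size_t sector_nr)
{
	assert(g != NULL && g->page_size > 0);
	return (sector_nr * g->sectorsize) / g->page_size;
}

/* Byte offset of that sector inside its page. */
size_t sector_page_offset(const struct scrub_geometry *g, size_t sector_nr)
{
	assert(g != NULL && g->page_size > 0);
	return (sector_nr * g->sectorsize) % g->page_size;
}

/* Address of the sector's data, given the block's array of mapped pages. */
char *sector_kaddr(const struct scrub_geometry *g, char **pages,
		   size_t sector_nr)
{
	assert(pages != NULL);
	return pages[sector_page_index(g, sector_nr)] +
	       sector_page_offset(g, sector_nr);
}
```

      With a 64K page and 4K sectors all 8 sectors of a 32K extent resolve
      to page index 0 at increasing offsets, matching the example above.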
      
      Please note that, even with this patch, we're still allocating one page
      for one sector for data extents.
      
      This is because in scrub_extent() we split the data extent using
      sectorsize.
      
      Reducing the memory usage for data extents will need extra work, so that
      scrub works like data read and only uses the correct sector(s).
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      eb2fad30
    • Qu Wenruo's avatar
      btrfs: scrub: introduce scrub_block::pages for more efficient memory usage for subpage · f3e01e0e
      Qu Wenruo authored
      [BACKGROUND]
      Currently for scrub, we allocate one page for each sector. This is fine
      when PAGE_SIZE == sectorsize, but can waste memory with subpage
      support.
      
      [CODE CHANGE]
      Make scrub_block contain all the pages, so if we're scrubbing an extent
      sized 64K, and our page size is also 64K, we only need to allocate one
      page.
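
      The page counts in this example follow from two round-up divisions.
      A minimal sketch, with helper names made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Pages needed when every sector gets its own page. */
size_t pages_one_per_sector(size_t extent_len, size_t sectorsize)
{
	assert(sectorsize > 0);
	return (extent_len + sectorsize - 1) / sectorsize;
}

/* Pages needed when the block owns pages shared by its sectors. */
size_t pages_per_block(size_t extent_len, size_t page_size)
{
	assert(page_size > 0);
	return (extent_len + page_size - 1) / page_size;
}
```

      For a 64K extent with 4K sectors, compare
      pages_one_per_sector(65536, 4096) with pages_per_block(65536, 65536).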
      
      [LIFESPAN CHANGE]
      Since now scrub_sector no longer holds a page, but is using
      scrub_block::pages[] instead, we have to ensure scrub_block has a longer
      lifespan for write bio. The lifespan for read bio is already large
      enough.
      
      Now scrub_block will only be released after the write bio finished.
      
      [COMING NEXT]
      Currently we only added scrub_block::pages[] for this purpose, but
      scrub_sector is still utilizing the old scrub_sector::page.
      
      The switch will happen in the next patch.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f3e01e0e
    • Qu Wenruo's avatar
      btrfs: scrub: factor out allocation and initialization of scrub_sector into helper · 5dd3d8e4
      Qu Wenruo authored
      The allocation and initialization are shared by 3 call sites, and we're
      going to change the initialization of some members in the upcoming
      patches.
      
      So factor out the allocation and initialization of scrub_sector into a
      helper, alloc_scrub_sector(), which will do the following work:
      
      - Allocate the memory for scrub_sector
      
      - Allocate a page for scrub_sector::page
      
      - Initialize scrub_sector::refs to 1
      
      - Attach the allocated scrub_sector to scrub_block
        The attachment is bidirectional, which means scrub_block::sectorv[]
        will be updated and scrub_sector::sblock will also be updated.
      
      - Update scrub_block::sector_count and do extra sanity check on it
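
      The steps above can be sketched as a userspace model. It is not the
      kernel helper: the *_model types, the array bound and the page size
      are assumptions for illustration, and malloc() stands in for the
      kernel allocators.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_SECTORS_PER_BLOCK	16	/* hypothetical cap for this model */
#define MODEL_PAGE_SIZE		4096	/* hypothetical page size */

struct scrub_block_model;

struct scrub_sector_model {
	struct scrub_block_model *sblock;	/* back pointer to the owner */
	void *page;
	int refs;
};

struct scrub_block_model {
	struct scrub_sector_model *sectors[MAX_SECTORS_PER_BLOCK];
	int sector_count;
};

/* Allocate a sector, give it a page and attach it to the block. */
struct scrub_sector_model *alloc_sector_model(struct scrub_block_model *sblock)
{
	struct scrub_sector_model *ssector;

	assert(sblock != NULL);
	/* Sanity check: refuse to overflow the block's sector array. */
	if (sblock->sector_count >= MAX_SECTORS_PER_BLOCK)
		return NULL;

	ssector = calloc(1, sizeof(*ssector));
	if (!ssector)
		return NULL;

	ssector->page = malloc(MODEL_PAGE_SIZE);
	if (!ssector->page) {
		free(ssector);
		return NULL;
	}

	ssector->refs = 1;

	/* Bidirectional attachment between sector and block. */
	sblock->sectors[sblock->sector_count] = ssector;
	ssector->sblock = sblock;
	sblock->sector_count++;

	return ssector;
}

/* Free every sector attached to the block (cleanup for the model). */
void free_block_sectors(struct scrub_block_model *sblock)
{
	assert(sblock != NULL);
	for (int i = 0; i < sblock->sector_count; i++) {
		free(sblock->sectors[i]->page);
		free(sblock->sectors[i]);
		sblock->sectors[i] = NULL;
	}
	sblock->sector_count = 0;
}
```

      Keeping all of this in one helper means a later change to how the
      page is obtained only has to touch a single function instead of
      three call sites.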
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5dd3d8e4