1. 25 Jul, 2022 33 commits
    • Stefan Roesch's avatar
      btrfs: store chunk size in space-info struct · f6fca391
      Stefan Roesch authored
      The chunk size is stored in the btrfs_space_info structure.  It is
      initialized at the start and is then used.
      
      A new API is added to update the current chunk size.  This API is used
      to be able to expose the chunk_size as a sysfs setting.
      Signed-off-by: default avatarStefan Roesch <shr@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ rename and merge helpers, switch atomic type to u64, style fixes ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f6fca391
    • Josef Bacik's avatar
      btrfs: do not batch insert non-consecutive dir indexes during log replay · 71b68e9e
      Josef Bacik authored
      While running generic/475 in a loop I got the following error
      
      BTRFS critical (device dm-11): corrupt leaf: root=5 block=31096832 slot=69, bad key order, prev (263 96 531) current (263 96 524)
      <snip>
       item 65 key (263 96 517) itemoff 14132 itemsize 33
       item 66 key (263 96 523) itemoff 14099 itemsize 33
       item 67 key (263 96 525) itemoff 14066 itemsize 33
       item 68 key (263 96 531) itemoff 14033 itemsize 33
       item 69 key (263 96 524) itemoff 14000 itemsize 33
      
      As you can see here we have 3 dir index keys with the dir index value of
      523, 524, and 525 inserted between 517 and 524.  This occurs because our
      dir index insertion code will bulk insert all dir index items on the
      node regardless of their actual key value.
      
      This makes sense on a normally running system, because if there's a gap
      in between the items there was a deletion before the item was inserted,
      so there's not going to be an overlap of the dir index items that need
      to be inserted and what exists on disk.
      
      However during log replay this isn't necessarily true, we could have any
      number of dir indexes in the tree already.
      
      Fix this by seeing if we're replaying the log, and if we are simply skip
      batching if there's a gap in the key space.
      
      This file system was left broken from the fstest, I tested this patch
      against the broken fs to make sure it replayed the log properly, and
      then btrfs checked the file system after the log replay to verify
      everything was ok.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71b68e9e
    • Filipe Manana's avatar
      btrfs: reduce amount of reserved metadata for delayed item insertion · 763748b2
      Filipe Manana authored
      Whenever we want to create a new dir index item (when creating an inode,
      create a hard link, rename a file) we reserve 1 unit of metadata space
      for it in a transaction (that's 256K for a node/leaf size of 16K), and
      then create a delayed insertion item for it to be added later to the
      subvolume's tree. That unit of metadata is kept until the delayed item
      is inserted into the subvolume tree, which may take a while to happen
      (in the worst case, it's done only when the transaction commits). If we
      have multiple dir index items to insert for the same directory, say N
      index items, and they all fit in a single leaf of metadata, then we are
      holding N units of reserved metadata space when all we need is 1 unit.
      
      This change addresses that, whenever a new delayed dir index item is
      added, we release the unit of metadata the caller has reserved when it
      started the transaction if adding that new dir index item does not
      result in touching one more metadata leaf, otherwise the reservation
      is kept by transferring it from the transaction block reserve to the
      delayed items block reserve, just like before. Given that with a leaf
      size of 16K we can have a few hundred dir index items in a single leaf
      (the exact value depends on file name lengths), this reduces pressure on
      metadata reservation by releasing unnecessary space much sooner.
      
      The following fs_mark test showed some improvement when creating many
      files in parallel on machine running a non debug kernel (debian's default
      kernel config) with 12 cores:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        FILES=100000
        THREADS=$(nproc --all)
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        OPTS="-S 0 -L 10 -n $FILES -s 0 -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     225991.3          5465891
           4      2400000            0     345728.1          5512106
           4      3600000            0     346959.5          5557653
           8      4800000            0     329643.0          5587548
           8      6000000            0     312657.4          5606717
           8      7200000            0     281707.5          5727985
          12      8400000            0      88309.8          5020422
          12      9600000            0      85835.9          5207496
          16     10800000            0      81039.2          5404964
          16     12000000            0      58548.6          5842468
      
      After:
      
      FSUse%        Count         Size    Files/sec     App Overhead
           2      1200000            0     230604.5          5778375
           4      2400000            0     348908.3          5508072
           4      3600000            0     357028.7          5484337
           6      4800000            0     342898.3          5565703
           6      6000000            0     314670.8          5751555
           8      7200000            0     282548.2          5778177
          12      8400000            0      90844.9          5306819
          12      9600000            0      86963.1          5304689
          16     10800000            0      89113.2          5455248
          16     12000000            0      86693.5          5518933
      
      The "after" results are after applying this patch and all the other
      patches in the same patchset, which is comprised of the following
      changes:
      
        btrfs: balance btree dirty pages and delayed items after a rename
        btrfs: free the path earlier when creating a new inode
        btrfs: balance btree dirty pages and delayed items after clone and dedupe
        btrfs: add assertions when deleting batches of delayed items
        btrfs: deal with deletion errors when deleting delayed items
        btrfs: refactor the delayed item deletion entry point
        btrfs: improve batch deletion of delayed dir index items
        btrfs: assert that delayed item is a dir index item when adding it
        btrfs: improve batch insertion of delayed dir index items
        btrfs: do not BUG_ON() on failure to reserve metadata for delayed item
        btrfs: set delayed item type when initializing it
        btrfs: reduce amount of reserved metadata for delayed item insertion
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      763748b2
    • Filipe Manana's avatar
      btrfs: set delayed item type when initializing it · c9d02ab4
      Filipe Manana authored
      Currently we set the type of a delayed item only after successfully
      inserting it into its respective rbtree. This is fine, as the type
      is not used anywhere before that point, but for the next patch in the
      series, there will be the need to check the type of a delayed item
      before inserting it into a rbtree.
      
      So set the type of a delayed item immediately after allocating it.
      This also makes the trivial wrappers for adding insertion and deletion
      useless, so it removes them as well.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c9d02ab4
    • Filipe Manana's avatar
      btrfs: do not BUG_ON() on failure to reserve metadata for delayed item · 3bae13e9
      Filipe Manana authored
      At btrfs_insert_delayed_dir_index(), we don't expect the metadata
      reservation for the delayed dir index item insertion to fail, because the
      caller is supposed to have reserved 1 unit of metadata space for that.
      All callers are able to deal with an error in case that happens, so there
      is no need for something so drastic as a BUG_ON() in case of failure.
      Instead just emit a warning, so that's easily noticed during development
      (fstests in particular), and return the error to the caller.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3bae13e9
    • Filipe Manana's avatar
      btrfs: improve batch insertion of delayed dir index items · 06ac264f
      Filipe Manana authored
      Currently we group delayed dir index items for insertion as a single batch
      (a single btree operation) as long as their keys are sequential in the key
      space.
      
      For example we have delayed index items for the following index keys:
      
         10, 11, 12, 15, 16, 20, 21
      
      We end up building three batches:
      
      1) First one for index keys 10, 11 and 12;
      2) Second one for index keys 15 and 16;
      3) Third one for index keys 20 and 21.
      
      However, since the dir index numbers come from a monotonically increasing
      counter and are never reused, we could group all these items into a single
      batch. The existence of holes in the sequence happens only when we had
      delayed dir index items for insertion that got deleted before they were
      flushed to the subvolume's tree.
      
      The delayed items are stored in a rbtree based on their key order, so
      we can just group items into a batch as long as they all fit in a leaf,
      and ignore if there's a gap (key offset, index number) between two
      consecutive items. This is more efficient and reduces the amount of
      time spent when running delayed items if there are gaps between dir
      index items.
      
      For example running the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=100
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
             echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        start=$(date +%s%N)
        sync
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nsync took $dur milliseconds"
      
        umount $MNT
      
      While having the following bpftrace script running in another shell:
      
        $ cat bpf-delayed-items-inserts.sh
        #!/usr/bin/bpftrace
      
        /* Must add 'noinline' to btrfs_insert_delayed_items(). */
        k:btrfs_insert_delayed_items
        {
            @start_insert_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_insert_empty_items
        /@start_insert_delayed_items[tid]/
        {
           @insert_batches = count();
        }
      
        kr:btrfs_insert_delayed_items
        /@start_insert_delayed_items[tid]/
        {
            $dur = (nsecs - @start_insert_delayed_items[tid]) / 1000;
            @btrfs_insert_delayed_items_total_time = sum($dur);
            delete(@start_insert_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_insert_delayed_items_total_time: 576
      @insert_batches: 51
      
      After this change:
      
      @btrfs_insert_delayed_items_total_time: 174
      @insert_batches: 2
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      06ac264f
    • Filipe Manana's avatar
      btrfs: assert that delayed item is a dir index item when adding it · a176affe
      Filipe Manana authored
      All delayed items are for dir index items, we don't support any other item
      types at the moment. So simplify __btrfs_add_delayed_item() and add an
      assertion for checking the item's key type. This also allows the next
      change to be simpler and avoid to check key types. In case we add support
      for different item types in the future, then we'll hit the assertion
      during development and be able to adjust any code that is assuming delayed
      items are always associated to dir index items.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a176affe
    • Filipe Manana's avatar
      btrfs: improve batch deletion of delayed dir index items · 4bd02d90
      Filipe Manana authored
      Currently we group delayed dir index items for deletion in a single batch
      (single btree operation) as long as they all exist in the same leaf and as
      long as their keys are sequential in the key space. For example if we have
      a leaf that has dir index items with offsets:
      
          2, 3, 4, 6, 7, 10
      
      And we have delayed dir index items for deleting all these indexes, and
      no delayed items for any other index keys in between, then we end up
      deleting in 3 batches:
      
      1) First batch for indexes 2, 3 and 4;
      2) Second batch for indexes 6 and 7;
      3) Third batch for index 10.
      
      This is a waste because we can delete all the index keys in a single
      batch. What matters is that each consecutive delayed index key matches
      each consecutive dir index key in a leaf.
      
      So update the logic at btrfs_batch_delete_items() to check only for a
      key match between delayed dir index items and dir index items in a leaf.
      Also avoid the useless first iteration on comparing the key of the
      first slot to delete with the key of the first delayed item, as it's
      silly since they always match, as the delayed item's key was used for
      the btree search that gave us the path we have.
      
      This is more efficient and reduces runtime of running delayed items, as
      well as lock contention on the subvolume's tree.
      
      For example, the following test script:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f $DEV
        mount $DEV $MNT
      
        NUM_FILES=1000
      
        mkdir $MNT/testdir
        for ((i = 1; i <= $NUM_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # Now delete every other file, to create gaps in the dir index keys.
        for ((i = 1; i <= $NUM_FILES; i += 2)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # Sync to force any delayed items to be flushed to the tree.
        sync
      
        start=$(date +%s%N)
        rm -fr $MNT/testdir
        end=$(date +%s%N)
        dur=$(( (end - start) / 1000000 ))
      
        echo -e "\nrm -fr took $dur milliseconds"
      
        umount $MNT
      
      Running that test script while having the following bpftrace script
      running in another shell:
      
        $ cat bpf-measure.sh
        #!/usr/bin/bpftrace
      
        /* Add 'noinline' to btrfs_delete_delayed_items()'s definition. */
        k:btrfs_delete_delayed_items
        {
            @start_delete_delayed_items[tid] = nsecs;
        }
      
        k:btrfs_del_items
        /@start_delete_delayed_items[tid]/
        {
            @delete_batches = count();
        }
      
        kr:btrfs_delete_delayed_items
        /@start_delete_delayed_items[tid]/
        {
            $dur = (nsecs - @start_delete_delayed_items[tid]) / 1000;
            @btrfs_delete_delayed_items_total_time = sum($dur);
            delete(@start_delete_delayed_items[tid]);
        }
      
      Before this change:
      
      @btrfs_delete_delayed_items_total_time: 9563
      @delete_batches: 1001
      
      After this change:
      
      @btrfs_delete_delayed_items_total_time: 7328
      @delete_batches: 509
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4bd02d90
    • Filipe Manana's avatar
      btrfs: refactor the delayed item deletion entry point · 36baa2c7
      Filipe Manana authored
      The delayed item deletion entry point, btrfs_delete_delayed_items(), is a
      bit convoluted for a few reasons:
      
      1) It's really a loop disguised with labels and goto statements;
      
      2) There's a 'delete_fail' label which isn't only for error cases, we can
         jump to that label even if no error happened, if we simply don't have
         more delayed items to delete;
      
      3) Unnecessarily keeps track of the current and previous items for no
         good reason, as after getting the next item and releasing the current
         one, it just jumps to the 'again' label just to look again for the
         first delayed item;
      
      4) When a delayed item is not in the tree (because it was already deleted
         before), it releases the item while holding a path locked, which is
         not necessary and adds more contention to the tree, specially taking
         into account that the path came from a deletion search, meaning we have
         write locks for nodes at levels 2, 1 and 0. And releasing the item is
         not computationally trivial (rb tree deletion, a kfree() and some
         trivial things).
      
      So refactor it to use a while loop and add some comments to make it more
      obvious why we can have delayed items without a matching item in the tree
      as well as why not keep the delayed node locked all the time when running
      all its deletion items. This is also a preparation for some upcoming work
      involving delayed items.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36baa2c7
    • Filipe Manana's avatar
      btrfs: deal with deletion errors when deleting delayed items · 2b1d260d
      Filipe Manana authored
      Currently, btrfs_delete_delayed_items() ignores any errors returned from
      btrfs_batch_delete_items(). This looks fishy but it's not a problem at
      the moment because:
      
      1) Two of the errors returned from btrfs_batch_delete_items() are for
         impossible cases, cases where a delayed item does not match any item
         in the leaf the path points to - btrfs_delete_delayed_items() always
         calls btrfs_batch_delete_items() with a path that points to a leaf
         that contains an item matching a delayed item;
      
      2) btrfs_batch_delete_items() may return an error from btrfs_del_items(),
         in which case it does not release the delayed items of the batch.
      
         At the moment this is harmless because btrfs_del_items() actually is
         always able to delete items, even if it returns an error - when it
         returns an error it's because it ended up with a leaf mostly empty
         (less than 1/3 full) and failed to migrate items from that leaf into
         its neighbour leaves - this is not critical, as all the items were
         deleted, we just left the tree a bit unbalanced, but it's still a
         valid tree and causes no harm, and future operations on the tree will
         eventually balance it.
      
         So even if we get an error from btrfs_del_items(), the delayed items
         will not be released but the next time we run delayed items we will
         find out, at btrfs_delete_delayed_items(), that they are not present
         in the tree anymore and then release them.
      
      This is all a bit subtle, and it's certainly prone to be a disaster in
      case btrfs_del_items() changes one day and may return errors before being
      able to delete all the requested items, in which case we could leave the
      filesystem in an inconsistent state as we would commit a transaction
      despite a failure from deleting items from the tree.
      
      So make btrfs_delete_delayed_items() check for any errors from the call
      to btrfs_batch_delete_items().
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2b1d260d
    • Filipe Manana's avatar
      btrfs: add assertions when deleting batches of delayed items · 659192e6
      Filipe Manana authored
      There are a few impossible cases that btrfs_batch_delete_items() tries to
      deal with:
      
      1) Getting a path pointing to a NULL leaf;
      2) The leaf slot is pointing beyond the last item in the leaf;
      3) We can't find a single item to delete.
      
      The first case is impossible because the given path was returned by a
      successful call to btrfs_search_slot(). Replace the BUG_ON() with an
      ASSERT for this.
      
      The second case is impossible because we are always called when a delayed
      item matches an item in the given leaf. So add an ASSERT() for that and
      if that condition is not satisfied, trigger a warning and return an error.
      
      The third case is impossible exactly because of the same reason as the
      second case. The given delayed item matches one item in the leaf, so we
      know that our batch always has at least one item. Add an ASSERT to check
      that, trigger a warning if that expectation fails and return an error.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      659192e6
    • Filipe Manana's avatar
      btrfs: balance btree dirty pages and delayed items after clone and dedupe · 6fe81a3a
      Filipe Manana authored
      When reflinking extents (clone and deduplication), we need to touch the
      btree of the destination inode's subvolume, as well as potentially
      create a delayed inode for the destination inode (if it was not created
      before). However we are neither balancing the btree dirty pages nor the
      delayed items after such operations, so if we have a task that is doing
      a long series of clone or deduplication operations, it can result in
      accumulation of too many btree dirty pages and delayed items.
      
      So just call btrfs_btree_balance_dirty() after clone and deduplication,
      just like we do for every other system call that results on modifying a
      btree and adding delayed items.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6fe81a3a
    • Filipe Manana's avatar
      btrfs: free the path earlier when creating a new inode · 814e7718
      Filipe Manana authored
      When creating an inode, through btrfs_create_new_inode(), we release the
      path we allocated before once we don't need it anymore. But we keep it
      allocated until we return from that function, which is wasteful because
      after we release the path we do several things that can allocate yet
      another path: inheriting properties, setting the xattrs used by ACLs and
      secutiry modules, adding an orphan item (O_TMPFILE case) or adding a
      dir item (for the non-O_TMPFILE case).
      
      So instead of releasing the path once we don't need it anymore, free it
      instead. This way we avoid having two paths allocated until we return
      from btrfs_create_new_inode().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      814e7718
    • Filipe Manana's avatar
      btrfs: balance btree dirty pages and delayed items after a rename · ca6dee6b
      Filipe Manana authored
      A rename operation modifies a subvolume's btree, to remove the old dir
      item, add the new dir item, remove an inode ref and add a new inode ref.
      It can also create the delayed inode for the inodes involved in the
      operation, and it creates two delayed dir index items, one to delete
      the old name and another one to add the new name.
      
      However we are neither balancing the btree dirty pages nor the delayed
      items after a rename, which can result in accumulation of too many
      btree dirty pages and delayed items, specially if a task is doing a
      series of rename operations (for example it can happen for package
      installations/upgrades through the zypper tool).
      
      So just call btrfs_btree_balance_dirty() after a rename, just like we
      do for every other system call that results on modifying a btree and
      adding delayed items.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ca6dee6b
    • Qu Wenruo's avatar
      btrfs: add trace event for submitted RAID56 bio · b8bea09a
      Qu Wenruo authored
      Add tracepoint for better insight to how the RAID56 data are submitted.
      
      The output looks like this: (trace event header and UUID skipped)
      
         raid56_read_partial: full_stripe=389152768 devid=3 type=DATA1 offset=32768 opf=0x0 physical=323059712 len=32768
         raid56_read_partial: full_stripe=389152768 devid=1 type=DATA2 offset=0 opf=0x0 physical=67174400 len=65536
         raid56_write_stripe: full_stripe=389152768 devid=3 type=DATA1 offset=0 opf=0x1 physical=323026944 len=32768
         raid56_write_stripe: full_stripe=389152768 devid=2 type=PQ1 offset=0 opf=0x1 physical=323026944 len=32768
      
      The above debug output is from a 32K data write into an empty RAID56
      data chunk.
      
      Some explanation on the event output:
      
        full_stripe:	the logical bytenr of the full stripe
        devid:	btrfs devid
        type:		raid stripe type.
               	DATA1:	the first data stripe
               	DATA2:	the second data stripe
               	PQ1:	the P stripe
               	PQ2:	the Q stripe
        offset:	the offset inside the stripe.
        opf:		the bio op type
        physical:	the physical offset the bio is for
        len:		the length of the bio
      
      The first two lines are from partial RMW read, which is reading the
      remaining data stripes from disks.
      
      The last two lines are for full stripe RMW write, which is writing the
      involved two 16K stripes (one for DATA1 stripe, one for P stripe).
      The stripe for DATA2 doesn't need to be written.
      
      There are 5 types of trace events:
      
      - raid56_read_partial
        Read remaining data for regular read/write path.
      
      - raid56_write_stripe
        Write the modified stripes for regular read/write path.
      
      - raid56_scrub_read_recover
        Read remaining data for scrub recovery path.
      
      - raid56_scrub_write_stripe
        Write the modified stripes for scrub path.
      
      - raid56_scrub_read
        Read remaining data for scrub path.
      
      Also, since the trace events are included at super.c, we have to export
      needed structure definitions to 'raid56.h' and include the header in
      super.c, or we're unable to access those members.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ reformat comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b8bea09a
    • Qu Wenruo's avatar
      btrfs: update stripe_sectors::uptodate in steal_rbio · 4d100466
      Qu Wenruo authored
      [BUG]
      With added debugging, it turns out the following write sequence would
      cause extra read which is unnecessary:
      
        # xfs_io -f -s -c "pwrite -b 32k 0 32k" -c "pwrite -b 32k 32k 32k" \
      		 -c "pwrite -b 32k 64k 32k" -c "pwrite -b 32k 96k 32k" \
      		 $mnt/file
      
      The debug message looks like this (btrfs header skipped):
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        ^^^^
         Still partial read, even 389152768 is already cached by the first.
         write.
      
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=0 physical=22020096 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        ^^^^
         Still partial read for 298844160.
      
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      This means every 32K writes, even they are in the same full stripe,
      still trigger read for previously cached data.
      
      This would cause extra RAID56 IO, making the btrfs raid56 cache useless.
      
      [CAUSE]
      Commit d4e28d9b ("btrfs: raid56: make steal_rbio() subpage
      compatible") tries to make steal_rbio() subpage compatible, but during
      that conversion, there is one thing missing.
      
      We no longer rely on PageUptodate(rbio->stripe_pages[i]), but
      rbio->stripe_nsectors[i].uptodate to determine if a sector is uptodate.
      
      This means, previously if we switch the pointer, everything is done,
      as the PageUptodate flag is still bound to that page.
      
      But now we have to manually mark the involved sectors uptodate, or later
      raid56_rmw_stripe() will find the stolen sector is not uptodate, and
      assemble the read bio for it, wasting IO.
      
      [FIX]
      We can easily fix the bug, by also update the
      rbio->stripe_sectors[].uptodate in steal_rbio().
      
      With this fixed, now the same write pattern no longer leads to the same
      unnecessary read:
      
        partial rmw, full stripe=389152768 opf=0x0 devid=3 type=1 offset=32768 physical=323059712 len=32768
        partial rmw, full stripe=389152768 opf=0x0 devid=1 type=2 offset=0 physical=67174400 len=65536
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=0 physical=323026944 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=0 physical=323026944 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=1 type=1 offset=32768 physical=22052864 len=32768
        partial rmw, full stripe=298844160 opf=0x0 devid=2 type=2 offset=0 physical=277872640 len=65536
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=0 physical=22020096 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=0 physical=277872640 len=32768
        ^^^ No more partial read, directly into the write path.
        full stripe rmw, full stripe=389152768 opf=0x1 devid=3 type=1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=389152768 opf=0x1 devid=2 type=-1 offset=32768 physical=323059712 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=1 type=1 offset=32768 physical=22052864 len=32768
        full stripe rmw, full stripe=298844160 opf=0x1 devid=3 type=-1 offset=32768 physical=277905408 len=32768
      
      Fixes: d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4d100466
    • David Sterba's avatar
      btrfs: remove redundant calls to flush_dcache_page · 21a8935e
      David Sterba authored
      Both memzero_page and memcpy_to_page already call flush_dcache_page so
      we can remove the calls from btrfs code.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      21a8935e
    • Qu Wenruo's avatar
      btrfs: only write the sectors in the vertical stripe which has data stripes · bd8f7e62
      Qu Wenruo authored
      If we have only 8K partial write at the beginning of a full RAID56
      stripe, we will write the following contents:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XXXXXXXXXXXXXXX|XXXXXXXXXXXXXXX|
      
      |X| means the sector will be written back to disk.
      
      Note that, although we won't write any sectors from disk 2, but we will
      write the full 64KiB of parity to disk.
      
      This behavior is fine for now, but not for the future (especially for
      RAID56J, as we waste quite some space to journal the unused parity
      stripes).
      
      So here we will also utilize the btrfs_raid_bio::dbitmap, anytime we
      queue a higher level bio into an rbio, we will update rbio::dbitmap to
      indicate which vertical stripes we need to writeback.
      
      And at finish_rmw(), we also check dbitmap to see if we need to write
      any sector in the vertical stripe.
      
      So after the patch, above example will only lead to the following
      writeback pattern:
      
                          0  8K           32K             64K
      Disk 1	(data):     |XX|            |               |
      Disk 2  (data):     |               |               |
      Disk 3  (parity):   |XX|            |               |
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd8f7e62
    • Qu Wenruo's avatar
      btrfs: use integrated bitmaps for scrub_parity::dbitmap and ebitmap · 381b9b4c
      Qu Wenruo authored
      Previously we use "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for scrub_parity.
      
      To be extra safe, add an ASSERT() making sure calclulated @nsectors is
      always smaller than BITS_PER_LONG.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      381b9b4c
    • Qu Wenruo's avatar
      btrfs: use integrated bitmaps for btrfs_raid_bio::dbitmap and finish_pbitmap · c67c68eb
      Qu Wenruo authored
      Previsouly we use "unsigned long *" for those two bitmaps.
      
      But since we only support fixed stripe length (64KiB, already checked in
      tree-checker), "unsigned long *" is really a waste of memory, while we
      can just use "unsigned long".
      
      This saves us 8 bytes in total for btrfs_raid_bio.
      
      To be extra safe, add an ASSERT() making sure calculated
      @stripe_nsectors is always smaller than BITS_PER_LONG.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c67c68eb
    • Nikolay Borisov's avatar
      btrfs: use btrfs_try_lock_balance in btrfs_ioctl_balance · 099aa972
      Nikolay Borisov authored
      This eliminates 2 labels and makes the code generally more streamlined.
      Also rename the 'out_bargs' label to 'out_unlock' since bargs is going
      to be freed under the 'out' label. This also fixes a memory leak since
      bargs wasn't correctly freed in one of the condition which are now moved
      in btrfs_try_lock_balance.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      099aa972
    • Nikolay Borisov's avatar
      btrfs: introduce btrfs_try_lock_balance · 7fb10ed8
      Nikolay Borisov authored
      This function contains the factored out locking sequence of
      btrfs_ioctl_balance. Having this piece of code separate helps to
      simplify btrfs_ioctl_balance which has too complicated.  This will be
      used in the next patch to streamline the logic in btrfs_ioctl_balance.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7fb10ed8
    • Christoph Hellwig's avatar
      btrfs: use btrfs_bio_for_each_sector in btrfs_check_read_dio_bio · 1e87770c
      Christoph Hellwig authored
      Use the new btrfs_bio_for_each_sector iterator to simplify
      btrfs_check_read_dio_bio.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1e87770c
    • Qu Wenruo's avatar
      btrfs: add a helper to iterate through a btrfs_bio with sector sized chunks · 261d812b
      Qu Wenruo authored
      Add a helper that works similar to __bio_for_each_segment, but instead of
      iterating over PAGE_SIZE chunks it iterates over each sector.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [hch: split from a larger patch, and iterate over the offset instead of
            the offset bits]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ add parameter comments ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      261d812b
    • Christoph Hellwig's avatar
      btrfs: factor out a btrfs_csum_ptr helper · a89ce08c
      Christoph Hellwig authored
      Add a helper to find the csum for a byte offset into the csum buffer.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a89ce08c
    • Christoph Hellwig's avatar
      btrfs: refactor end_bio_extent_readpage code flow · 97861cd1
      Christoph Hellwig authored
      Untangle the goto and move the code it jumps to so it goes in the order
      of the most likely states first.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      97861cd1
    • Christoph Hellwig's avatar
      btrfs: factor out a helper to end a single sector buffer I/O · a5aa7ab6
      Christoph Hellwig authored
      Add a helper to end I/O on a single sector, which will come in handy
      with the new read repair code.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a5aa7ab6
    • Qu Wenruo's avatar
      btrfs: remove duplicated parameters from submit_data_read_repair() · fd5a6f63
      Qu Wenruo authored
      The function submit_data_read_repair() is only called for buffered data
      read path, thus those members can be calculated using bvec directly:
      
      - start
        start = page_offset(bvec->bv_page) + bvec->bv_offset;
      
      - end
        end = start + bvec->bv_len - 1;
      
      - page
        page = bvec->bv_page;
      
      - pgoff
        pgoff = bvec->bv_offset;
      
      Thus we can safely replace those 4 parameters with just one bio_vec.
      
      Also remove the unused return value.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [hch: also remove the return value]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fd5a6f63
    • Qu Wenruo's avatar
      btrfs: introduce a data checksum checking helper · ae643a74
      Qu Wenruo authored
      Although we have several data csum verification code, we never have a
      function really just to verify checksum for one sector.
      
      Function check_data_csum() do extra work for error reporting, thus it
      requires a lot of extra things like file offset, bio_offset etc.
      
      Function btrfs_verify_data_csum() is even worse, it will utilize page
      checked flag, which means it can not be utilized for direct IO pages.
      
      Here we introduce a new helper, btrfs_check_sector_csum(), which really
      only accept a sector in page, and expected checksum pointer.
      
      We use this function to implement check_data_csum(), and export it for
      incoming patch.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      [hch: keep passing the csum array as an arguments, as the callers want
            to print it, rename per request]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae643a74
    • Qu Wenruo's avatar
      btrfs: quit early if the fs has no RAID56 support for raid56 related checks · b036f479
      Qu Wenruo authored
      The following functions do special handling for RAID56 chunks:
      
      - btrfs_is_parity_mirror()
        Check if the range is in RAID56 chunks.
      
      - btrfs_full_stripe_len()
        Either return sectorsize for non-RAID56 profiles or full stripe length
        for RAID56 chunks.
      
      But if a filesystem without any RAID56 chunks, it will not have RAID56
      incompat flags, and we can skip the chunk tree looking up completely.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b036f479
    • Fanjun Kong's avatar
      btrfs: use PAGE_ALIGNED instead of IS_ALIGNED · 1280d2d1
      Fanjun Kong authored
      The <linux/mm.h> already provides the PAGE_ALIGNED macro. Let's
      use it instead of IS_ALIGNED and passing PAGE_SIZE directly.
      Reviewed-by: default avatarMuchun Song <songmuchun@bytedance.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarFanjun Kong <bh1scw@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1280d2d1
    • Pankaj Raghav's avatar
      btrfs: zoned: fix comment description for sb_write_pointer logic · 31f37269
      Pankaj Raghav authored
      Fix the comment to represent the actual logic used for sb_write_pointer
      
      - Empty[0] && In use[1] should be an invalid state instead of returning
        zone 0 wp
      - Empty[0] && Full[1] should be returning zone 0 wp instead of zone 1 wp
      - In use[0] && Empty[1] should be returning zone 0 wp instead of being an
        invalid state
      - In use[0] && Full[1] should be returning zone 0 wp instead of returning
        zone 1 wp
      - Full[0] && Empty[1] should be returning zone 1 wp instead of returning
        zone 0 wp
      - Full[0] && In use[1] should be returning zone 1 wp instead of returning
        zone 0 wp
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      31f37269
    • David Sterba's avatar
      btrfs: fix typos in comments · 143823cf
      David Sterba authored
      Codespell has found a few typos.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      143823cf
  2. 24 Jul, 2022 6 commits
  3. 23 Jul, 2022 1 commit
    • Linus Torvalds's avatar
      Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 515f7141
      Linus Torvalds authored
      Pull kvm fixes from Paolo Bonzini:
      
       - Check for invalid flags to KVM_CAP_X86_USER_SPACE_MSR
      
       - Fix use of sched_setaffinity in selftests
      
       - Sync kernel headers to tools
      
       - Fix KVM_STATS_UNIT_MAX
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: x86: Protect the unused bits in MSR exiting flags
        tools headers UAPI: Sync linux/kvm.h with the kernel sources
        KVM: selftests: Fix target thread to be migrated in rseq_test
        KVM: stats: Fix value for KVM_STATS_UNIT_MAX for boolean stats
      515f7141