1. 26 Oct, 2021 40 commits
    • Qu Wenruo's avatar
      btrfs: refactor submit_compressed_extents() · b4ccace8
      Qu Wenruo authored
      We have a big chunk of code inside a while() loop, with tons of strange
      jumps for error handling.  It's definitely not to the code standard of
      today.  Move the code into a new function, submit_one_async_extent().
      
      Since we're here, also do the following changes:
      
      - Comment style change
        To follow the current scheme
      
      - Don't fallback to non-compressed write then hitting ENOSPC
        If we hit ENOSPC for compressed write, how could we reserve more space
        for non-compressed write?
        Thus we go error path directly.
        This removes the retry: label.
      
      - Add more comment for super long parameter list
        Explain which parameter is for, so we don't need to check the
        prototype.
      
      - Move the error handling to submit_one_async_extent()
        Thus no strange code like:
      
        out_free:
      	...
      	goto again;
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b4ccace8
    • Qu Wenruo's avatar
      btrfs: remove unused function btrfs_bio_fits_in_stripe() · 6aabd858
      Qu Wenruo authored
      As the last caller in compression.c has been removed, we don't need that
      function anymore.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6aabd858
    • Qu Wenruo's avatar
      btrfs: determine stripe boundary at bio allocation time in btrfs_submit_compressed_write · 91507240
      Qu Wenruo authored
      Currently btrfs_submit_compressed_write() will check
      btrfs_bio_fits_in_stripe() each time a new page is going to be added.
      Even if compressed extent is small, we don't really need to do that for
      every page.
      
      Align the behavior to extent_io.c, by determining the stripe boundary
      when allocating a bio.
      
      Unlike extent_io.c, in compressed.c we don't need to bother things like
      different bio flags, thus no need to re-use bio_ctrl.
      
      Here we just manually introduce new local variable, next_stripe_start,
      and use that value returned from alloc_compressed_bio() to calculate
      the stripe boundary.
      
      Then each time we add some page range into the bio, we check if we
      reached the boundary.  And if reached, submit it.
      
      Also, since we have @cur_disk_bytenr to determine whether we're the last
      bio, we don't need a explicit last_bio: tag for error handling any more.
      
      And since we use @cur_disk_bytenr to wait, there is no need for
      pending_bios, also remove it to save some memory of compressed_bio.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91507240
    • Qu Wenruo's avatar
      btrfs: determine stripe boundary at bio allocation time in btrfs_submit_compressed_read · f472c28f
      Qu Wenruo authored
      Currently btrfs_submit_compressed_read() will check
      btrfs_bio_fits_in_stripe() each time a new page is going to be added.
      Even if compressed extent is small, we don't really need to do that for
      every page.
      
      This patch will align the behavior to extent_io.c, by determining the
      stripe boundary when allocating a bio.
      
      Unlike extent_io.c, in compressed.c we don't need to bother things like
      different bio flags, thus no need to re-use bio_ctrl.
      
      Here we just manually introduce new local variable, next_stripe_start,
      and teach alloc_compressed_bio() to calculate the stripe boundary.
      
      Then each time we add some page range into the bio, we check if we
      reached the boundary.  And if reached, submit it.
      
      Also, since we have @cur_disk_byte to determine whether we're the last
      bio, we don't need a explicit last_bio: tag for error handling any more.
      
      And we can use @cur_disk_byte to track which range has been added to
      bio, we can also use @cur_disk_byte to calculate the wait condition, no
      need for @pending_bios.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f472c28f
    • Qu Wenruo's avatar
      btrfs: introduce alloc_compressed_bio() for compression · 22c306fe
      Qu Wenruo authored
      Just aggregate the bio allocation code into one helper, so that we can
      replace 4 call sites.
      
      There is one special note for zoned write.
      
      Currently btrfs_submit_compressed_write() will only allocate the first
      bio using ZONE_APPEND.  If we have to submit current bio due to stripe
      boundary, the new bio allocated will not use ZONE_APPEND.
      
      In theory this should be a bug, but considering zoned mode currently
      only support SINGLE profile, which doesn't have any stripe boundary
      limit, it should never be a problem and we have assertions in place.
      
      This function will provide a good entrance for any work which needs to
      be done at bio allocation time. Like determining the stripe boundary.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      22c306fe
    • Qu Wenruo's avatar
      btrfs: introduce submit_compressed_bio() for compression · 2d4e0b84
      Qu Wenruo authored
      The new helper, submit_compressed_bio(), will aggregate the following
      work:
      
      - Increase compressed_bio::pending_bios
      - Remap the endio function
      - Map and submit the bio
      
      This slightly reorders calls to btrfs_csum_one_bio or
      btrfs_lookup_bio_sums but but none of them does anything regarding IO
      submission so this is effectively no change. We mainly care about order
      of
      
      - atomic_inc
      - btrfs_bio_wq_end_io
      - btrfs_map_bio
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d4e0b84
    • Qu Wenruo's avatar
      btrfs: handle errors properly inside btrfs_submit_compressed_write() · 6853c64a
      Qu Wenruo authored
      Just like btrfs_submit_compressed_read(), there are quite some BUG_ON()s
      inside btrfs_submit_compressed_write() for the bio submission path.
      
      Fix them using the same method:
      
      - For last bio, just endio the bio
        As in that case, one of the endio function of all these submitted bio
        will be able to free the compressed_bio
      
      - For half-submitted bio, wait and finish the compressed_bio manually
        In this case, as long as all other bio finish, we're the only one
        referring the compressed bio, and can manually finish it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6853c64a
    • Qu Wenruo's avatar
      btrfs: handle errors properly inside btrfs_submit_compressed_read() · 86ccbb4d
      Qu Wenruo authored
      There are quite some BUG_ON()s inside btrfs_submit_compressed_read(),
      namely all errors inside the for() loop relies on BUG_ON() to handle
      -ENOMEM.
      
      Handle these errors properly by:
      
      - Wait for submitted bios to finish first
        Using wake_var_event() APIs to wait without introducing extra memory
        overhead inside compressed_bio.
        This allows us to wait for any submitted bio to finish, while still
        keeps the compressed_bio from being freed.
      
      - Introduce finish_compressed_bio_read() to finish the compressed_bio
      
      - Properly end the bio and finish compressed_bio when error happens
      
      Now in btrfs_submit_compressed_read() even when the bio submission
      failed, we can properly handle the error without triggering BUG_ON().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      86ccbb4d
    • Qu Wenruo's avatar
      btrfs: subpage: add bitmap for PageChecked flag · e4f94347
      Qu Wenruo authored
      Although in btrfs we have very limited usage of PageChecked flag, it's
      still some page flag not yet subpage compatible.
      
      Fix it by introducing btrfs_subpage::checked_offset to do the convert.
      
      For most call sites, especially for free-space cache, COW fixup and
      btrfs_invalidatepage(), they all work in full page mode anyway.
      
      For other call sites, they work as subpage compatible mode.
      
      Some call sites need extra modification:
      
      - btrfs_drop_pages()
        Needs extra parameter to get the real range we need to clear checked
        flag.
      
        Also since btrfs_drop_pages() will accept pages beyond the dirtied
        range, update btrfs_subpage_clamp_range() to handle such case
        by setting @len to 0 if the page is beyond target range.
      
      - btrfs_invalidatepage()
        We need to call subpage helper before calling __btrfs_releasepage(),
        or it will trigger ASSERT() as page->private will be cleared.
      
      - btrfs_verify_data_csum()
        In theory we don't need the io_bio->csum check anymore, but it's
        won't hurt.  Just change the comment.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e4f94347
    • Qu Wenruo's avatar
      btrfs: introduce compressed_bio::pending_sectors to trace compressed bio · 6ec9765d
      Qu Wenruo authored
      For btrfs_submit_compressed_read() and btrfs_submit_compressed_write(),
      we have a pretty weird dance around compressed_bio::pending_bios:
      
        btrfs_submit_compressed_read/write()
        {
      	cb = kmalloc()
      	refcount_set(&cb->pending_bios, 0);
      	bio = btrfs_alloc_bio();
      
      	/* NOTE here, we haven't yet submitted any bio */
      	refcount_set(&cb->pending_bios, 1);
      
      	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
      		if (submit) {
      			/* Here we submit bio, but we always have one
      			 * extra pending_bios */
      			refcount_inc(&cb->pending_bios);
      			ret = btrfs_map_bio();
      		}
      	}
      
      	/* Submit the last bio */
      	ret = btrfs_map_bio();
        }
      
      There are two reasons why we do this:
      
      - compressed_bio::pending_bios is a refcount
        Thus if it's reduced to 0, it can not be increased again.
      
      - To ensure the compressed_bio is not freed by some submitted bios
        If the submitted bio is finished before the next bio submitted,
        we can free the compressed_bio completely.
      
      But the above code is sometimes confusing, and we can do it better by
      introducing a new member, compressed_bio::pending_sectors.
      
      Now we use compressed_bio::pending_sectors to indicate whether we have
      any pending sectors under IO or not yet submitted.
      
      If pending_sectors == 0, we're definitely the last bio of compressed_bio,
      and is OK to release the compressed bio.
      
      Now the workflow looks like this:
      
        btrfs_submit_compressed_read/write()
        {
      	cb = kmalloc()
      	atomic_set(&cb->pending_bios, 0);
      	refcount_set(&cb->pending_sectors,
      		     compressed_len >> sectorsize_bits);
      	bio = btrfs_alloc_bio();
      
      	for (pg_index = 0; pg_index < cb->nr_pages; pg_index++) {
      		if (submit) {
      			refcount_inc(&cb->pending_bios);
      			ret = btrfs_map_bio();
      		}
      	}
      
      	/* Submit the last bio */
      	refcount_inc(&cb->pending_bios);
      	ret = btrfs_map_bio();
        }
      
      For now we still need pending_bios for later error handling, but will
      remove pending_bios eventually after properly handling the errors.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6ec9765d
    • Qu Wenruo's avatar
      btrfs: subpage: make add_ra_bio_pages() compatible · 6a404910
      Qu Wenruo authored
      [BUG]
      If we remove the subpage limitation in add_ra_bio_pages(), then read a
      compressed extent which has part of its range in next page, like the
      following inode layout:
      
      	0	32K	64K	96K	128K
      	|<--------------|-------------->|
      
      Btrfs will trigger ASSERT() in endio function:
      
        assertion failed: atomic_read(&subpage->readers) >= nbits
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/ctree.h:3431!
        Internal error: Oops - BUG: 0 [#1] SMP
        Workqueue: btrfs-endio btrfs_work_helper [btrfs]
        Call trace:
         assertfail.constprop.0+0x28/0x2c [btrfs]
         btrfs_subpage_end_reader+0x148/0x14c [btrfs]
         end_page_read+0x8c/0x100 [btrfs]
         end_bio_extent_readpage+0x320/0x6b0 [btrfs]
         bio_endio+0x15c/0x1dc
         end_workqueue_fn+0x44/0x64 [btrfs]
         btrfs_work_helper+0x74/0x250 [btrfs]
         process_one_work+0x1d4/0x47c
         worker_thread+0x180/0x400
         kthread+0x11c/0x120
         ret_from_fork+0x10/0x30
        ---[ end trace c8b7b552d3bb408c ]---
      
      [CAUSE]
      When we read the page range [0, 64K), we find it's a compressed extent,
      and we will try to add extra pages in add_ra_bio_pages() to avoid
      reading the same compressed extent.
      
      But when we add such page into the read bio, it doesn't follow the
      behavior of btrfs_do_readpage() to properly set subpage::readers.
      
      This means, for page [64K, 128K), its subpage::readers is still 0.
      
      And when endio is executed on both pages, since page [64K, 128K) has 0
      subpage::readers, it triggers above ASSERT()
      
      [FIX]
      Function add_ra_bio_pages() is far from subpage compatible, it always
      assume PAGE_SIZE == sectorsize, thus when it skip to next range it
      always just skip PAGE_SIZE.
      
      Make it subpage compatible by:
      
      - Skip to next page properly when needed
        If we find there is already a page cache, we need to skip to next page.
        For that case, we shouldn't just skip PAGE_SIZE bytes, but use
        @pg_index to calculate the next bytenr and continue.
      
      - Only add the page range covered by current extent map
        We need to calculate which range is covered by current extent map and
        only add that part into the read bio.
      
      - Update subpage::readers before submitting the bio
      
      - Use proper cursor other than confusing @last_offset
      
      - Calculate the missed threshold based on sector size
        It's no longer using missed pages, as for 64K page size, we have at
        most 3 pages to skip. (If aligned only 2 pages)
      
      - Add ASSERT() to make sure our bytenr is always aligned
      
      - Add comment for the function
        Add a special note for subpage case, as the function won't really
        work well for subpage cases.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6a404910
    • Qu Wenruo's avatar
      btrfs: don't pass compressed pages to btrfs_writepage_endio_finish_ordered() · 58469174
      Qu Wenruo authored
      Since async_extent holds the compressed page, it would trigger the new
      ASSERT() in btrfs_mark_ordered_io_finished() which checks that the range
      is inside the page.
      
      Now btrfs_writepage_endio_finish_ordered() can accept @page == NULL,
      just pass NULL to btrfs_writepage_endio_finish_ordered().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      58469174
    • Qu Wenruo's avatar
      btrfs: use async_chunk::async_cow to replace the confusing pending pointer · 9e895a8f
      Qu Wenruo authored
      For structure async_chunk, we use a very strange member layout to grab
      structure async_cow who owns this async_chunk.
      
      At initialization, it goes like this:
      
      		async_chunk[i].pending = &ctx->num_chunks;
      
      Then at async_cow_free() we do a super weird freeing:
      
      	/*
      	 * Since the pointer to 'pending' is at the beginning of the array of
      	 * async_chunk's, freeing it ensures the whole array has been freed.
      	 */
      	if (atomic_dec_and_test(async_chunk->pending))
      		kvfree(async_chunk->pending);
      
      This is absolutely an abuse of kvfree().
      
      Replace async_chunk::pending with async_chunk::async_cow, so that we can
      grab the async_cow structure directly, without this strange dancing.
      
      And with this change, there is no requirement for any specific member
      location.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e895a8f
    • Qu Wenruo's avatar
      btrfs: remove unnecessary parameter delalloc_start for writepage_delalloc() · cf3075fb
      Qu Wenruo authored
      In function __extent_writepage() we always pass page start to
      @delalloc_start for writepage_delalloc().
      
      Thus we don't really need @delalloc_start parameter as we can extract it
      from @page.
      
      Remove @delalloc_start parameter and make __extent_writepage() to
      declare @page_start and @page_end as const.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cf3075fb
    • Qu Wenruo's avatar
      btrfs: remove unused parameter nr_pages in add_ra_bio_pages() · cd9255be
      Qu Wenruo authored
      Variable @nr_pages only gets increased but never used.  Remove it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd9255be
    • Filipe Manana's avatar
      btrfs: use single bulk copy operations when logging directories · da1b811f
      Filipe Manana authored
      When logging a directory and inserting a batch of directory items, we are
      copying the data of each item from a leaf in the fs/subvolume tree to a
      leaf in a log tree, separately. This is not really needed, since we are
      copying from a contiguous memory area into another one, so we can use a
      single copy operation to copy all items at once.
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 3/3.
      
      The following test was used to compare performance of a branch without the
      patchset versus one branch that has the whole patchset applied:
      
        $ cat dir-fsync-test.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_NEW_FILES=1000000
        NUM_FILE_DELETES=1000
        LEAF_SIZE=16K
      
        mkfs.btrfs -f -n $LEAF_SIZE $DEV
        mount -o ssd $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # Fsync the directory, this will log the new dir items and the inodes
        # they point to, because these are new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
        # sync to force transaction commit and wipeout the log.
        sync
      
        del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # Fsync the directory, this will only log dir items, there are no
        # dentries pointing to new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
      
        umount $MNT
      
      The tests were run on a non-debug kernel (Debian's default kernel config)
      and were the following:
      
      *** with a leaf size of 16K, before patchset ***
      
      dir fsync took 8482 ms after adding 1000000 files
      dir fsync took 166 ms after deleting 1000 files
      
      *** with a leaf size of 16K, after patchset ***
      
      dir fsync took 8196 ms after adding 1000000 files  (-3.4%)
      dir fsync took 143 ms after deleting 1000 files    (-14.9%)
      
      *** with a leaf size of 64K, before patchset ***
      
      dir fsync took 12851 ms after adding 1000000 files
      dir fsync took 466 ms after deleting 1000 files
      
      *** with a leaf size of 64K, after  patchset ***
      
      dir fsync took 12287 ms after adding 1000000 files (-4.5%)
      dir fsync took 414 ms after deleting 1000 files    (-11.8%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      da1b811f
    • Filipe Manana's avatar
      btrfs: unexport setup_items_for_insert() · f0641656
      Filipe Manana authored
      Since setup_items_for_insert() is not used anymore outside of ctree.c,
      make it static and remove its prototype from ctree.h. This also requires
      to move the definition of setup_item_for_insert() from ctree.h to ctree.c
      and move down btrfs_duplicate_item() so that it's defined after
      setup_items_for_insert().
      
      Further, since setup_item_for_insert() is used outside ctree.c, rename it
      to btrfs_setup_item_for_insert().
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 2/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f0641656
    • Filipe Manana's avatar
      btrfs: loop only once over data sizes array when inserting an item batch · b7ef5f3a
      Filipe Manana authored
      When inserting a batch of items into a btree, we end up looping over the
      data sizes array 3 times:
      
      1) Once in the caller of btrfs_insert_empty_items(), when it populates the
         array with the data sizes for each item;
      
      2) Once at btrfs_insert_empty_items() to sum the elements of the data
         sizes array and compute the total data size;
      
      3) And then once again at setup_items_for_insert(), where we do exactly
         the same as what we do at btrfs_insert_empty_items(), to compute the
         total data size.
      
      That is not bad for small arrays, but when the arrays have hundreds of
      elements, the time spent on looping is not negligible. For example when
      doing batch inserts of delayed items for dir index items or when logging
      a directory, it's common to have 200 to 260 dir index items in a single
      batch when using a leaf size of 16K and using file names between 8 and 12
      characters. For a 64K leaf size, multiply that by 4. Taking into account
      that during directory logging or when flushing delayed dir index items we
      can have many of those large batches, the time spent on the looping adds
      up quickly.
      
      It's also more important to avoid it at setup_items_for_insert(), since
      we are holding a write lock on a leaf and, in some cases, on upper nodes
      of the btree, which causes us to block other tasks that want to access
      the leaf and nodes for longer than necessary.
      
      So change the code so that setup_items_for_insert() and
      btrfs_insert_empty_items() no longer compute the total data size, and
      instead rely on the caller to supply it. This makes us loop over the
      array only once, where we can both populate the data size array and
      compute the total data size, taking advantage of spatial and temporal
      locality. To make this more manageable, use a structure to contain
      all the relevant details for a batch of items (keys array, data sizes
      array, total data size, number of items), and use it as an argument
      for btrfs_insert_empty_items() and setup_items_for_insert().
      
      This patch is part of a small patchset that is comprised of the following
      patches:
      
        btrfs: loop only once over data sizes array when inserting an item batch
        btrfs: unexport setup_items_for_insert()
        btrfs: use single bulk copy operations when logging directories
      
      This is patch 1/3 and performance results, and the specific tests, are
      included in the changelog of patch 3/3.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b7ef5f3a
    • Qu Wenruo's avatar
      btrfs: remove btrfs_raid_bio::fs_info member · 6a258d72
      Qu Wenruo authored
      We can grab fs_info reliably from btrfs_raid_bio::bioc, as the bioc is
      always passed into alloc_rbio(), and only get released when the raid bio
      is released.
      
      Remove btrfs_raid_bio::fs_info member, and cleanup all the @fs_info
      parameters for alloc_rbio() callers.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6a258d72
    • Qu Wenruo's avatar
      btrfs: make sure btrfs_io_context::fs_info is always initialized · 731ccf15
      Qu Wenruo authored
      Currently btrfs_io_context::fs_info is only initialized in
      btrfs_map_bio, but there are call sites like btrfs_map_sblock() which
      calls __btrfs_map_block() directly, leaving bioc::fs_info uninitialized
      (NULL).
      
      Currently this is fine, but later cleanup will rely on bioc::fs_info to
      grab fs_info, and this can be a hidden problem for such usage.
      
      This patch will remove such hidden uninitialized member by always
      assigning bioc::fs_info at alloc_btrfs_io_context().
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      731ccf15
    • Filipe Manana's avatar
      btrfs: assert that extent buffers are write locked instead of only locked · 49d0c642
      Filipe Manana authored
      We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and
      that checks that we hold a lock either in read mode or write mode.
      
      However in all contexts we use btrfs_assert_tree_locked(), we actually
      want to check if we are holding a write lock on the extent buffer's rw
      semaphore - it would be a bug if in any of those contexts we were holding
      a read lock instead.
      
      So change btrfs_assert_tree_locked() to use lockdep_assert_held_write()
      instead and, to make it more explicit, rename btrfs_assert_tree_locked()
      to btrfs_assert_tree_write_locked(), so that it's clear we want to check
      we are holding a write lock.
      
      For now there are no contexts where we want to assert that we must have
      a read lock, but in case that is needed in the future, we can add a new
      helper function that just calls out lockdep_assert_held_read().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      49d0c642
    • Josef Bacik's avatar
      btrfs: do not take the uuid_mutex in btrfs_rm_device · 8ef9dc0f
      Josef Bacik authored
      We got the following lockdep splat while running fstests (specifically
      btrfs/003 and btrfs/020 in a row) with the new rc.  This was uncovered
      by 87579e9b ("loop: use worker per cgroup instead of kworker") which
      converted loop to using workqueues, which comes with lockdep
      annotations that don't exist with kworkers.  The lockdep splat is as
      follows:
      
        WARNING: possible circular locking dependency detected
        5.14.0-rc2-custom+ #34 Not tainted
        ------------------------------------------------------
        losetup/156417 is trying to acquire lock:
        ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600
      
        but task is already holding lock:
        ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #5 (&lo->lo_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 lo_open+0x28/0x60 [loop]
      	 blkdev_get_whole+0x28/0xf0
      	 blkdev_get_by_dev.part.0+0x168/0x3c0
      	 blkdev_open+0xd2/0xe0
      	 do_dentry_open+0x163/0x3a0
      	 path_openat+0x74d/0xa40
      	 do_filp_open+0x9c/0x140
      	 do_sys_openat2+0xb1/0x170
      	 __x64_sys_openat+0x54/0x90
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #4 (&disk->open_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 blkdev_get_by_dev.part.0+0xd1/0x3c0
      	 blkdev_get_by_path+0xc0/0xd0
      	 btrfs_scan_one_device+0x52/0x1f0 [btrfs]
      	 btrfs_control_ioctl+0xac/0x170 [btrfs]
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #3 (uuid_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0xba/0x7c0
      	 btrfs_rm_device+0x48/0x6a0 [btrfs]
      	 btrfs_ioctl+0x2d1c/0x3110 [btrfs]
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        -> #2 (sb_writers#11){.+.+}-{0:0}:
      	 lo_write_bvec+0x112/0x290 [loop]
      	 loop_process_work+0x25f/0xcb0 [loop]
      	 process_one_work+0x28f/0x5d0
      	 worker_thread+0x55/0x3c0
      	 kthread+0x140/0x170
      	 ret_from_fork+0x22/0x30
      
        -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
      	 process_one_work+0x266/0x5d0
      	 worker_thread+0x55/0x3c0
      	 kthread+0x140/0x170
      	 ret_from_fork+0x22/0x30
      
        -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
      	 __lock_acquire+0x1130/0x1dc0
      	 lock_acquire+0xf5/0x320
      	 flush_workqueue+0xae/0x600
      	 drain_workqueue+0xa0/0x110
      	 destroy_workqueue+0x36/0x250
      	 __loop_clr_fd+0x9a/0x650 [loop]
      	 lo_ioctl+0x29d/0x780 [loop]
      	 block_ioctl+0x3f/0x50
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x3b/0x90
      	 entry_SYSCALL_64_after_hwframe+0x44/0xae
      
        other info that might help us debug this:
        Chain exists of:
          (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
         Possible unsafe locking scenario:
      	 CPU0                    CPU1
      	 ----                    ----
          lock(&lo->lo_mutex);
      				 lock(&disk->open_mutex);
      				 lock(&lo->lo_mutex);
          lock((wq_completion)loop0);
      
         *** DEADLOCK ***
        1 lock held by losetup/156417:
         #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]
      
        stack backtrace:
        CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x57/0x72
         check_noncircular+0x10a/0x120
         __lock_acquire+0x1130/0x1dc0
         lock_acquire+0xf5/0x320
         ? flush_workqueue+0x84/0x600
         flush_workqueue+0xae/0x600
         ? flush_workqueue+0x84/0x600
         drain_workqueue+0xa0/0x110
         destroy_workqueue+0x36/0x250
         __loop_clr_fd+0x9a/0x650 [loop]
         lo_ioctl+0x29d/0x780 [loop]
         ? __lock_acquire+0x3a0/0x1dc0
         ? update_dl_rq_load_avg+0x152/0x360
         ? lock_is_held_type+0xa5/0x120
         ? find_held_lock.constprop.0+0x2b/0x80
         block_ioctl+0x3f/0x50
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0x90
         entry_SYSCALL_64_after_hwframe+0x44/0xae
        RIP: 0033:0x7f645884de6b
      
      Usually the uuid_mutex exists to protect the fs_devices that map
      together all of the devices that match a specific uuid.  In rm_device
      we're messing with the uuid of a device, so it makes sense to protect
      that here.
      
      However in doing that it pulls in a whole host of lockdep dependencies,
      as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
      we end up with the dependency chain under the uuid_mutex being added
      under the normal sb write dependency chain, which causes problems with
      loop devices.
      
      We don't need the uuid mutex here however.  If we call
      btrfs_scan_one_device() before we scratch the super block we will find
      the fs_devices and not find the device itself and return EBUSY because
      the fs_devices is open.  If we call it after the scratch happens it will
      not appear to be a valid btrfs file system.
      
      We do not need to worry about other fs_devices modifying operations here
      because we're protected by the exclusive operations locking.
      
      So drop the uuid_mutex here in order to fix the lockdep splat.
      
      A more detailed explanation from the discussion:
      
      We are worried about rm and scan racing with each other, before this
      change we'll zero the device out under the UUID mutex so when scan does
      run it'll make sure that it can go through the whole device scan thing
      without rm messing with us.
      
      We aren't worried if the scratch happens first, because the result is we
      don't think this is a btrfs device and we bail out.
      
      The only case we are concerned with is we scratch _after_ scan is able
      to read the superblock and gets a seemingly valid super block, so lets
      consider this case.
      
      Scan will call device_list_add() with the device we're removing.  We'll
      call find_fsid_with_metadata_uuid() and get our fs_devices for this
      UUID.  At this point we lock the fs_devices->device_list_mutex.  This is
      what protects us in this case, but we have two cases here.
      
      1. We aren't to the device removal part of the RM.  We found our device,
         and device name matches our path, we go down and we set total_devices
         to our super number of devices, which doesn't affect anything because
         we haven't done the remove yet.
      
      2. We are past the device removal part, which is protected by the
         device_list_mutex.  Scan doesn't find the device, it goes down and
         does the
      
         if (fs_devices->opened)
      	   return -EBUSY;
      
         check and we bail out.
      
      Nothing about this situation is ideal, but the lockdep splat is real,
      and the fix is safe, tho admittedly a bit scary looking.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ copy more from the discussion ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ef9dc0f
    • Qu Wenruo's avatar
      btrfs: rename struct btrfs_io_bio to btrfs_bio · c3a3b19b
      Qu Wenruo authored
      Previously we had "struct btrfs_bio", which records IO context for
      mirrored IO and RAID56, and "strcut btrfs_io_bio", which records extra
      btrfs specific info for logical bytenr bio.
      
      With "btrfs_bio" renamed to "btrfs_io_context", we are safe to rename
      "btrfs_io_bio" to "btrfs_bio" which is a more suitable name now.
      
      The struct btrfs_bio changes meaning by this commit. There was a
      suggested name like btrfs_logical_bio but it's a bit long and we'd
      prefer to use a shorter name.
      
      This could be a concern for backports to older kernels where the
      different meaning could possibly cause confusion or bugs. Comparing the
      new and old structures, there's no overlap among the struct members so a
      build would break in case of incorrect backport.
      
      We haven't had many backports to bio code anyway so this is more of a
      theoretical cause of bugs and a matter of precaution but we'll need to
      keep the semantic change in mind.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c3a3b19b
    • Qu Wenruo's avatar
      btrfs: remove btrfs_bio_alloc() helper · cd8e0cca
      Qu Wenruo authored
      The helper btrfs_bio_alloc() is almost the same as btrfs_io_bio_alloc(),
      except it's allocating using BIO_MAX_VECS as @nr_iovecs, and initializes
      bio->bi_iter.bi_sector.
      
      However the naming itself is not using "btrfs_io_bio" to indicate its
      parameter is "strcut btrfs_io_bio" and can be easily confused with
      "struct btrfs_bio".
      
      Considering assigned bio->bi_iter.bi_sector is such a simple work and
      there are already tons of call sites doing that manually, there is no
      need to do that in a helper.
      
      Remove btrfs_bio_alloc() helper, and enhance btrfs_io_bio_alloc()
      function to provide a fail-safe value for its @nr_iovecs.
      
      And then replace all btrfs_bio_alloc() callers with
      btrfs_io_bio_alloc().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cd8e0cca
    • Qu Wenruo's avatar
      btrfs: rename btrfs_bio to btrfs_io_context · 4c664611
      Qu Wenruo authored
      The structure btrfs_bio is used by two different sites:
      
      - bio->bi_private for mirror based profiles
        For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records
        how many mirrors are still pending, and save the original endio
        function of the bio.
      
      - RAID56 code
        In that case, RAID56 only utilize the stripes info, and no long uses
        that to trace the pending mirrors.
      
      So btrfs_bio is not always bind to a bio, and contains more info for IO
      context, thus renaming it will make the naming less confusing.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4c664611
    • Filipe Manana's avatar
      btrfs: keep track of the last logged keys when logging a directory · dc287224
      Filipe Manana authored
      After the first time we log a directory in the current transaction, for
      each directory item in a changed leaf of the subvolume tree, we have to
      check if we previously logged the item, in order to overwrite it in case
      its data changed or skip it in case its data hasn't changed.
      
      Checking if we have logged each item before not only wastes times, but it
      also adds lock contention on the log tree. So in order to minimize the
      number of times we do such checks, keep track of the offset of the last
      key we logged for a directory and, on the next time we log the directory,
      skip the checks for any new keys that have an offset greater than the
      offset we have previously saved. This is specially effective for index
      keys, because the offset for these keys comes from a monotonically
      increasing counter.
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 5/5.
      
      The following test was used on a non-debug kernel to measure the impact
      it has on a directory fsync:
      
        $ cat test-dir-fsync.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
      
        NUM_NEW_FILES=100000
        NUM_FILE_DELETES=1000
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        mkdir $MNT/testdir
      
        for ((i = 1; i <= $NUM_NEW_FILES; i++)); do
            echo -n > $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will log the new dir items and the inodes
        # they point to, because these are new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files"
      
        # sync to force transaction commit and wipeout the log.
        sync
      
        del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES ))
        for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do
            rm -f $MNT/testdir/file_$i
        done
      
        # fsync the directory, this will only log dir items, there are no
        # dentries pointing to new inodes.
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
      
        dur=$(( (end - start) / 1000000 ))
        echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
      
        umount $MNT
      
      Test results with NUM_NEW_FILES set to 100 000 and 1 000 000:
      
      **** before patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 848 ms after adding 100000 files
      dir fsync took 175 ms after deleting 1000 files
      
      **** after patchset, 100 000 files, 1000 deletes ****
      
      dir fsync took 758 ms after adding 100000 files  (-11.2%)
      dir fsync took 63 ms after deleting 1000 files   (-94.1%)
      
      **** before patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 9945 ms after adding 1000000 files
      dir fsync took 473 ms after deleting 1000 files
      
      **** after patchset, 1 000 000 files, 1000 deletes ****
      
      dir fsync took 8677 ms after adding 1000000 files (-13.6%)
      dir fsync took 146 ms after deleting 1000 files   (-105.6%)
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dc287224
    • Filipe Manana's avatar
      btrfs: insert items in batches when logging a directory when possible · 086dcbfa
      Filipe Manana authored
      When logging a directory, we scan its directory items from the subvolume
      tree and then copy one by one into the log tree. This is not efficient
      since we generally are able to insert several items in a batch, using a
      single btree operation for adding several items at once. The reason we
      copy items one by one is that we must check if each item was previously
      logged in the current transaction, and if it was we either overwrite it
      or skip it in case its content did not change in the subvolume tree (this
      can happen only for dir item keys, but not for dir index keys), and doing
      such check makes it a bit cumbersome to attempt batch insertions.
      
      However the chances for doing batch insertions are very frequent and
      always happen when:
      
      1) Logging the directory for the first time in the current transaction,
         as none of the items exist in the log tree yet;
      
      2) Logging new dir index keys, because the offset for new dir index keys
         comes from a monotonically increasing counter. This means if we keep
         adding dentries to a directory, through creation of new files and
         sub-directories or by adding new links or renaming from some other
         directory into the one we are logging, all the new dir index keys
         have a new offset that is greater than the offset of any previously
         logged index keys, so we can insert them in batches into the log tree.
      
      For dir item keys, since their offset depends on the result of an hash
      function against the dentry's name, unless the directory is being logged
      for the first time in the current transaction, the chances being able to
      insert the items in the log using batches is pretty much random and not
      predictable, as it depends on the names of the dentries, but still happens
      often enough.
      
      So change directory logging to keep track of consecutive directory items
      that don't exist yet in the log and batch insert them.
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 4/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      086dcbfa
    • Filipe Manana's avatar
      btrfs: factor out the copying loop of dir items from log_dir_items() · eb10d85e
      Filipe Manana authored
      In preparation for the next change, move the loop that processes a leaf
      and copies its directory items to the log, into a separate helper
      function. This makes the next change simpler and it also helps making
      log_dir_items() a bit shorter (specially after the next change).
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 3/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb10d85e
    • Filipe Manana's avatar
      btrfs: remove redundant log root assignment from log_dir_items() · d46fb845
      Filipe Manana authored
      At log_dir_items() we are assigning the exact same value to the local
      variable 'log', once when it's declared and once again shortly after.
      Remove the later assignment as it's pointless.
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 2/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d46fb845
    • Filipe Manana's avatar
      btrfs: remove root argument from btrfs_log_inode() and its callees · 90d04510
      Filipe Manana authored
      The root argument passed to btrfs_log_inode() is unncessary, as it is
      always the root of the inode we are going to log. This root also gets
      unnecessarily propagated to several functions called by btrfs_log_inode(),
      and all of them take the inode as an argument as well. So just remove
      the root argument from these functions and have them get the root from
      the inode where needed.
      
      This patch is part of a patchset comprised of the following 5 patches:
      
        btrfs: remove root argument from btrfs_log_inode() and its callees
        btrfs: remove redundant log root assignment from log_dir_items()
        btrfs: factor out the copying loop of dir items from log_dir_items()
        btrfs: insert items in batches when logging a directory when possible
        btrfs: keep track of the last logged keys when logging a directory
      
      This is patch 1/5. The change log of the last patch (5/5) has performance
      results.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      90d04510
    • Johannes Thumshirn's avatar
      btrfs: zoned: let the for_treelog test in the allocator stand out · 2d81eb1c
      Johannes Thumshirn authored
      The statement which decides if an extent allocation on a zoned device is
      for the dedicated tree-log block group or not and if we can use the block
      group we picked for this allocation is not easy to read but an important
      part of the allocator.
      
      Rewrite into an if condition instead of a plain boolean test to make it
      stand out more, like the version which tests for the dedicated
      data-relocation block group.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2d81eb1c
    • Johannes Thumshirn's avatar
      btrfs: rename setup_extent_mapping in relocation code · 4b01c44f
      Johannes Thumshirn authored
      In btrfs code we have two functions called setup_extent_mapping, one in
      the extent_map code and one in the relocation code. While both are
      private to their respective implementation, this can still be confusing
      for the reader.
      
      So rename the version in relocation.c to setup_relocation_extent_mapping.
      No functional changes.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4b01c44f
    • Johannes Thumshirn's avatar
      btrfs: zoned: allow preallocation for relocation inodes · 960a3166
      Johannes Thumshirn authored
      Now that we use a dedicated block group and regular writes for data
      relocation, we can preallocate the space needed for a relocated inode,
      just like we do in regular mode.
      
      Essentially this reverts commit 32430c61 ("btrfs: zoned: enable
      relocation on a zoned filesystem") as it is not needed anymore.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      960a3166
    • Johannes Thumshirn's avatar
      btrfs: check for relocation inodes on zoned btrfs in should_nocow · 2adada88
      Johannes Thumshirn authored
      Prepare for allowing preallocation for relocation inodes.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2adada88
    • Johannes Thumshirn's avatar
      btrfs: zoned: use regular writes for relocation · e6d261e3
      Johannes Thumshirn authored
      Now that we have a dedicated block group for relocation, we can use
      REQ_OP_WRITE instead of  REQ_OP_ZONE_APPEND for writing out the data on
      relocation.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e6d261e3
    • Johannes Thumshirn's avatar
      btrfs: zoned: only allow one process to add pages to a relocation inode · 35156d85
      Johannes Thumshirn authored
      Don't allow more than one process to add pages to a relocation inode on
      a zoned filesystem, otherwise we cannot guarantee the sequential write
      rule once we're filling preallocated extents on a zoned filesystem.
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      35156d85
    • Johannes Thumshirn's avatar
      btrfs: zoned: add a dedicated data relocation block group · c2707a25
      Johannes Thumshirn authored
      Relocation in a zoned filesystem can fail with a transaction abort with
      error -22 (EINVAL). This happens because the relocation code assumes that
      the extents we relocated the data to have the same size the source extents
      had and ensures this by preallocating the extents.
      
      But in a zoned filesystem we currently can't preallocate the extents as
      this would break the sequential write required rule. Therefore it can
      happen that the writeback process kicks in while we're still adding pages
      to a delalloc range and starts writing out dirty pages.
      
      This then creates destination extents that are smaller than the source
      extents, triggering the following safety check in get_new_location():
      
       1034         if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) {
       1035                 ret = -EINVAL;
       1036                 goto out;
       1037         }
      
      Temporarily create a dedicated block group for the relocation process, so
      no non-relocation data writes can interfere with the relocation writes.
      
      This is needed that we can switch the relocation process on a zoned
      filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme
      like in a non-zoned filesystem using REQ_OP_WRITE and preallocation.
      
      Fixes: 32430c61 ("btrfs: zoned: enable relocation on a zoned filesystem")
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c2707a25
    • Johannes Thumshirn's avatar
      btrfs: introduce btrfs_is_data_reloc_root · 37f00a6d
      Johannes Thumshirn authored
      There are several places in our codebase where we check if a root is the
      root of the data reloc tree and subsequent patches will introduce more.
      
      Factor out the check into a small helper function instead of open coding
      it multiple times.
      Reviewed-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      37f00a6d
    • Qu Wenruo's avatar
      btrfs: unexport repair_io_failure() · 38d5e541
      Qu Wenruo authored
      Function repair_io_failure() is no longer used out of extent_io.c since
      commit 8b9b6f25 ("btrfs: scrub: cleanup the remaining nodatasum
      fixup code"), which removes the last external caller.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      38d5e541
    • Filipe Manana's avatar
      btrfs: do not commit delayed inode when logging a file in full sync mode · f6df27dd
      Filipe Manana authored
      When logging a regular file in full sync mode, we are currently committing
      its delayed inode item. This is to ensure that we never miss copying the
      inode item, with its most up to date data, into the log tree.
      
      However that is not necessary since commit e4545de5 ("Btrfs: fix fsync
      data loss after append write"), because even if we don't find the leaf
      with the inode item when looking for leaves that changed in the current
      transaction, we end up logging the inode item later using the in-memory
      content. In case we find the leaf containing the inode item, we already
      end up using the in-memory inode for filling the inode item in the log
      tree, and not the inode item that is in the fs/subvolume tree, as it
      might be not up to date (copy_items() -> fill_inode_item()).
      
      So don't commit the delayed inode item, which brings a couple of benefits:
      
      1) Avoid writing the inode item to the fs/subvolume btree, saving time and
         reducing lock contention on the btree;
      
      2) In case no other item for the inode was changed, added or deleted in
         the same leaf where the inode item is located, we ended up copying
         all the items in that leaf to the log tree - it's harmless from a
         functional point of view, but it wastes time and log tree space.
      
      This patch is part of a patch set comprised of the following patches:
      
        btrfs: check if a log tree exists at inode_logged()
        btrfs: remove no longer needed checks for NULL log context
        btrfs: do not log new dentries when logging that a new name exists
        btrfs: always update the logged transaction when logging new names
        btrfs: avoid expensive search when dropping inode items from log
        btrfs: add helper to truncate inode items when logging inode
        btrfs: avoid expensive search when truncating inode items from the log
        btrfs: avoid search for logged i_size when logging inode if possible
        btrfs: avoid attempt to drop extents when logging inode for the first time
        btrfs: do not commit delayed inode when logging a file in full sync mode
      
      This is patch 10/10 and the following test results compare a branch with
      the whole patch set applied versus a branch without any of the patches
      applied.
      
      The following script was used to test dbench with 8 and 16 jobs on a
      machine with 12 cores, 64G of RAM, a NVME device and using a non-debug
      kernel config (Debian's default):
      
        $ cat test.sh
        #!/bin/bash
      
        if [ $# -ne 1 ]; then
            echo "Use $0 NUM_JOBS"
            exit 1
        fi
      
        NUM_JOBS=$1
      
        DEV=/dev/nvme0n1
        MNT=/mnt/nvme0n1
        MOUNT_OPTIONS="-o ssd"
        MKFS_OPTIONS="-m single -d single"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $MKFS_OPTIONS $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 120 $NUM_JOBS
      
        umount $MNT
      
      The results were the following:
      
      8 jobs, before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4113896     0.009   238.665
       Close        3021699     0.001     0.590
       Rename        174215     0.082   238.733
       Unlink        830977     0.049   238.642
       Deltree           96     2.232     8.022
       Mkdir             48     0.003     0.005
       Qpathinfo    3729013     0.005     2.672
       Qfileinfo     653206     0.001     0.152
       Qfsinfo       683866     0.002     0.526
       Sfileinfo     335055     0.004     1.571
       Find         1441800     0.016     4.288
       WriteX       2049644     0.010     3.982
       ReadX        6449786     0.003     0.969
       LockX          13400     0.002     0.043
       UnlockX        13400     0.001     0.075
       Flush         288349     2.521   245.516
      
      Throughput 1075.73 MB/sec  8 clients  8 procs  max_latency=245.520 ms
      
      8 jobs, after patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    4154282     0.009   156.675
       Close        3051450     0.001     0.843
       Rename        175912     0.072     4.444
       Unlink        839067     0.048    66.050
       Deltree           96     2.131     5.979
       Mkdir             48     0.002     0.004
       Qpathinfo    3765575     0.005     3.079
       Qfileinfo     659582     0.001     0.099
       Qfsinfo       690474     0.002     0.155
       Sfileinfo     338366     0.004     1.419
       Find         1455816     0.016     3.423
       WriteX       2069538     0.010     4.328
       ReadX        6512429     0.003     0.840
       LockX          13530     0.002     0.078
       UnlockX        13530     0.001     0.051
       Flush         291158     2.500   163.468
      
      Throughput 1105.45 MB/sec  8 clients  8 procs  max_latency=163.474 ms
      
      +2.7% throughput, -40.1% max latency
      
      16 jobs, before patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5457602     0.033   337.098
       Close        4008979     0.002     2.018
       Rename        231051     0.323   254.054
       Unlink       1102209     0.202   337.243
       Deltree          160     6.521    31.720
       Mkdir             80     0.003     0.007
       Qpathinfo    4946147     0.014     6.988
       Qfileinfo     867440     0.001     1.642
       Qfsinfo       907081     0.003     1.821
       Sfileinfo     444433     0.005     2.053
       Find         1912506     0.067     7.854
       WriteX       2724852     0.018     7.428
       ReadX        8553883     0.003     2.059
       LockX          17770     0.003     0.350
       UnlockX        17770     0.002     0.627
       Flush         382533     2.810   353.691
      
      Throughput 1413.09 MB/sec  16 clients  16 procs  max_latency=353.696 ms
      
      16 jobs, after patchset:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    5393156     0.034   303.181
       Close        3961986     0.002     1.502
       Rename        228359     0.320   253.379
       Unlink       1088920     0.206   303.409
       Deltree          160     6.419    30.088
       Mkdir             80     0.003     0.004
       Qpathinfo    48879676     0.015     7.722
       Qfileinfo     857408     0.001     1.651
       Qfsinfo       896343     0.002     2.147
       Sfileinfo     439317     0.005     4.298
       Find         1890018     0.073     8.347
       WriteX       2693356     0.018     6.373
       ReadX        8453485     0.003     3.836
       LockX          17562     0.003     0.486
       UnlockX        17562     0.002     0.635
       Flush         378023     2.802   315.904
      
      Throughput 1454.46 MB/sec  16 clients  16 procs  max_latency=315.910 ms
      
      +2.9% throughput, -11.3% max latency
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f6df27dd