1. 17 Apr, 2023 40 commits
    • btrfs: avoid iterating over all indexes when logging directory · fa4b8cb1
      Filipe Manana authored
      When logging a directory, after copying all directory index items from the
      subvolume tree to the log tree, we iterate over the subvolume tree to find
      all dir index items that are located in leaves COWed (or created) in the
      current transaction. If we keep logging a directory several times during
      the same transaction, we end up iterating over the same dir index items
      every time we log the directory, wasting time and adding extra lock
      contention on the subvolume tree.
      
      So just keep track of the last logged dir index offset in order to start
      the search for that index (+1) the next time the directory is logged, as
      dir index values (key offsets) come from a monotonically increasing
      counter.
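
      The idea can be modelled in user space. The following is only a sketch
      and not the kernel implementation: the class, the field name
      last_logged_index and the starting index value are assumptions made for
      illustration, and the model ignores leaves and transactions entirely.

```python
import bisect


class DirLogState:
    """Toy model of a directory whose index offsets only ever grow."""

    def __init__(self):
        self.index_offsets = []      # stays sorted because offsets are monotonic
        self.next_index = 2          # hypothetical first dir index value
        self.last_logged_index = 0   # hypothetical name for the tracked offset

    def add_entry(self):
        """Add one directory entry, taking the next value from the counter."""
        self.index_offsets.append(self.next_index)
        self.next_index += 1

    def log_directory(self):
        """Return the offsets visited, resuming at last_logged_index + 1."""
        start = bisect.bisect_left(self.index_offsets, self.last_logged_index + 1)
        visited = self.index_offsets[start:]
        if visited:
            self.last_logged_index = visited[-1]
        return visited
```

      In this model, logging the directory again after adding 100 more entries
      visits only those 100 entries instead of everything created so far.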
      
      The following test measures the difference before and after this change:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nullb0
        MNT=/mnt/nullb0
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        # Time values in milliseconds.
        declare -a fsync_times
        # Total number of files added to the test directory.
        num_files=1000000
        # Fsync directory after every N files are added.
        fsync_period=100
      
        mkdir $MNT/testdir
      
        fsync_total_time=0
        for ((i = 1; i <= $num_files; i++)); do
              echo -n > $MNT/testdir/file_$i
      
              if [ $((i % fsync_period)) -eq 0 ]; then
                      start=$(date +%s%N)
                      xfs_io -c "fsync" $MNT/testdir
                      end=$(date +%s%N)
                      fsync_total_time=$((fsync_total_time + (end - start)))
                      fsync_times[i]=$(( (end - start) / 1000000 ))
                      echo -n -e "Progress $i / $num_files\r"
              fi
        done
      
        echo -e "\nHistogram of directory fsync duration in ms:\n"
      
        printf '%s\n' "${fsync_times[@]}" | \
           perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'
      
        fsync_total_time=$((fsync_total_time / 1000000))
        echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
        echo
      
        umount $MNT
      
      The test was run on a non-debug kernel (Debian's default kernel config)
      against a 15G null block device.
      
      Result before this change:
      
         Histogram of directory fsync duration in ms:
      
         Count: 10000
         Range:  3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
         Percentiles:  90th: 71.000; 95th: 77.000; 99th: 81.000
            3.000 -    5.278:  1423 #################################
            5.278 -    8.854:  1173 ###########################
            8.854 -   14.467:   591 ##############
           14.467 -   23.277:  1025 #######################
           23.277 -   37.105:  1422 #################################
           37.105 -   58.809:  2036 ###############################################
           58.809 -   92.876:  2316 #####################################################
           92.876 -  146.346:     6 |
          146.346 -  230.271:     6 |
          230.271 -  362.000:     2 |
      
         Total time spent in fsync: 350527 ms
      
      Result after this change:
      
         Histogram of directory fsync duration in ms:
      
         Count: 10000
         Range:  3.000 - 1088.000; Mean:  8.704; Median:  8.000; Stddev: 12.576
         Percentiles:  90th: 12.000; 95th: 14.000; 99th: 17.000
            3.000 -    6.007:  3222 #################################
            6.007 -   11.276:  5197 #####################################################
           11.276 -   20.506:  1551 ################
           20.506 -   36.674:    24 |
           36.674 -  201.552:     1 |
          201.552 -  353.841:     4 |
          353.841 - 1088.000:     1 |
      
         Total time spent in fsync: 92114 ms
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: dev-replace: error out if we have unrepaired metadata error during · 8eb3dd17
      Qu Wenruo authored
      [BUG]
      Even before the scrub rework, if some corrupted metadata fails to be
      repaired during replace, we still continue replacing and let it finish
      as if nothing were wrong:
      
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
       BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
       BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
       BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
      
      This can lead to unexpected problems for the resulting filesystem.
      
      [CAUSE]
      Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
      But unlike scrub, dev-replace doesn't really bother to check the scrub
      progress, which records all the errors found during replace.
      
      And even if we checked the progress, we could not determine from the
      plain numbers alone which errors are minor and which are critical
      (remember that we don't treat metadata and data checksum errors differently).
      
      This behavior is there from the very beginning.
      
      [FIX]
      Instead of continuing the replace, just error out if we hit an
      unrepaired metadata sector.
      
      Now the dev-replace is rejected with -EIO to let the user know.
      This also means the filesystem has a metadata error that cannot be
      repaired, so the user would be upset anyway.
      
      The new dmesg would look like this:
      
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
       BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
       BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
       BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
       BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
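
      As a rough illustration of the check, and not the kernel code itself, the
      sketch below treats a stripe as two integer bitmaps. The function name,
      the bitmap parameters and the 4 KiB sector size are assumptions made for
      this example.

```python
import errno

SECTORSIZE = 4096  # assumption: 4 KiB sectors


def check_replace_stripe(stripe_logical, error_bitmap, metadata_bitmap):
    """Return (0, None), or (-EIO, logical address of the first bad metadata sector).

    Bit i of each bitmap describes sector i of the stripe: error_bitmap marks
    sectors still bad after repair, metadata_bitmap marks metadata sectors.
    """
    bad = error_bitmap & metadata_bitmap
    if bad == 0:
        return 0, None
    first = (bad & -bad).bit_length() - 1  # index of the lowest set bit
    return -errno.EIO, stripe_logical + first * SECTORSIZE
```

      With the values from the log above, a stripe at logical 5570560 whose
      sector 2 is an unrepaired metadata sector yields -5 and the address
      5578752. An unrepaired data sector alone does not abort the replace in
      this model.
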
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove pointless loop at btrfs_get_next_valid_item() · 524f14bb
      Filipe Manana authored
      It's pointless to have a while loop at btrfs_get_next_valid_item(), as if
      the slot on the current leaf is beyond the last item, we call
      btrfs_next_leaf(), which leaves us at a valid slot of the next leaf (or
      a valid slot in the current leaf if after releasing the path an item gets
      pushed from the next leaf to the current leaf).
      
      So just call btrfs_next_leaf() if the current slot on the current leaf is
      beyond the last item.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: reject unsupported scrub flags · 604e6681
      Qu Wenruo authored
      Since the introduction of scrub interface, the only flag that we support
      is BTRFS_SCRUB_READONLY.  Thus there are no sanity checks: if some
      undefined flags are passed in, we just ignore them.
      
      This is problematic if we want to introduce new scrub flags, as we have
      no way to determine if such flags are supported.
      
      Address the problem by introducing a check for the flags, and if
      unsupported flags are set, return -EOPNOTSUPP to inform the user space.
      
      This check should be backported for all supported kernels before any new
      scrub flags are introduced.
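
      A minimal user-space sketch of such a check follows. The mask name
      BTRFS_SCRUB_SUPPORTED_FLAGS, the helper name and the numeric flag value
      are assumptions made for illustration; only BTRFS_SCRUB_READONLY and
      -EOPNOTSUPP come from the text above.

```python
import errno

BTRFS_SCRUB_READONLY = 1  # assumed numeric value of the only supported flag
BTRFS_SCRUB_SUPPORTED_FLAGS = BTRFS_SCRUB_READONLY  # hypothetical mask name


def check_scrub_flags(flags):
    """Return 0 if only supported bits are set, else -EOPNOTSUPP."""
    if flags & ~BTRFS_SCRUB_SUPPORTED_FLAGS:
        return -errno.EOPNOTSUPP
    return 0
```

      Rejecting unknown bits instead of ignoring them lets user space probe for
      a future flag: an old kernel answers with an error rather than silently
      running a scrub without the requested behavior.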
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reinterpret async discard iops_limit=0 as no delay · f263a7c3
      Boris Burkov authored
      Currently, a limit of 0 results in a hard coded metering over 6 hours.
      Since the default is a set limit, I suspect no one truly depends on this
      rather arbitrary setting. Repurpose it for an arguably more useful
      "unlimited" mode, where the delay is 0.
      
      Note that if block groups are too new, or go fully empty, there is still
      a delay associated with those conditions. Those delays implement
      heuristics for not trimming a region we are relatively likely to fully
      overwrite soon.
      
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: Neal Gompa <neal@gompa.dev>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: set default discard iops_limit to 1000 · cfe3445a
      Boris Burkov authored
      Previously, the default was a relatively conservative 10. This results
      in a 100ms delay, so with ~300 discards in a commit, it takes the full
      30s till the next commit to finish the discards. On a workstation, this
      results in the disk never going idle, wasting power/battery, etc.
      
      Set the default to 1000, which results in using the smallest possible
      delay, which is currently 1ms. The original reporter has shown that this
      does not pathologically keep the disk busy.
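
      The arithmetic behind these numbers can be sketched as follows. This is
      only a model of the figures quoted above: the function names are
      hypothetical, a positive iops_limit is assumed, and the 1ms floor is
      taken from the "smallest possible delay" mentioned in the text.

```python
MSEC_PER_SEC = 1000
MIN_DELAY_MS = 1  # smallest possible delay, per the text above


def discard_delay_ms(iops_limit):
    """Delay between two discards for a positive iops_limit."""
    if iops_limit <= 0:
        raise ValueError("this sketch only models a positive iops_limit")
    return max(MSEC_PER_SEC // iops_limit, MIN_DELAY_MS)


def time_to_drain_ms(num_discards, iops_limit):
    """Total time needed to issue num_discards at the given limit."""
    return num_discards * discard_delay_ms(iops_limit)
```

      With iops_limit=10 the delay is 100ms, so about 300 discards need
      30000ms, the whole commit interval. With iops_limit=1000 the same 300
      discards take about 300ms.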
      
      Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: Neal Gompa <neal@gompa.dev>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused raid56 functions which were dedicated for scrub · aca43fe8
      Qu Wenruo authored
      Since the scrub rework, the following RAID56 functions are no longer
      called:
      
      - raid56_add_scrub_pages()
      - raid56_alloc_missing_rbio()
      - raid56_submit_missing_rbio()
      
      Those functions are all utilized by scrub to handle missing device cases
      for RAID56.
      
      However the new scrub code handles them in a completely different way:
      
      - If it's data stripe, go recovery path through btrfs_submit_bio()
      - If it's P/Q stripe, it would be handled through
        raid56_parity_submit_scrub_rbio()
        And that function would handle dev-replace and repair properly.
      
      Thus we can safely remove those functions.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_bio structure · 13a62fd9
      Qu Wenruo authored
      Since scrub path has been fully moved to scrub_stripe based facilities,
      no more scrub_bio would be submitted.
      Thus we can remove it completely, this involves:
      
      - SCRUB_SECTORS_PER_BIO macro
      - SCRUB_BIOS_PER_SCTX macro
      - SCRUB_MAX_PAGES macro
      - BTRFS_MAX_MIRRORS macro
      - scrub_bio structure
      - scrub_ctx::bios member
      - scrub_ctx::curr member
      - scrub_ctx::bios_in_flight member
      - scrub_ctx::workers_pending member
      - scrub_ctx::list_lock member
      - scrub_ctx::list_wait member
      
      - function scrub_bio_end_io_worker()
      - function scrub_pending_bio_inc()
      - function scrub_pending_bio_dec()
      - function scrub_throttle()
      - function scrub_submit()
      
      - function scrub_find_csum()
      - function drop_csum_range()
      
      - Some unnecessary flush and scrub pauses
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_block and scrub_sector structures · 001e3fc2
      Qu Wenruo authored
      Those two structures are used to represent a bunch of sectors for scrub,
      but now they are fully replaced by scrub_stripe in one go, so we can
      remove them. This involves:
      
      - structure scrub_block
      - structure scrub_sector
      
      - structure scrub_page_private
      - function attach_scrub_page_private()
      - function detach_scrub_page_private()
        Now we no longer need to use page::private to handle subpage.
      
      - function alloc_scrub_block()
      - function alloc_scrub_sector()
      - function scrub_sector_get_page()
      - function scrub_sector_get_page_offset()
      - function scrub_sector_get_kaddr()
      - function bio_add_scrub_sector()
      
      - function scrub_checksum_data()
      - function scrub_checksum_tree_block()
      - function scrub_checksum_super()
      - function scrub_check_fsid()
      - function scrub_block_get()
      - function scrub_block_put()
      - function scrub_sector_get()
      - function scrub_sector_put()
      - function scrub_bio_end_io()
      - function scrub_block_complete()
      - function scrub_add_sector_to_rd_bio()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove the old scrub recheck code · e9255d6c
      Qu Wenruo authored
      The old scrub code has a different entry point for verifying the content,
      and since we have removed the writeback path, we can now start removing
      the re-check part, including:
      
      - scrub_recover structure
      - scrub_sector::recover member
      - function scrub_setup_recheck_block()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_group_good_copy()
      - function scrub_repair_sector_from_good_copy()
      - function scrub_is_page_on_raid56()
      
      - function full_stripe_lock()
      - function search_full_stripe_lock()
      - function get_full_stripe_logical()
      - function insert_full_stripe_lock()
      - function lock_full_stripe()
      - function unlock_full_stripe()
      - btrfs_block_group::full_stripe_locks_root member
      - btrfs_full_stripe_locks_tree structure
        This infrastructure is to ensure RAID56 scrub is properly handling
        recovery and P/Q scrub correctly.
      
        This is no longer needed, before P/Q scrub we will wait for all
        the involved data stripes to be scrubbed first, and RAID56 code has
        internal lock to ensure no race in the same full stripe.
      
      - function scrub_print_warning()
      - function scrub_get_recover()
      - function scrub_put_recover()
      - function scrub_handle_errored_block()
      - function scrub_setup_recheck_block()
      - function scrub_bio_wait_endio()
      - function scrub_submit_raid56_bio_wait()
      - function scrub_recheck_block_on_raid56()
      - function scrub_recheck_block()
      - function scrub_recheck_block_checksum()
      - function scrub_repair_block_from_good_copy()
      - function scrub_repair_sector_from_good_copy()
      
      And two more functions exported temporarily for later cleanup:
      
      - alloc_scrub_sector()
      - alloc_scrub_block()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove the old writeback infrastructure · 16f93993
      Qu Wenruo authored
      Since the whole scrub path has been switched to scrub_stripe based
      solution, the old writeback path can be removed completely, which
      involves:
      
      - scrub_ctx::wr_curr_bio member
      - scrub_ctx::flush_all_writes member
      - function scrub_write_block_to_dev_replace()
      - function scrub_write_sector_to_dev_replace()
      - function scrub_add_sector_to_wr_bio()
      - function scrub_wr_submit()
      - function scrub_wr_bio_end_io()
      - function scrub_wr_bio_end_io_worker()
      
      And one more function needs to be exported temporarily:
      
      - scrub_sector_get()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: remove scrub_parity structure · 5dc96f8d
      Qu Wenruo authored
      The structure scrub_parity is used to indicate that some extents are
      scrubbed for the purpose of RAID56 P/Q scrubbing.
      
      Since the whole RAID56 P/Q scrubbing path has been replaced with new
      scrub_stripe infrastructure, and we no longer need to use scrub_parity
      to modify the behavior of data stripes, we can remove it completely.
      
      This removal involves:
      
      - scrub_parity_workers
        Now only one worker would be utilized, scrub_workers, to do the read
        and repair.
        All writeback would happen at the main scrub thread.
      
      - scrub_block::sparity member
      - scrub_parity structure
      - function scrub_parity_get()
      - function scrub_parity_put()
      - function scrub_free_parity()
      
      - function __scrub_mark_bitmap()
      - function scrub_parity_mark_sectors_error()
      - function scrub_parity_mark_sectors_data()
        These helpers are no longer needed, scrub_stripe has its bitmaps and
        we can use bitmap helpers to get the error/data status.
      
      - scrub_parity_bio_endio()
      - scrub_parity_check_and_repair()
      - function scrub_sectors_for_parity()
      - function scrub_extent_for_parity()
      - function scrub_raid56_data_stripe_for_parity()
      - function scrub_raid56_parity()
        The new code would reuse the scrub read-repair and writeback path.
        Just skip the dev-replace phase.
        And scrub_stripe infrastructure allows us to submit and wait for those
        data stripes before scrubbing P/Q, without extra infrastructure.
      
      The following two functions are temporarily exported for later cleanup:
      
      - scrub_find_csum()
      - scrub_add_sector_to_rd_bio()
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: use scrub_stripe to implement RAID56 P/Q scrub · 1009254b
      Qu Wenruo authored
      Implement the only missing part for scrub: RAID56 P/Q stripe scrub.
      
      The workflow is pretty straightforward for the new function,
      scrub_raid56_parity_stripe():
      
      - Go through the regular scrub path for each data stripe
      
      - Wait for the verification and repair to finish
      
      - Writeback the repaired sectors to data stripes
      
      - Make sure all stripes are properly repaired
        If we have sectors unrepaired, we cannot continue, or we could further
        corrupt the P/Q stripe.
      
      - Submit the rbio for P/Q stripe
        The dev-replace would be handled inside
        raid56_parity_submit_scrub_rbio() path.
      
      - Wait for the above bio to finish
      
      Although the old code is no longer used, we still keep the declaration,
      as the cleanup can be several times larger than this patch itself.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure · e02ee89b
      Qu Wenruo authored
      Switch scrub_simple_mirror() to the new scrub_stripe infrastructure.
      
      Since scrub_simple_mirror() is the core part of scrub (only RAID56
      P/Q stripes don't utilize it), we can get rid of a big chunk of code,
      mostly scrub_extent(), scrub_sectors() and directly called functions.
      
      There is a functionality change:
      
      - Scrub speed throttle now only affects read on the scrubbing device
        Writes (for repair and replace), and reads from other mirrors won't
        be limited by the set limits.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce helper to queue a stripe for scrub · 54765392
      Qu Wenruo authored
      The new helper, queue_scrub_stripe(), would try to queue a stripe for
      scrub.  If all stripes are already in use, we will submit all the
      existing ones and wait for them to finish.
      
      Currently we would queue up to 8 stripes, to enlarge the blocksize to
      512KiB to improve the performance. Sectors repaired on zoned need to be
      relocated instead of in-place fix.
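
      The queueing behavior can be pictured with the toy model below. It is a
      sketch, not the kernel helper: the class, its method names and the
      constant SCRUB_STRIPES_PER_BATCH are hypothetical, and submitting a batch
      is modelled as completing immediately.

```python
BTRFS_STRIPE_LEN = 64 * 1024   # 64 KiB per stripe
SCRUB_STRIPES_PER_BATCH = 8    # hypothetical name; "up to 8 stripes" per the text


class StripeQueue:
    """Toy model: collect stripes and submit them in batches of up to 8."""

    def __init__(self):
        self.pending = []
        self.submitted_batches = []

    def flush(self):
        """Submit every queued stripe; the model treats this as finishing at once."""
        if self.pending:
            self.submitted_batches.append(list(self.pending))
            self.pending.clear()

    def queue_stripe(self, logical):
        """Queue one stripe, flushing first if all slots are already in use."""
        if len(self.pending) == SCRUB_STRIPES_PER_BATCH:
            self.flush()
        self.pending.append(logical)
```

      Eight stripes of 64 KiB each give the 512 KiB batch size mentioned above.
      Queueing 20 stripes in this model submits two full batches and leaves
      four stripes pending until the next flush.
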
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce error reporting functionality for scrub_stripe · 00965807
      Qu Wenruo authored
      The new helper, scrub_stripe_report_errors(), will report the result of
      the scrub to system log.
      
      The main reporting is done by introducing a new helper,
      scrub_print_common_warning(), which is mostly the same content from
      scrub_print_warning(), but without the need for a scrub_block.
      
      Since we're reporting the errors, it's the perfect time to update the
      scrub stats too.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce a writeback helper for scrub_stripe · 058e09e6
      Qu Wenruo authored
      Add a new helper, scrub_write_sectors(), to submit write bios for
      specified sectors to the target disk.
      
      There are several differences compared to read path:
      
      - Utilize btrfs_submit_scrub_write()
        Now we still rely on the @mirror_num based writeback, but the
        requirement is also a little different than regular writeback or read,
        thus we have to call btrfs_submit_scrub_write().
      
      - We cannot write the full stripe back
        We can only write the sectors we have.  There will be two call sites
        later, one for repaired sectors, one for all utilized sectors of
        dev-replace.
      
        Thus the callers should specify their own write_bitmap.
      
      This function only submits the bios and does not wait for them, except
      in the zoned case.
      
      Caller must explicitly wait for the IO to finish.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce the main read repair worker for scrub_stripe · 9ecb5ef5
      Qu Wenruo authored
      The new helper, scrub_stripe_read_repair_worker(), would handle the
      read-repair part:
      
      - Wait for the previous submitted read IO to finish
      
      - Verify the contents of the stripe
      
      - Go through the remaining mirrors, using as large blocksize as possible
        At this stage, we just read out all the failed sectors from each
        mirror and re-verify.
        If there are no more failed sectors, we can exit.
      
      - Go through all mirrors again, sector-by-sector
        This time we read sector by sector, to address cases where one bad
        sector mismatches the drive's internal checksum and causes the whole
        read range to fail.
      
        We put this recovery method as the last resort, as sector-by-sector
        reading is slow, and reading from other mirrors may have already fixed
        the errors.
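
      The two repair passes can be sketched in user space as below. This is an
      illustrative model only: the function names are hypothetical, CRC32 from
      zlib stands in for the filesystem checksum, None represents a sector the
      device cannot read, and both passes only visit the mirrors other than
      the one read first.

```python
import zlib


def verify(data, csum):
    """A sector is good if it was readable and matches its checksum."""
    return data is not None and zlib.crc32(data) == csum


def read_repair(mirrors, csums, first_mirror=0):
    """Return (sectors, failed): best-effort contents and unrepaired indexes.

    mirrors[m][i] holds the bytes of sector i on mirror m, or None if the
    device cannot read that sector at all.
    """
    sectors = list(mirrors[first_mirror])
    failed = {i for i in range(len(csums)) if not verify(sectors[i], csums[i])}
    others = [m for m in range(len(mirrors)) if m != first_mirror]

    # Pass 1: one large read per mirror covering every failed sector.
    # A single unreadable sector makes the whole read fail.
    for m in others:
        if not failed:
            break
        if any(mirrors[m][i] is None for i in failed):
            continue
        for i in sorted(failed):
            if verify(mirrors[m][i], csums[i]):
                sectors[i] = mirrors[m][i]
                failed.discard(i)

    # Pass 2 (last resort): read sector by sector, so one unreadable
    # sector no longer hides its readable neighbours.
    for m in others:
        for i in sorted(failed):
            if verify(mirrors[m][i], csums[i]):
                sectors[i] = mirrors[m][i]
                failed.discard(i)
    return sectors, failed
```

      In this model, if the second mirror has one unreadable sector among the
      failed ones, the first pass recovers nothing from it, while the
      sector-by-sector pass still recovers every other failed sector.
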
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce a helper to verify one scrub_stripe · 97cf8f37
      Qu Wenruo authored
      The new helper, scrub_verify_stripe(), shares the same main workflow of
      the old scrub code.
      
      The major differences are:
      
      - How pages/page_offset is grabbed
        Everything can be grabbed from scrub_stripe easily.
      
      - When error report happens
        Currently the helper only verifies the sectors, not really doing any
        error reporting.
        The error reporting would be done after we have done the repair.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce a helper to verify one metadata block · a3ddbaeb
      Qu Wenruo authored
      The new helper, scrub_verify_one_metadata(), is almost the same as
      scrub_checksum_tree_block().
      
      The difference is in how we grab the pages from other structures.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe · b9795475
      Qu Wenruo authored
      The new helper will search the extent tree to find the first extent of a
      logical range, then fill the sectors array by two loops:
      
      - Loop 1 to fill common bits and metadata generation
      
      - Loop 2 to fill csum data (only for data bgs)
        This loop will use the new btrfs_lookup_csums_bitmap() to fill
        the full csum buffer, and set scrub_sector_verification::csum.
      
      With all the needed info filled by this function, later we only need to
      submit and verify the stripe.
      
      Here we temporarily export the helper to avoid warning on unused static
      function.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: scrub: introduce structure for new BTRFS_STRIPE_LEN based interface · 2af2aaf9
      Qu Wenruo authored
      This patch introduces the following structures:
      
      - scrub_sector_verification
        Contains all the needed info to verify one sector (data or metadata).
      
      - scrub_stripe
        Contains all needed members (mostly bitmap based) to scrub one stripe
        (with a length of BTRFS_STRIPE_LEN).
      
      The basic idea is, we keep the existing per-device scrub behavior, but
      merge all the scrub_bio/scrub_block into one generic structure, and read
      the full BTRFS_STRIPE_LEN stripe on the first try.
      
      This means we will read some sectors which are not scrub target, but
      that's fine. At dev-replace time we only writeback the utilized and good
      sectors, and for read-repair we only writeback the repaired sectors.
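
      The bitmap-based selection of sectors to write back can be sketched with
      plain integers. This is an assumption-laden model, not the structure
      itself: the function and parameter names are hypothetical and a 4 KiB
      sector size is assumed, which gives 16 sectors per 64 KiB stripe.

```python
BTRFS_STRIPE_LEN = 64 * 1024
SECTORSIZE = 4096  # assumption: 4 KiB sectors
SECTORS_PER_STRIPE = BTRFS_STRIPE_LEN // SECTORSIZE  # 16 bits per bitmap
FULL_MASK = (1 << SECTORS_PER_STRIPE) - 1


def replace_write_bitmap(extent_bitmap, error_bitmap):
    """Dev-replace: sectors that belong to an extent and are verified good."""
    return extent_bitmap & ~error_bitmap & FULL_MASK


def repair_write_bitmap(init_error_bitmap, error_bitmap):
    """Read-repair: sectors that were bad on the first read but are good now."""
    return init_error_bitmap & ~error_bitmap & FULL_MASK
```

      For example, if sectors 4 to 7 are utilized, sectors 4 and 5 failed the
      first read, and only sector 4 is still bad after repair, dev-replace
      writes sectors 5 to 7 and read-repair writes only sector 5.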
      
      With every read submitted in BTRFS_STRIPE_LEN, the need for complex bio
      form shaping would be gone.
      Although to get the same performance of the old scrub behavior, we would
      need to submit the initial read for two stripes at once.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce a new helper to submit write bio for repair · 4886ff7b
      Qu Wenruo authored
      Both scrub and read-repair utilize special repair writes that:
      
      - Only writes back to a single device
        Even for read-repair on RAID56, we only update the corrupted data
        stripe itself, not triggering the full RMW path.
      
      - Requires a valid @mirror_num
        For RAID56 case, only @mirror_num == 1 is valid.
        For non-RAID56 cases, we need @mirror_num to locate our stripe.
      
      - No data csum generation needed
      
      These two call sites still have some differences though:
      
      - Read-repair goes plain bio
        It doesn't need a full btrfs_bio, and goes submit_bio_wait().
      
      - New scrub repair would go btrfs_bio
        To simplify both read and write path.
      
      So here this patch would:
      
      - Introduce a common helper, btrfs_map_repair_block()
        Due to the single device nature, we can use an on-stack
        btrfs_io_stripe to pass device and its physical bytenr.
      
      - Introduce a new interface, btrfs_submit_repair_bio(), for later scrub
        code
        This is for the incoming scrub code.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce btrfs_bio::fs_info member · 4317ff00
      Qu Wenruo authored
      Currently we're doing a lot of work for btrfs_bio:
      
      - Checksum verification for data read bios
      - Bio splits if it crosses stripe boundary
      - Read repair for data read bios
      
      However for the incoming scrub patches, we don't want this extra
      functionality at all, just plain logical + mirror -> physical mapping
      ability.
      
      Thus here we do the following changes:
      
      - Introduce btrfs_bio::fs_info
        This is for the new scrub specific btrfs_bio, which would not populate
        btrfs_bio::inode.
        Thus we need this new member to grab an fs_info.
      
        This new member will always be populated.
      
      - Replace @inode argument with @fs_info for btrfs_bio_init() and its
        caller
        Since @inode is no longer a mandatory member, replace it with
        @fs_info, and let involved users populate @inode.
      
      - Skip checksum verification and generation if @bbio->inode is NULL
      
      - Add extra ASSERT()s
        To make sure:
      
        * bbio->inode is properly set for involved read repair path
        * if @file_offset is set, bbio->inode is also populated
      
      - Grab @fs_info from @bbio directly
        We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
        NULL. This involves:
      
        * btrfs_simple_end_io()
        * should_async_write()
        * btrfs_wq_submit_bio()
        * btrfs_use_zone_append()
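
      The changes listed above can be illustrated with a small userspace
      model. This is only a sketch: every model_* name is hypothetical and
      the structs are simplified stand-ins, not the kernel definitions. It
      shows the init helper taking fs_info instead of inode, checksum work
      being skipped when inode is NULL, and fs_info being read from the bbio
      directly rather than through the inode.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct model_fs_info { int id; };
struct model_root { struct model_fs_info *fs_info; };
struct model_inode { struct model_root *root; };

struct model_bbio {
	struct model_inode *inode;     /* optional: NULL for scrub-style bios */
	struct model_fs_info *fs_info; /* always populated */
};

/* Init takes fs_info, not inode; callers that have an inode set it later. */
static void model_bbio_init(struct model_bbio *bbio,
			    struct model_fs_info *fs_info)
{
	bbio->inode = NULL;
	bbio->fs_info = fs_info;
}

/* Checksum verification/generation only applies when an inode is attached. */
static bool model_bbio_needs_csum(const struct model_bbio *bbio)
{
	return bbio->inode != NULL;
}

/* Read fs_info from the bbio itself; bbio->inode may be NULL. */
static struct model_fs_info *model_bbio_fs_info(const struct model_bbio *bbio)
{
	return bbio->fs_info;
}
```

      A bbio initialized without an inode reports that no checksum work is
      needed, yet still yields a valid fs_info.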
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4317ff00
    • Qu Wenruo's avatar
      btrfs: scrub: use dedicated super block verification function to scrub one super block · 2a2dc22f
      Qu Wenruo authored
      There is really no need to go through the super complex scrub_sectors()
      to just handle super blocks.  Introduce a dedicated function to handle
      super block scrubbing.
      
      This new function introduces a behavior change: instead of using the
      complex but concurrent scrub_bio system, here we just go submit-and-wait.
      
      There is little point in caring about the performance of super block
      scrubbing. There are only 3 super blocks at most, and they are all
      scattered around the devices already.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2a2dc22f
    • Anand Jain's avatar
      btrfs: remove redundant release of btrfs_device::alloc_state · f0bb5474
      Anand Jain authored
      Commit 321f69f8 ("btrfs: reset device back to allocation state when
      removing") included adding extent_io_tree_release(&device->alloc_state)
      to btrfs_close_one_device(), which had already been called in
      btrfs_free_device().
      
      The alloc_state tree (IO_TREE_DEVICE_ALLOC_STATE), is created in
      btrfs_alloc_device() and released in btrfs_close_one_device(). Therefore,
      the additional call to extent_io_tree_release(&device->alloc_state) in
      btrfs_free_device() is unnecessary and can be removed.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f0bb5474
    • Anand Jain's avatar
      btrfs: warn for any missed cleanup at btrfs_close_one_device · 1f16033c
      Anand Jain authored
      During my recent search for the root cause of a reported bug, I realized
      that it's a good idea to issue a warning for missed cleanup instead of
      using debug-only assertions. Since most installations run with debug off,
      missed cleanups and premature calls to close could go unnoticed. However,
      these issues are serious enough to warrant reporting and fixing.
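
      The difference between the two styles of check can be sketched in
      userspace. Everything here is an assumption for illustration:
      MODEL_ASSERT, model_warn_on() and the fields of struct model_device are
      hypothetical and do not reflect the checks the patch actually converts.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical debug-only assertion: compiled out unless MODEL_DEBUG is set. */
#ifdef MODEL_DEBUG
#define MODEL_ASSERT(cond) \
	do { if (!(cond)) fprintf(stderr, "assert failed: %s\n", #cond); } while (0)
#else
#define MODEL_ASSERT(cond) ((void)0)
#endif

/* Counts warnings so callers (and tests) can observe them. */
static unsigned int model_warn_count;

/* Always-on warning: reports in every build and returns the condition. */
static bool model_warn_on(bool cond, const char *what)
{
	if (cond) {
		model_warn_count++;
		fprintf(stderr, "WARNING: %s\n", what);
	}
	return cond;
}

/* Hypothetical device state; the real struct btrfs_device differs. */
struct model_device {
	unsigned int pending; /* work that should be gone by close time */
};

static void model_close_one_device(struct model_device *dev)
{
	/* Old style: a no-op in non-debug builds, so the problem is silent. */
	MODEL_ASSERT(dev->pending == 0);

	/* New style: a missed cleanup is reported in every build. */
	model_warn_on(dev->pending != 0, "missed cleanup at device close");
}
```

      Built without MODEL_DEBUG, only the warning path notices a device that
      is closed with work still pending.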
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1f16033c
    • Christoph Hellwig's avatar
      libcrc32c: remove crc32c_impl · 7533583e
      Christoph Hellwig authored
      This was only ever used by btrfs, and the usage just went away.
      This effectively reverts df91f56a ("libcrc32c: Add crc32c_impl
      function").
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7533583e
    • Christoph Hellwig's avatar
      btrfs: don't print the crc32c implementation at module load time · 6e7a367e
      Christoph Hellwig authored
      Btrfs can use various checksumming algorithms, and prints the one used
      for a given file system at mount time.  Don't bother printing the
      crc32c implementation at module load time, as the information is
      available in /sys/fs/btrfs/FSID/checksum.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6e7a367e
    • Christoph Hellwig's avatar
      btrfs: tree-log: factor out a clean_log_buffer helper · e6b430f8
      Christoph Hellwig authored
      The tree-log code has three almost identical copies for the accounting on
      an extent_buffer that doesn't need to be written any more.  The only
      difference is that walk_down_log_tree passed the bytenr used to find the
      buffer instead of extent_buffer.start and calculates the length using the
      nodesize, while the other two callers look at the extent_buffer.len
      field that must always be equivalent to the nodesize.
      
      Factor the code into a common helper.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e6b430f8
    • Christoph Hellwig's avatar
      block: make blkcg_punt_bio_submit optional · 2c275afe
      Christoph Hellwig authored
      Guard all the code to punt bios to a per-cgroup submission helper by a
      new CONFIG_BLK_CGROUP_PUNT_BIO symbol that is selected by btrfs.
      This way non-btrfs kernel builds don't need to have this code.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2c275afe
    • Christoph Hellwig's avatar
      block: async_bio_lock does not need to be bh-safe · 12be09fe
      Christoph Hellwig authored
      async_bio_lock is only taken from bio submission and workqueue context,
      neither of which ever runs in bottom halves.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      12be09fe
    • Christoph Hellwig's avatar
      btrfs, block: move REQ_CGROUP_PUNT to btrfs · 3480373e
      Christoph Hellwig authored
      REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
      a branch to the bio submission hot path.  To fix this, export
      blkcg_punt_bio_submit and let btrfs call it directly.  Add a new
      REQ_FS_PRIVATE flag for btrfs to indicate to its own low-level
      bio submission code that a punt to the cgroup submission helper
      is required.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3480373e
    • Christoph Hellwig's avatar
      btrfs, mm: remove the punt_to_cgroup field in struct writeback_control · 0a0596fb
      Christoph Hellwig authored
      punt_to_cgroup is only used by extent_write_locked_range, but that
      function also directly controls the bio flags for the actual submission.
      Remove the punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly
      in extent_write_locked_range.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0a0596fb
    • Christoph Hellwig's avatar
      btrfs: also use kthread_associate_blkcg for uncompressible ranges · 896d7c1a
      Christoph Hellwig authored
      submit_one_async_extent needs to use kthread_associate_blkcg no matter
      if the range it handles ends up being compressed or not as the deadlock
      risk due to cgroup throttling is the same.  Call kthread_associate_blkcg
      earlier to cover the submit_uncompressed_range case as well.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      896d7c1a
    • Christoph Hellwig's avatar
      btrfs: don't free the async_extent in submit_uncompressed_range · e43a6210
      Christoph Hellwig authored
      Let submit_one_async_extent, which is the only caller of
      submit_uncompressed_range handle freeing of the async_extent in one
      central place.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e43a6210
    • Christoph Hellwig's avatar
      btrfs: move kthread_associate_blkcg out of btrfs_submit_compressed_write · 05d06a5c
      Christoph Hellwig authored
      btrfs_submit_compressed_write should not have to care if it is called
      from a helper thread or not.  Move the kthread_associate_blkcg handling
      into submit_one_async_extent, as that is the one caller that needs it.
      Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
      as that is the routine that sets up the helper thread offload.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      05d06a5c
    • Filipe Manana's avatar
      btrfs: correctly calculate delayed ref bytes when starting transaction · 0f69d1f4
      Filipe Manana authored
      When starting a transaction, we are assuming the number of bytes used for
      each delayed ref update matches the number of bytes used for each item
      update, that is the return value of:
      
         btrfs_calc_insert_metadata_size(fs_info, num_items)
      
      However that is not correct when we are using the free space tree, as we
      need to multiply that value by 2, since delayed ref updates need to modify
      the free space tree besides the extent tree.
      
      So fix this by using btrfs_calc_delayed_ref_bytes() to get the correct
      number of bytes used for delayed ref updates.
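
      The arithmetic can be sketched in userspace as follows. The model_*
      names are hypothetical, and the per-item formula (nodesize times the
      maximum tree level times 2) together with the value 8 for the maximum
      level are assumptions about the kernel helpers rather than something
      stated in this message. Only the doubling for the free space tree is
      taken from the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror BTRFS_MAX_LEVEL; not taken from the commit message. */
#define MODEL_MAX_LEVEL 8

/* Assumed shape of btrfs_calc_insert_metadata_size(): bytes per item update. */
static uint64_t model_calc_insert_metadata_size(uint32_t nodesize,
						unsigned int num_items)
{
	return (uint64_t)nodesize * MODEL_MAX_LEVEL * 2 * num_items;
}

/*
 * Bytes for delayed ref updates: the same base value, doubled when the free
 * space tree is in use, because that tree is modified besides the extent tree.
 */
static uint64_t model_calc_delayed_ref_bytes(uint32_t nodesize,
					     bool free_space_tree,
					     unsigned int num_delayed_refs)
{
	uint64_t bytes = model_calc_insert_metadata_size(nodesize,
							 num_delayed_refs);

	if (free_space_tree)
		bytes *= 2;
	return bytes;
}
```

      Under these assumptions, a 16K nodesize gives 262144 bytes per delayed
      ref without the free space tree and 524288 bytes with it, which is the
      factor of 2 that the old code missed.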
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0f69d1f4
    • Filipe Manana's avatar
      btrfs: make btrfs_block_rsv_full() check more boolean when starting transaction · e4773b57
      Filipe Manana authored
      When starting a transaction we are comparing the result of a call to
      btrfs_block_rsv_full() with 0, but the function returns a boolean. While
      in practice it is not incorrect, as 0 is equivalent to false, it makes it
      a bit odd and less readable. So update the check to not compare against 0
      and instead use the logical not (!) operator.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e4773b57
    • Boris Burkov's avatar
      btrfs: split partial dio bios before submit · b73a6fd1
      Boris Burkov authored
      If an application is doing direct io to a btrfs file and experiences a
      page fault reading from the write buffer, iomap will issue a partial
      bio, and allow the fs to keep going. However, there was a subtle bug in
      this code path in the btrfs dio iomap implementation that led to the
      partial write ending up as a gap in the file's extents and to be read
      back as zeros.
      
      The sequence of events in a partial write, lightly summarized and
      trimmed down for brevity is as follows:
      
      ==== WRITING TASK ====
       btrfs_direct_write
       __iomap_dio_write
       iomap_iter
       btrfs_dio_iomap_begin # create full ordered extent
       iomap_dio_bio_iter
       bio_iov_iter_get_pages # page fault; partial read
       submit_bio # partial bio
       iomap_iter
       btrfs_dio_iomap_end
       btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
      				# submit to finish_ordered_fn wq
       fault_in_iov_iter_readable # btrfs_direct_write detects partial write
       __iomap_dio_write
       iomap_iter
       btrfs_dio_iomap_begin # create second partial ordered extent
       iomap_dio_bio_iter
       bio_iov_iter_get_pages # read all of remainder
       submit_bio # partial bio with all of remainder
       iomap_iter
       btrfs_dio_iomap_end # nothing exciting to do with ordered io
      
      ==== DIO ENDIO ====
      == FIRST PARTIAL BIO ==
       btrfs_dio_end_io
       btrfs_mark_ordered_io_finished # bytes_left > 0
      			        # don't submit to finish_ordered_fn wq
      == SECOND PARTIAL BIO ==
       btrfs_dio_end_io
       btrfs_mark_ordered_io_finished # bytes_left == 0
      			        # submit to finish_ordered_fn wq
      
      ==== BTRFS FINISH ORDERED WQ ====
      == FIRST PARTIAL BIO ==
       btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
      		         # BTRFS_ORDERED_IOERR, just drops the
      		         # ordered_extent
      == SECOND PARTIAL BIO ==
       btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
      		         # extents, csums, etc...
      
      The essence of the problem is that while btrfs_direct_write and iomap
      properly interact to submit all the correct bios, there is insufficient
      logic in the btrfs dio functions (btrfs_dio_iomap_begin,
      btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
      ensure that every bio is at least a part of a completed ordered_extent.
      And it is completing an ordered_extent that results in crucial
      functionality like writing out a file extent for the range.
      
      More specifically, btrfs_dio_end_io treats the ordered extent as
      unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
      Thus, the finish io work doesn't result in file extents, csums, etc.
      In the aftermath, such a file behaves as though it has a hole in it,
      instead of the purportedly written data.
      
      We considered a few options for fixing the bug:
      
        1. treat the partial bio as if we had truncated the file, which would
           result in properly finishing it.
        2. split the ordered extent when submitting a partial bio.
        3. cache the ordered extent across calls to __iomap_dio_rw in
           iter->private, so that we could reuse it and correctly apply
           several bios to it.
      
      I had trouble with 1, and it felt the most like a hack, so I tried 2
      and 3. Since 3 has the benefit of also not creating an extra file
      extent, and avoids an ordered extent lookup during bio submission, it
      felt like the best option. However, that turned out to re-introduce a
      deadlock which this code discarding the ordered_extent between faults
      was meant to fix in the first place. (Link to an explanation of the
      deadlock below.)
      
      Therefore, go with fix 2, which requires a bit more setup work but fixes
      the corruption without introducing the deadlock, which is fundamentally
      caused by the ordered extent existing when we attempt to fault in a
      range that overlaps with it.
      
      Put succinctly, what this patch does is: when we submit a dio bio, check
      if it is partial against the ordered extent stored in dio_data, and if it
      is, extract the ordered_extent that matches the bio exactly out of the
      larger ordered_extent. Keep the remaining ordered_extent around in dio_data
      for cancellation in iomap_end.
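
      The extraction step can be modeled in userspace as below. This is a
      sketch under stated assumptions: struct model_ordered_extent and
      model_extract_for_bio() are hypothetical names, and the model only
      tracks a file offset and a length, ignoring everything else a real
      ordered_extent carries.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of an ordered extent: just a byte range. */
struct model_ordered_extent {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/*
 * Carve the range matching a bio of bio_len bytes off the front of *oe.
 *
 * On return, *out covers exactly the bio and *oe holds whatever remains
 * (num_bytes == 0 if the bio covered the whole ordered extent).
 * Returns true if a split was needed, i.e. the bio was partial.
 */
static bool model_extract_for_bio(struct model_ordered_extent *oe,
				  uint64_t bio_len,
				  struct model_ordered_extent *out)
{
	bool partial = bio_len < oe->num_bytes;
	uint64_t taken = partial ? bio_len : oe->num_bytes;

	out->file_offset = oe->file_offset;
	out->num_bytes = taken;

	/* The remainder stays behind for later cancellation or reuse. */
	oe->file_offset += taken;
	oe->num_bytes -= taken;
	return partial;
}
```

      With a 1M ordered extent and a 256K bio, the extracted part covers the
      first 256K and the remaining 768K stays in the original, so each
      submitted bio maps onto an ordered extent that can complete by itself.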
      
      Thanks to Josef, Christoph, and Filipe for their help figuring out the
      bug and the fix.
      
      Fixes: 51bd9563 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
      Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
      Link: https://pastebin.com/3SDaH8C6
      Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      [ hch: refactored the ordered_extent extraction ]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b73a6fd1