1. 25 Jul, 2022 40 commits
    • Christoph Hellwig's avatar
      btrfs: remove extent writepage address space operation · f3e90c1c
      Christoph Hellwig authored
      Same as in commit 21b4ee70 ("xfs: drop ->writepage completely"): we
      can remove the callback as it's only used in one place - single page
      writeback from memory reclaim and is not called for cgroup writeback at
      all.
      
      We only allow such writeback from kswapd, not from direct memory
      reclaim, and so it is rarely used. When it comes from kswapd, it is
      effectively random dirty page shoot-down, which is horrible for IO
      patterns. We can rely on background writeback to clean all dirty pages
      in an efficient way and not let it be interrupted by kswapd.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f3e90c1c
    • David Sterba's avatar
      btrfs: send: use boolean types for current inode status · 9555e1f1
      David Sterba authored
      The new, new_gen and deleted indicate a status, use boolean type instead
      of int.
      Signed-off-by: David Sterba <dsterba@suse.com>
      9555e1f1
    • David Sterba's avatar
      btrfs: send: remove old TODO regarding ERESTARTSYS · cec3dad9
      David Sterba authored
      The whole send operation is restartable and properly handling a buffer
      write may not be easy. We can't know what caused that and if a short
      delay and retry will fix it or how many retries should be performed in
      case it's a temporary condition.
      
      The error value is returned to the ioctl caller so in case it's a
      transient problem, the user would be notified about the reason. Remove
      the TODO note as there's no plan to handle ERESTARTSYS.
      Signed-off-by: David Sterba <dsterba@suse.com>
      cec3dad9
    • David Sterba's avatar
      btrfs: send: simplify includes · 8234d3f6
      David Sterba authored
      We don't need the whole ctree.h in send.h, none of the data types
      defined there are used.
      Signed-off-by: David Sterba <dsterba@suse.com>
      8234d3f6
    • David Sterba's avatar
      btrfs: send: drop __KERNEL__ ifdef from send.h · e3b4b904
      David Sterba authored
      We don't need this ifdef as the header file is not shared, the protocol
      definition used by userspace should be from libbtrfs or libbtrfsutil.
      Signed-off-by: David Sterba <dsterba@suse.com>
      e3b4b904
    • Christoph Hellwig's avatar
      btrfs: increase direct io read size limit to 256 sectors · ee5b46a3
      Christoph Hellwig authored
      Btrfs currently limits direct I/O reads to a single sector, which goes
      back to commit c329861d ("Btrfs: don't allocate a separate csums
      array for direct reads") from Josef.  That commit changes the direct I/O
      code to ".. use the private part of the io_tree for our csums.", but ten
      years later that isn't how checksums for direct reads work, instead they
      use a csums allocation on a per-btrfs_dio_private basis (which has its
      own performance problem for small I/O, but that will be addressed later).
      
      There is no fundamental limit in btrfs itself to limit the I/O size
      except for the size of the checksum array that scales linearly with
      the number of sectors in an I/O.  Pick a somewhat arbitrary limit of
      256 sectors, which matches what buffered reads typically see as
      the upper limit, as the limit for direct I/O as well.
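
To make the linear scaling concrete, here is a back-of-the-envelope sketch of the checksum array size for one direct read. The 4 KiB sector size and the per-algorithm checksum sizes are assumptions chosen for illustration, and `csum_array_bytes` is a hypothetical helper, not a kernel function.

```python
# Rough arithmetic only: how large is the per-I/O checksum array?
SECTOR_SIZE = 4096  # assumed sector size in bytes

# Assumed checksum sizes in bytes per sector, per algorithm.
CSUM_SIZE = {"crc32c": 4, "xxhash64": 8, "sha256": 32, "blake2b": 32}


def csum_array_bytes(io_bytes, csum="crc32c", sector_size=SECTOR_SIZE):
    """One checksum per sector, so the array grows linearly with I/O size."""
    nr_sectors = -(-io_bytes // sector_size)  # round up to whole sectors
    return nr_sectors * CSUM_SIZE[csum]


# The 256-sector limit corresponds to 1 MiB with 4 KiB sectors.
max_io = 256 * SECTOR_SIZE
```

Under these assumptions a maximum-size read needs 1 KiB of crc32c checksums, or 8 KiB with a 32-byte checksum, which is why capping the sector count also caps the allocation.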
      
      This significantly improves direct read performance.  For example a fio
      run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
      throughput:
      
      Baseline:
      
      READ: bw=65.3MiB/s (68.5MB/s), 65.3MiB/s-65.3MiB/s (68.5MB/s-68.5MB/s), io=19.1GiB (20.6GB), run=300013-300013msec
      
      With this patch:
      
      READ: bw=196MiB/s (206MB/s), 196MiB/s-196MiB/s (206MB/s-206MB/s), io=57.5GiB (61.7GB), run=300006-300006msec
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ee5b46a3
    • Qu Wenruo's avatar
      btrfs: raid56: don't trust any cached sector in __raid56_parity_recover() · f6065f8e
      Qu Wenruo authored
      [BUG]
      There is a small workload which always fails with recent kernels:
      (A simplified version from btrfs/125 test case)
      
        mkfs.btrfs -f -m raid5 -d raid5 -b 1G $dev1 $dev2 $dev3
        mount $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xee 0 1M" $mnt/file1
        sync
        umount $mnt
        btrfs dev scan -u $dev3
        mount -o degraded $dev1 $mnt
        xfs_io -f -c "pwrite -S 0xff 0 128M" $mnt/file2
        umount $mnt
        btrfs dev scan
        mount $dev1 $mnt
        btrfs balance start --full-balance $mnt
        umount $mnt
      
      The failure is always a failure to read some tree blocks:
      
        BTRFS info (device dm-4): relocating block group 217710592 flags data|raid5
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        BTRFS error (device dm-4): parent transid verify failed on 38993920 wanted 9 found 7
        ...
      
      [CAUSE]
      With the recently added debug output, we can see all RAID56 operations
      related to full stripe 38928384:
      
        56.1183: raid56_read_partial: full_stripe=38928384 devid=2 type=DATA1 offset=0 opf=0x0 physical=9502720 len=65536
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=16384 opf=0x0 physical=9519104 len=16384
        56.1185: raid56_read_partial: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x0 physical=9551872 len=16384
        56.1187: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=0 opf=0x1 physical=9502720 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=32768 opf=0x1 physical=9535488 len=16384
        56.1188: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=0 opf=0x1 physical=30474240 len=16384
        56.1189: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=32768 opf=0x1 physical=30507008 len=16384
        56.1218: raid56_write_stripe: full_stripe=38928384 devid=3 type=DATA2 offset=49152 opf=0x1 physical=9551872 len=16384
        56.1219: raid56_write_stripe: full_stripe=38928384 devid=1 type=PQ1 offset=49152 opf=0x1 physical=30523392 len=16384
        56.2721: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2723: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
        56.2724: raid56_parity_recover: full stripe=38928384 eb=39010304 mirror=2
      
      Before we enter raid56_parity_recover(), we have triggered some metadata
      writes for the full stripe 38928384, which leads us to read all the
      sectors from disk.

      Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
      avoid unnecessary reads.
      
      This means, for that full stripe, after any partial write, we will have
      stale data, along with P/Q calculated using that stale data.
      
      Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
      which has data stripes" we haven't submitted all the corrupted P/Q to disk.
      
      When we really need to recover a certain range, aka in
      raid56_parity_recover(), we will use the cached rbio, along with its
      cached sectors (the full stripe is all cached).
      
      This explains why we have no event raid56_scrub_read_recover()
      triggered.
      
      Since we have the cached P/Q which is calculated using the stale data,
      the recovered one will just be stale.
      
      In our particular test case, it will always return the same incorrect
      metadata, thus causing the same error message "parent transid verify
      failed on 39010304 wanted 9 found 7" again and again.
      
      [BTRFS DESTRUCTIVE RMW PROBLEM]
      
      Test case btrfs/125 (and the above workload) always has trouble with
      the destructive read-modify-write (RMW) cycle:
      
              0       32K     64K
      Data1:  | Good  | Good  |
      Data2:  | Bad   | Bad   |
      Parity: | Good  | Good  |
      
      In the above case, if we trigger any write into Data1, we will use the bad
      data in Data2 to re-generate parity, killing the only chance to recover
      Data2, thus Data2 is lost forever.
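
The destructive cycle can be reproduced with a toy single-parity (XOR) model. This is only an illustrative sketch: the byte values and the `xor` helper are made up for the example and do not reflect the kernel implementation.

```python
def xor(a, b):
    """Byte-wise XOR, the parity operation of a single-parity stripe."""
    return bytes(x ^ y for x, y in zip(a, b))


# Original, consistent stripe contents.
data1 = b"\x11" * 4
data2 = b"\x22" * 4
parity = xor(data1, data2)

# Data2 goes bad on disk (for example, stale after a degraded mount).
bad_data2 = b"\x00" * 4

# Before any write, Data2 can still be rebuilt from Data1 and parity.
recovered_before = xor(data1, parity)

# RMW: write new Data1 and recompute parity from the bad on-disk Data2.
new_data1 = b"\x33" * 4
new_parity = xor(new_data1, bad_data2)

# Rebuilding Data2 now only reproduces the bad contents.
recovered_after = xor(new_data1, new_parity)
```

Once the parity has been regenerated from the bad sector, no combination of the remaining on-disk blocks yields the original Data2.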
      
      This destructive RMW cycle is not specific to btrfs RAID56, but there
      are some btrfs specific behaviors making the case even worse:
      
      - Btrfs will cache sectors for unrelated vertical stripes.
      
        In above example, if we're only writing into 0~32K range, btrfs will
        still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
        This behavior is to cache sectors for later update.
      
        Incidentally commit d4e28d9b ("btrfs: raid56: make steal_rbio()
        subpage compatible") has a bug which makes RAID56 never trust the
        cached sectors, thus slightly improving the situation for recovery.
      
        Unfortunately, follow up fix "btrfs: update stripe_sectors::uptodate in
        steal_rbio" will revert the behavior back to the old one.
      
      - Btrfs raid56 partial write will update all P/Q sectors and cache them
      
        This means that even if the data at (64K ~ 96K) of Data2 is free space,
        only (96K ~ 128K) of Data2 is really stale data, and we write into
        that (96K ~ 128K) range, we will still update all the parity
        sectors for the full stripe.
      
        This unnecessary behavior will completely kill the chance of recovery.
      
        Thankfully, an unrelated optimization "btrfs: only write the sectors
        in the vertical stripe which has data stripes" will prevent
        submitting the write bio for untouched vertical sectors.
      
        That optimization will keep the on-disk P/Q untouched for a chance for
        later recovery.
      
      [FIX]
      Although we have no good way to completely fix the destructive RMW
      (unless we go full scrub for each partial write), we can still limit the
      damage.
      
      With patch "btrfs: only write the sectors in the vertical stripe which
      has data stripes" now we won't really submit the P/Q of unrelated
      vertical stripes, so the on-disk P/Q should still be fine.
      
      Now all we really need to do is drop all the cached sectors when doing
      recovery.
      
      By doing this, we have a chance to read the original P/Q from disk, and
      a chance to recover the stale data, while still keeping the cache to
      speed up the regular write path.
      
      In fact, just dropping all the cache for recovery path is good enough to
      allow the test case btrfs/125 along with the small script to pass
      reliably.
      
      The lack of metadata write after the degraded mount, and forced metadata
      COW is saving us this time.
      
      So this patch will fix the behavior by not trusting any cache in
      __raid56_parity_recover(), to solve the problem while still keeping the
      cache useful.
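
A small model of why ignoring the cache helps, again using a toy XOR parity. The `disk` and `cache` dictionaries and `recover_data2` are hypothetical names for this sketch; the real code operates on rbio sectors, not dictionaries.

```python
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


good_data1 = b"\xaa" * 4
good_data2 = b"\xbb" * 4
stale_data2 = b"\x07" * 4  # contents of the device that missed writes

# On-disk state: Data2 is stale, parity still matches the good data.
disk = {
    "data1": good_data1,
    "data2": stale_data2,
    "parity": xor(good_data1, good_data2),
}

# Cache left behind by an earlier partial write: it read the stale Data2
# and cached parity computed from it (that parity never reached the disk).
cache = {
    "data1": good_data1,
    "data2": stale_data2,
    "parity": xor(good_data1, stale_data2),
}


def recover_data2(disk, cache, trust_cache):
    """Rebuild Data2 from Data1 and parity, optionally using the cache."""
    src = cache if trust_cache else disk
    return xor(src["data1"], src["parity"])
```

With `trust_cache=True` the rebuild returns the stale contents again, matching the repeated transid errors; with `trust_cache=False` the on-disk parity yields the good data.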
      
      But please note that this test pass DOES NOT mean we have solved the
      destructive RMW problem, we just do damage control a little
      better.
      
      Related patches:
      
      - btrfs: only write the sectors in the vertical stripe
      - d4e28d9b ("btrfs: raid56: make steal_rbio() subpage compatible")
      - btrfs: update stripe_sectors::uptodate in steal_rbio
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f6065f8e
    • Christoph Hellwig's avatar
      btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished · 711f447b
      Christoph Hellwig authored
      finish_func is always set to finish_ordered_fn, so remove it and also
      the now pointless and somewhat confusingly named
      __endio_write_update_ordered wrapper.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      711f447b
    • Nikolay Borisov's avatar
      btrfs: batch up release of reserved metadata for delayed items used for deletion · 1f4f639f
      Nikolay Borisov authored
      With Filipe's recent rework of the delayed inode code one aspect which
      isn't batched is the release of the reserved metadata of delayed inode's
      delete items. With this patch on top of Filipe's rework and running the
      same test as provided in the description of a patch titled
      "btrfs: improve batch deletion of delayed dir index items" I observe
      the following change of the number of calls to btrfs_block_rsv_release:
      
      Before this change:
      - block_rsv_release:                      1004
      - btrfs_delete_delayed_items_total_time: 14602
      - delete_batches:                          505
      
      After:
      - block_rsv_release:                       510
      - btrfs_delete_delayed_items_total_time: 13643
      - delete_batches:                          507
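
For reference, the relative change implied by these counters can be worked out as follows. This is plain arithmetic on the numbers quoted above; the dictionary keys are shortened names for the sketch, and the unit of the total time is not stated in the source.

```python
before = {"block_rsv_release": 1004, "total_time": 14602, "delete_batches": 505}
after = {"block_rsv_release": 510, "total_time": 13643, "delete_batches": 507}


def pct_change(old, new):
    """Relative change in percent; negative means a reduction."""
    return (new - old) / old * 100.0


release_change = pct_change(before["block_rsv_release"], after["block_rsv_release"])
time_change = pct_change(before["total_time"], after["total_time"])

# Average number of release calls per delete batch.
releases_per_batch_before = before["block_rsv_release"] / before["delete_batches"]
releases_per_batch_after = after["block_rsv_release"] / after["delete_batches"]
```

That is roughly 49% fewer release calls and about 6.6% less total time, with the average dropping from about two releases per batch to about one.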
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1f4f639f
    • Qu Wenruo's avatar
      btrfs: warn about dev extents that are inside the reserved range · 3613249a
      Qu Wenruo authored
      Btrfs on-disk format has reserved the first 1MiB for the primary super
      block (at 64KiB offset) and bootloaders may also use this space.
      
      This behavior was only introduced in the v4.1 btrfs-progs release.
      Although the kernel can ensure we never touch the reserved range of super
      blocks, it's better to inform the end users, and a balance will resolve
      the problem.
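
The condition being warned about can be sketched as a simple range check. The constant and function names below are hypothetical and only mirror the 1MiB and 64KiB figures from the text; the actual kernel check may look different.

```python
SZ_1M = 1 << 20
RESERVED_BYTES = SZ_1M             # first 1MiB of each device is reserved
PRIMARY_SUPER_OFFSET = 64 * 1024   # primary super block lives at 64KiB


def dev_extent_in_reserved_range(physical_start):
    """True if a dev extent begins inside the reserved first 1MiB."""
    return physical_start < RESERVED_BYTES
```

A dev extent starting at physical offset 0 would trigger the warning, while one starting exactly at 1MiB would not.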
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update changelog and message ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      3613249a
    • Qu Wenruo's avatar
      btrfs: use named constant for reserved device space · 37f85ec3
      Qu Wenruo authored
      There's a reserved space on each device of size 1MiB that can be used by
      bootloaders or to avoid accidental overwrite. Use a symbolic constant
      with the explaining comment instead of hard coding the value and
      multiple comments.
      
      Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for
      the primary super block (at offset 64KiB), until then the range could
      have been used by mistake. The kernel has always respected the 1MiB
      range for writes.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      37f85ec3
    • David Sterba's avatar
      btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino · e3059ec0
      David Sterba authored
      There's only one function we pass to iterate_inodes_from_logical as
      iterator, so we can drop the indirection and call it directly, after
      moving the function to backref.c
      Signed-off-by: David Sterba <dsterba@suse.com>
      e3059ec0
    • David Sterba's avatar
      btrfs: simplify parameters of backref iterators · 875d1daa
      David Sterba authored
      The inode reference iterator interface takes parameters that are derived
      from the context parameter, but as it's a void* type the values are
      passed individually.
      
      Change the ctx type to inode_fs_path as it's the only thing we pass and
      drop any parameters that are derived from that.
      Signed-off-by: David Sterba <dsterba@suse.com>
      875d1daa
    • David Sterba's avatar
      btrfs: call inode_to_path directly and drop indirection · ad6240f6
      David Sterba authored
      The functions for iterating inode reference take a function parameter
      but there's only one value, inode_to_path(). Remove the indirection and
      call the function. As paths_from_inode would become just an alias for
      iterate_irefs(), merge the two into one function.
      Signed-off-by: David Sterba <dsterba@suse.com>
      ad6240f6
    • Qu Wenruo's avatar
      btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies() · 6d322b48
      Qu Wenruo authored
      For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies
      directly, only for RAID5 and RAID6 we need some extra handling as
      there's no table value for that.
      
      For RAID10 there's a change from sub_stripes to ncopies. The values are
      the same but semantically we want to use number of copies, as this is
      what btrfs_num_copies does.
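
A rough sketch of the lookup described here. The table values and the special cases for RAID5 and RAID6 are assumptions made for illustration (the source only says these two profiles need extra handling), and `num_copies` is a hypothetical stand-in for btrfs_num_copies().

```python
# Assumed per-profile copy counts, standing in for btrfs_raid_array[].ncopies.
NCOPIES = {
    "single": 1,
    "raid0": 1,
    "dup": 2,
    "raid1": 2,
    "raid1c3": 3,
    "raid1c4": 4,
    "raid10": 2,
}


def num_copies(profile, num_stripes):
    """Number of ways a block can be read or rebuilt for a profile."""
    if profile == "raid5":
        # Assumption: the data itself plus one rebuild from parity.
        return 2
    if profile == "raid6":
        # Assumption: retry once per stripe when rebuilding.
        return num_stripes
    return NCOPIES[profile]
```

For the non-RAID56 profiles the function is a pure table lookup, which is the simplification the commit describes.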
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d322b48
    • Qu Wenruo's avatar
      btrfs: use btrfs_raid_array to calculate number of parity stripes · 0b30f719
      Qu Wenruo authored
      Use the raid table instead of hard coded values and rename the helper as
      it is exported.  This could make later extension on RAID56 based
      profiles easier.
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0b30f719
    • Qu Wenruo's avatar
      btrfs: use btrfs_chunk_max_errors() to replace tolerance calculation · 6dead96c
      Qu Wenruo authored
      In __btrfs_map_block() we have an assignment to @max_errors using
      nr_parity_stripes().
      
      Although it works for RAID56 it's confusing.  Replace it with
      btrfs_chunk_max_errors().
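
The following sketch shows why the old assignment happened to work for RAID56 yet was misleading. The table values are assumptions for illustration, and both helpers are simplified stand-ins for the kernel functions named above.

```python
# Assumed per-profile values: parity stripes vs. tolerated device failures.
RAID_TABLE = {
    "single":  {"nparity": 0, "tolerated_failures": 0},
    "raid0":   {"nparity": 0, "tolerated_failures": 0},
    "dup":     {"nparity": 0, "tolerated_failures": 0},
    "raid1":   {"nparity": 0, "tolerated_failures": 1},
    "raid1c3": {"nparity": 0, "tolerated_failures": 2},
    "raid1c4": {"nparity": 0, "tolerated_failures": 3},
    "raid10":  {"nparity": 0, "tolerated_failures": 1},
    "raid5":   {"nparity": 1, "tolerated_failures": 1},
    "raid6":   {"nparity": 2, "tolerated_failures": 2},
}


def nr_parity_stripes(profile):
    return RAID_TABLE[profile]["nparity"]


def chunk_max_errors(profile):
    return RAID_TABLE[profile]["tolerated_failures"]
```

The two values coincide for RAID5 and RAID6, but they answer different questions: for RAID1 there are no parity stripes even though one failure is tolerated, so naming the tolerance directly states the intent.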
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6dead96c
    • Qu Wenruo's avatar
      btrfs: remove parameter dev_extent_len from scrub_stripe() · bc88b486
      Qu Wenruo authored
      For scrub_stripe() we can easily calculate the dev extent length as we
      have the full info of the chunk.
      
      Thus there is no need to pass @dev_extent_len from the caller, and we
      introduce a helper, btrfs_calc_stripe_length(), to do the calculation
      from extent_map structure.
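
A minimal sketch of how such a calculation could look, assuming the per-device extent length is the logical chunk length divided by the number of data stripes. The parameter names are hypothetical and the real helper takes an extent_map rather than separate numbers.

```python
def calc_stripe_length(chunk_len, num_stripes, nparity, ncopies):
    """Per-device extent length for a chunk (illustrative model only)."""
    # Stripes that hold distinct data: drop parity, then fold in copies.
    data_stripes = (num_stripes - nparity) // ncopies
    return chunk_len // data_stripes


GIB = 1 << 30
raid5_3dev = calc_stripe_length(GIB, num_stripes=3, nparity=1, ncopies=1)
raid10_4dev = calc_stripe_length(GIB, num_stripes=4, nparity=0, ncopies=2)
raid1_2dev = calc_stripe_length(GIB, num_stripes=2, nparity=0, ncopies=2)
```

Under this model a 1 GiB RAID5 chunk on three devices uses 512 MiB per device, while a 1 GiB RAID1 chunk uses the full 1 GiB on each of its two devices.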
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bc88b486
    • David Sterba's avatar
      btrfs: unify tree search helper returning prev and next nodes · 9db33891
      David Sterba authored
      Simplify helper to return only next and prev pointers, we don't need all
      the node/parent/prev/next pointers of __etree_search as there are now
      other specialized helpers. Rename parameters so they follow the naming.
      Signed-off-by: David Sterba <dsterba@suse.com>
      9db33891
    • David Sterba's avatar
      btrfs: make tree search for insert more generic and use it for tree_search · ec60c76f
      David Sterba authored
      With a slight extension of tree_search_for_insert (fill the return node
      and parent return parameters) we can avoid calling __etree_search from
      tree_search, that could be removed eventually in followup patches.
      Signed-off-by: David Sterba <dsterba@suse.com>
      ec60c76f
    • David Sterba's avatar
      btrfs: open code inexact rbtree search in tree_search · bebb22c1
      David Sterba authored
      The call chain from
      
      tree_search
        tree_search_for_insert
          __etree_search
      
      can be open coded and allow further simplifications, here we need a tree
      search with fallback to the next node in case it's not found. This is
      represented as __etree_search parameters next_ret=valid, prev_ret=NULL.
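
The "exact match or fall back to the next node" semantics can be illustrated with a sorted list standing in for the rbtree. This is only a sketch of the search behaviour; `search_or_next` is a hypothetical name, and the kernel walks rb_node pointers rather than using a binary search over a list.

```python
import bisect


def search_or_next(states, offset):
    """Return the range containing offset, else the next range, else None.

    states: sorted, non-overlapping (start, end) ranges with inclusive ends.
    """
    ends = [end for _, end in states]
    # First entry whose end is >= offset: it either contains offset or is
    # the first range that starts after it.
    idx = bisect.bisect_left(ends, offset)
    return states[idx] if idx < len(states) else None


states = [(0, 4095), (8192, 12287)]
```

Here an offset of 100 returns the first range, an offset of 5000 falls in the hole and returns the second range, and an offset past the last range returns nothing.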
      Signed-off-by: David Sterba <dsterba@suse.com>
      bebb22c1
    • David Sterba's avatar
      btrfs: remove node and parent parameters from insert_state · c367602a
      David Sterba authored
      There's no caller left that would pass valid pointers to insert_state so
      we can drop them.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c367602a
    • David Sterba's avatar
      btrfs: add fast path for extent_state insertion · fb8f07d2
      David Sterba authored
      In two cases the exact location where to insert the extent state is
      known at the call time so we don't need to pass it to insert_state that
      takes the fast path.
      Signed-off-by: David Sterba <dsterba@suse.com>
      fb8f07d2
    • David Sterba's avatar
      btrfs: pass bits by value not by pointer for extent_state helpers · 6d92b304
      David Sterba authored
      The bits are passed by pointer to all extent state helpers for no apparent
      reason; the value is only read and never updated, so remove the
      indirection and pass it directly. Also unify the type to u32 where needed.
      Signed-off-by: David Sterba <dsterba@suse.com>
      6d92b304
    • David Sterba's avatar
      btrfs: lift start and end parameters to callers of insert_state · cee51268
      David Sterba authored
      Let callers of insert_state set up the extent state to allow further
      simplifications of the parameters.
      Signed-off-by: David Sterba <dsterba@suse.com>
      cee51268
    • David Sterba's avatar
      btrfs: open code rbtree search in insert_state · c7e118cf
      David Sterba authored
      The rbtree search is a known pattern and can be open coded, which allows
      removing tree_insert and enables further cleanups.
      Signed-off-by: David Sterba <dsterba@suse.com>
      c7e118cf
    • David Sterba's avatar
      btrfs: open code rbtree search in split_state · 12c9cdda
      David Sterba authored
      Preparatory work to remove tree_insert from extent_io.c, the rbtree
      search loop is known and simple so it can be open coded.
      Signed-off-by: David Sterba <dsterba@suse.com>
      12c9cdda
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double for loop inside raid56_parity_scrub_stripe() · 1c10702e
      Qu Wenruo authored
      Originally it's iterating all the sectors which have a dbitmap sector for
      the vertical stripe.

      It can be easily converted to sector bytenr iteration with a test_bit()
      call.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      1c10702e
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double for loop inside raid56_rmw_stripe() · 550cdeb3
      Qu Wenruo authored
      This function doesn't even utilize full stripe skip, so just iterating
      all the data sectors is definitely enough.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      550cdeb3
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double for loop inside alloc_rbio_essential_pages() · aee35e4b
      Qu Wenruo authored
      The double loop is just checking if the page for the vertical stripe
      is allocated.
      
      We can easily convert it to a single loop and get rid of the @stripe variable.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      aee35e4b
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double for loop inside __raid56_parity_recover() · ef340fcc
      Qu Wenruo authored
      The double for loop can be easily converted to single for loop as we're
      really iterating the sectors in their bytenr order.
      
      The only exception is the full stripe skip, however that can also easily
      be done inside the loop.  Add an ASSERT() along with a comment for that
      specific case.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ef340fcc
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double for loop inside finish_rmw() · 36920044
      Qu Wenruo authored
      We can easily calculate the stripe number and sector number inside the
      stripe.  Thus there is not much need for a double for loop.
      
      For the only case we want to skip the whole stripe, we can manually
      increase @total_sector_nr.
      This is not a recommended behavior, thus every time the iterator gets
      modified there will be a comment along with an ASSERT() for it.
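
The loop conversion can be sketched as follows. `nested` and `flat` are hypothetical functions written for comparison, and `skip_stripe` models the one case where a whole stripe is skipped; the variable name `total_sector_nr` follows the text.

```python
def nested(nr_stripes, stripe_nsectors, skip_stripe=None):
    """Old shape: outer loop over stripes, inner loop over sectors."""
    visited = []
    for stripe in range(nr_stripes):
        if stripe == skip_stripe:
            continue
        for sectornr in range(stripe_nsectors):
            visited.append((stripe, sectornr))
    return visited


def flat(nr_stripes, stripe_nsectors, skip_stripe=None):
    """New shape: one loop, stripe and sector derived from the index."""
    visited = []
    total_sector_nr = 0
    while total_sector_nr < nr_stripes * stripe_nsectors:
        stripe = total_sector_nr // stripe_nsectors
        sectornr = total_sector_nr % stripe_nsectors
        if stripe == skip_stripe:
            # Modifying the iterator is only valid at a stripe boundary.
            assert sectornr == 0
            total_sector_nr += stripe_nsectors
            continue
        visited.append((stripe, sectornr))
        total_sector_nr += 1
    return visited
```

Both functions visit the same (stripe, sector) pairs in the same order, and the assertion documents the assumption that a skip always starts at the first sector of a stripe.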
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      36920044
    • Josef Bacik's avatar
      btrfs: tree-log: make the return value for log syncing consistent · f31f09f6
      Josef Bacik authored
      Currently we will return 1 or -EAGAIN if we decide we need to commit
      the transaction rather than sync the log.  In practice this doesn't
      really matter, we interpret any !0 and !BTRFS_NO_LOG_SYNC as needing to
      commit the transaction.  However this makes it hard to figure out what
      the correct thing to do is.
      
      Fix this up by defining BTRFS_LOG_FORCE_COMMIT and using this in all the
      places where we want to force the transaction to be committed.
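
The ambiguity being removed can be shown with a small model of how the return value is interpreted. The numeric values of the two constants are assumptions for this sketch, and `needs_transaction_commit` is a hypothetical helper.

```python
import errno

BTRFS_NO_LOG_SYNC = 256      # assumed value, for illustration only
BTRFS_LOG_FORCE_COMMIT = 1   # assumed value, for illustration only


def needs_transaction_commit(ret):
    """Any value other than 0 or BTRFS_NO_LOG_SYNC means: commit instead."""
    return ret != 0 and ret != BTRFS_NO_LOG_SYNC


# Before the change, both of these spellings meant the same thing.
old_style_returns = [1, -errno.EAGAIN]
```

Since callers treat every such value identically, returning one named constant makes the intent visible at each return site without changing behaviour.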
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f31f09f6
    • Johannes Thumshirn's avatar
      btrfs: add tracepoints for ordered extents · 5bea2508
      Johannes Thumshirn authored
      When debugging a reference counting issue with ordered extents, I've found
      we're lacking a lot of tracepoint coverage in the ordered extent code.
      
      Close these gaps by adding tracepoints after every refcount_inc() in the
      ordered extent code.
      Reviewed-by: Boris Burkov <boris@bur.io>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5bea2508
    • David Sterba's avatar
      btrfs: sysfs: advertise zoned support among features · 15dcccdb
      David Sterba authored
      We've hidden the zoned support in sysfs under debug config for the first
      releases but now the stability is reasonable, though not all features
      have been implemented.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      15dcccdb
    • Christoph Hellwig's avatar
      btrfs: split discard handling out of btrfs_map_block · a4012f06
      Christoph Hellwig authored
      Mapping block for discard doesn't really share any code with the regular
      block mapping case.  Split it out into an entirely separate helper
      that just returns an array of btrfs_discard_stripe structures and the
      number of stripes.
      
      This removes the need for the length field in the btrfs_io_context
      structure, so remove that.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a4012f06
    • Christoph Hellwig's avatar
      btrfs: stop looking at btrfs_bio->iter in index_one_bio · 5eecef71
      Christoph Hellwig authored
      All the bios that index_one_bio operates on are the bios submitted by the
      upper layer.  These are never resubmitted to an actual device by the
      raid56 code, and thus the iter never changes from the initial state.
      Thus we can always just use bi_iter directly as it will be the same as
      the saved copy.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5eecef71
    • Qu Wenruo's avatar
      btrfs: reject log replay if there is unsupported RO compat flag · dc4d3168
      Qu Wenruo authored
      [BUG]
      If we have a btrfs image with dirty log, along with an unsupported RO
      compatible flag:
      
      log_root		30474240
      ...
      compat_flags		0x0
      compat_ro_flags		0x40000003
      			( FREE_SPACE_TREE |
      			  FREE_SPACE_TREE_VALID |
      			  unknown flag: 0x40000000 )
      
      Then even if we can only mount it RO, we will still cause metadata
      update for log replay:
      
        BTRFS info (device dm-1): flagging fs with big metadata feature
        BTRFS info (device dm-1): using free space tree
        BTRFS info (device dm-1): has skinny extents
        BTRFS info (device dm-1): start tree-log replay
      
      This definitely goes against the RO compat flag requirement.
      
      [CAUSE]
      The RO compat flag only forces us to do an RO mount, but we will still do
      log replay for a plain RO mount.

      Thus we end up doing log replay and updating metadata.
      
      This can be very problematic for a new RO compat flag. For example, an
      older kernel cannot understand v2 cache, and if we allow metadata updates
      on an RO mount, it can invalidate/corrupt the v2 cache.
      
      [FIX]
      Just reject the mount unless rescue=nologreplay is provided:
      
        BTRFS error (device dm-1): cannot replay dirty log with unsupport optional features (0x40000000), try rescue=nologreplay instead
      
      We don't want to set rescue=nologreplay directly, as this would make the
      end user read the old data, and cause confusion.
      
      Since such a case is really rare, we're mostly fine to just reject the
      mount with an error message, which also includes the proper workaround.
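
The decision described in this commit can be sketched as a flag check. The supported mask is assumed to contain only the two flags decoded in the dump above, and `check_mount` with its return labels is a hypothetical helper, not the kernel's mount path.

```python
FREE_SPACE_TREE = 1 << 0
FREE_SPACE_TREE_VALID = 1 << 1

# Assumed set of RO compat flags this (hypothetical) kernel understands.
SUPPORTED_COMPAT_RO = FREE_SPACE_TREE | FREE_SPACE_TREE_VALID


def check_mount(compat_ro_flags, log_root, nologreplay):
    """Return (decision, unsupported_bits) for a mount attempt."""
    unsupported = compat_ro_flags & ~SUPPORTED_COMPAT_RO
    if unsupported and log_root != 0 and not nologreplay:
        # A dirty log would require metadata updates we must not make.
        return ("reject", unsupported)
    if unsupported:
        return ("ro_only", unsupported)
    return ("ok", 0)


# Values taken from the super block dump quoted above.
decision, bits = check_mount(0x40000003, log_root=30474240, nologreplay=False)
```

With the dumped values the unknown bit is 0x40000000 and the mount is rejected; passing `nologreplay=True` or having no dirty log allows a read-only mount in this model.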
      
      CC: stable@vger.kernel.org #4.9+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      dc4d3168
    • Qu Wenruo's avatar
      btrfs: make btrfs_super_block::log_root_transid deprecated · 97f09d55
      Qu Wenruo authored
      When using "btrfs inspect-internal dump-super" to inspect an fs with
      dirty log, it always shows the log_root_transid as 0:
      
        log_root                30474240
        log_root_transid        0 <<<
        log_root_level          0
      
      It turns out that btrfs_super_block::log_root_transid is never really
      utilized (it is never even read).

      This dates back to the introduction of btrfs into the upstream kernel.
      
      In fact, when reading log tree root, we always use
      btrfs_super_block::generation + 1 as the expected generation.
      So here we're completely safe to mark this member deprecated.
      
      In theory we can easily reuse this member for other purposes, but to be
      extra safe, here we follow the leafsize way, by adding "__unused_" for
      log_root_transid.
      And we can safely remove the accessors, since there have been no callers
      of them from the very beginning.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      97f09d55