1. 19 Jun, 2023 40 commits
    • btrfs: insert tree mod log move in push_node_left · 5cead542
      Boris Burkov authored
      There is a fairly unlikely race condition in tree mod log rewind that
      can result in a kernel panic which has the following trace:
      
        [530.569] BTRFS critical (device sda3): unable to find logical 0 length 4096
        [530.585] BTRFS critical (device sda3): unable to find logical 0 length 4096
        [530.602] BUG: kernel NULL pointer dereference, address: 0000000000000002
        [530.618] #PF: supervisor read access in kernel mode
        [530.629] #PF: error_code(0x0000) - not-present page
        [530.641] PGD 0 P4D 0
        [530.647] Oops: 0000 [#1] SMP
        [530.654] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted: G S         O  K   5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1
        [530.680] Hardware name: Quanta Mono Lake-M.2 SATA 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017
        [530.703] RIP: 0010:__btrfs_map_block+0xaa/0xd00
        [530.755] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246
        [530.767] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX: f2702d8b8be15100
        [530.784] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI: ffff88885fd973c8
        [530.800] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09: 00000000fffeffff
        [530.816] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12: 0000000000000000
        [530.832] R13: 0000000000001000 R14: 0000000000001000 R15: ffffc9002c2f76f0
        [530.848] FS:  00007f38d64af000(0000) GS:ffff88885fd80000(0000) knlGS:0000000000000000
        [530.866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [530.880] CR2: 0000000000000002 CR3: 00000002b6770004 CR4: 00000000003706e0
        [530.896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [530.912] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [530.928] Call Trace:
        [530.934]  ? btrfs_printk+0x13b/0x18c
        [530.943]  ? btrfs_bio_counter_inc_blocked+0x3d/0x130
        [530.955]  btrfs_map_bio+0x75/0x330
        [530.963]  ? kmem_cache_alloc+0x12a/0x2d0
        [530.973]  ? btrfs_submit_metadata_bio+0x63/0x100
        [530.984]  btrfs_submit_metadata_bio+0xa4/0x100
        [530.995]  submit_extent_page+0x30f/0x360
        [531.004]  read_extent_buffer_pages+0x49e/0x6d0
        [531.015]  ? submit_extent_page+0x360/0x360
        [531.025]  btree_read_extent_buffer_pages+0x5f/0x150
        [531.037]  read_tree_block+0x37/0x60
        [531.046]  read_block_for_search+0x18b/0x410
        [531.056]  btrfs_search_old_slot+0x198/0x2f0
        [531.066]  resolve_indirect_ref+0xfe/0x6f0
        [531.076]  ? ulist_alloc+0x31/0x60
        [531.084]  ? kmem_cache_alloc_trace+0x12e/0x2b0
        [531.095]  find_parent_nodes+0x720/0x1830
        [531.105]  ? ulist_alloc+0x10/0x60
        [531.113]  iterate_extent_inodes+0xea/0x370
        [531.123]  ? btrfs_previous_extent_item+0x8f/0x110
        [531.134]  ? btrfs_search_path_in_tree+0x240/0x240
        [531.146]  iterate_inodes_from_logical+0x98/0xd0
        [531.157]  ? btrfs_search_path_in_tree+0x240/0x240
        [531.168]  btrfs_ioctl_logical_to_ino+0xd9/0x180
        [531.179]  btrfs_ioctl+0xe2/0x2eb0
      
      This occurs when logical inode resolution takes a tree mod log sequence
      number, and then while backref walking hits a rewind on a busy node
      which has the following sequence of tree mod log operations (numbers
      filled in from a specific example, but they are somewhat arbitrary):
      
        REMOVE_WHILE_FREEING slot 532
        REMOVE_WHILE_FREEING slot 531
        REMOVE_WHILE_FREEING slot 530
        ...
        REMOVE_WHILE_FREEING slot 0
        REMOVE slot 455
        REMOVE slot 454
        REMOVE slot 453
        ...
        REMOVE slot 0
        ADD slot 455
        ADD slot 454
        ADD slot 453
        ...
        ADD slot 0
        MOVE src slot 0 -> dst slot 456 nritems 533
        REMOVE slot 455
        REMOVE slot 454
        REMOVE slot 453
        ...
        REMOVE slot 0
      
      When this sequence gets applied via btrfs_tree_mod_log_rewind, it
      allocates a fresh rewind eb, and first inserts the correct key info for
      the 533 elements, then overwrites the first 456 of them, then decrements
      the count by 456 via the add ops, then rewinds the move by doing a
      memmove from 456:988->0:532. We have never written anything past 532, so
      that memmove writes garbage into the 0:532 range. In practice, this
      results in a lot of fully 0 keys. The rewind then puts valid keys into
      slots 0:455 with the last removes, but 456:532 are still invalid.
      
      When search_old_slot uses this eb, if it uses one of those invalid
      slots, it can then read the extent buffer and issue a bio for offset 0
      which ultimately panics looking up extent mappings.
      
      This bad tree mod log sequence gets generated when the node balancing
      code happens to do a balance_node_right followed by a push_node_left
      while logging in the tree mod log. Illustrated for ebs L and R (left and
      right):
      
              L                  R
        start:
        [XXX|YYY|...]      [ZZZ|...|...]
        balance_node_right:
        [XXX|YYY|...]      [...|ZZZ|...] move Z to make room for Y
        [XXX|...|...]      [YYY|ZZZ|...] copy Y from L to R
        push_node_left:
        [XXX|YYY|...]      [...|ZZZ|...] copy Y from R to L
        [XXX|YYY|...]      [ZZZ|...|...] move Z into emptied space (NOT LOGGED!)
      
      This is because balance_node_right logs a move, but push_node_left
      explicitly doesn't. That is because logging the move would remove the
      overwritten src < dst range in the right eb, which was already logged
      when we called btrfs_tree_mod_log_eb_copy. The correct sequence would
      include a move from 456:988 to 0:532 after remove 0:455 and before
      removing 0:532. Reversing that sequence would entail creating keys for
      0:532, then moving those keys out to 456:988, then creating more keys
      for 0:455.
      
      i.e.,
      
        REMOVE_WHILE_FREEING slot 532
        REMOVE_WHILE_FREEING slot 531
        REMOVE_WHILE_FREEING slot 530
        ...
        REMOVE_WHILE_FREEING slot 0
        MOVE src slot 456 -> dst slot 0 nritems 533
        REMOVE slot 455
        REMOVE slot 454
        REMOVE slot 453
        ...
        REMOVE slot 0
        ADD slot 455
        ADD slot 454
        ADD slot 453
        ...
        ADD slot 0
        MOVE src slot 0 -> dst slot 456 nritems 533
        REMOVE slot 455
        REMOVE slot 454
        REMOVE slot 453
        ...
        REMOVE slot 0
      
      Fix this to log the move but avoid the double remove by putting all the
      logging logic in btrfs_tree_mod_log_eb_copy which has enough information
      to detect these cases and properly log moves, removes, and adds. Leave
      btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's
      tree mod logging.
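
      A minimal userspace sketch of the two rewinds described above. It
      assumes a zero-filled array of 1024 slots stands in for the fresh rewind
      eb and that any nonzero value is a valid key. The helper names
      (rewind_removes, rewind_move, invalid_slots_without_move,
      invalid_slots_with_move) are hypothetical and not kernel functions, and
      item counts are not tracked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_SLOTS 1024	/* hypothetical capacity, large enough for slot 988 */

/* Rewinding REMOVE / REMOVE_WHILE_FREEING ops restores a key in each slot. */
static void rewind_removes(uint64_t *keys, int first, int last)
{
	for (int slot = last; slot >= first; slot--)
		keys[slot] = (uint64_t)slot + 1;	/* nonzero marks a valid key */
}

/* Rewinding "MOVE src -> dst" copies nr keys from dst back to src. */
static void rewind_move(uint64_t *keys, int src, int dst, int nr)
{
	assert(src + nr <= NR_SLOTS && dst + nr <= NR_SLOTS);
	memmove(&keys[src], &keys[dst], (size_t)nr * sizeof(keys[0]));
}

static int count_zero_keys(const uint64_t *keys, int first, int last)
{
	int zeros = 0;

	for (int slot = first; slot <= last; slot++)
		zeros += (keys[slot] == 0);
	return zeros;
}

/* Logged sequence that lacks the push_node_left move. */
int invalid_slots_without_move(void)
{
	uint64_t keys[NR_SLOTS] = { 0 };	/* fresh rewind eb */

	rewind_removes(keys, 0, 532);	/* REMOVE_WHILE_FREEING 532..0 */
	rewind_removes(keys, 0, 455);	/* REMOVE 455..0 */
	/* ADD 455..0 only shrinks the item count, keys are untouched */
	rewind_move(keys, 0, 456, 533);	/* MOVE 0 -> 456, reads 456:988 */
	rewind_removes(keys, 0, 455);	/* REMOVE 455..0 */
	return count_zero_keys(keys, 0, 532);
}

/* Corrected sequence with the extra MOVE 456 -> 0 logged. */
int invalid_slots_with_move(void)
{
	uint64_t keys[NR_SLOTS] = { 0 };

	rewind_removes(keys, 0, 532);	/* REMOVE_WHILE_FREEING 532..0 */
	rewind_move(keys, 456, 0, 533);	/* MOVE 456 -> 0, fills 456:988 */
	rewind_removes(keys, 0, 455);	/* REMOVE 455..0 */
	rewind_move(keys, 0, 456, 533);	/* MOVE 0 -> 456, reads 456:988 */
	rewind_removes(keys, 0, 455);	/* REMOVE 455..0 */
	return count_zero_keys(keys, 0, 532);
}
```

      Under these assumptions the first function reports 77 zeroed slots
      (456:532) and the second reports none.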
      
      (Un)fortunately, this is quite difficult to reproduce, and I was only
      able to reproduce it by adding sleeps in btrfs_search_old_slot that
      would encourage more log rewinding during logical_to_ino ioctls. I was
      able to hit the warning in the previous patch in the series without the
      fix quite quickly, but not after this patch.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: warn on invalid slot in tree mod log rewind · 95c8e349
      Boris Burkov authored
      The way that tree mod log tracks the ultimate length of the eb, the
      variable 'n', eventually turns up the correct value, but at intermediate
      steps during the rewind, n can be inaccurate as a representation of the
      end of the eb. For example, it doesn't get updated on move rewinds, and
      it does get updated for add/remove in the middle of the eb.
      
      To detect cases with invalid moves, introduce a separate variable called
      max_slot which tries to track the maximum valid slot in the rewind eb.
      We can then warn if we do a move whose src range goes beyond the max
      valid slot.
      
      There is a commented caveat that it is possible to have this value be an
      overestimate due to the challenge of properly handling 'add' operations
      in the middle of the eb, but in practice it doesn't cause enough of a
      problem to throw out the max idea in favor of tracking every valid slot.
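
      A rough sketch of the max_slot idea, replayed over the tree mod log
      sequences from the push_node_left fix. The struct, the helper names and
      the exact update rules are assumptions for illustration; the kernel code
      may track max_slot differently.

```c
#include <stdbool.h>

/* Hypothetical rewind bookkeeping, not the kernel's data structure. */
struct rewind_state {
	int max_slot;	/* highest slot believed to hold a valid key, -1 if none */
	bool warned;	/* set when a move rewind reads past max_slot */
};

static void rewind_remove(struct rewind_state *st, int slot)
{
	if (slot > st->max_slot)
		st->max_slot = slot;
}

static void rewind_add(struct rewind_state *st, int slot)
{
	/* Adds in the middle are ignored, so max_slot may be an overestimate. */
	if (slot == st->max_slot)
		st->max_slot--;
}

/* Rewinding "MOVE src -> dst" reads dst..dst+nr-1 and writes src..src+nr-1. */
static void rewind_move(struct rewind_state *st, int src, int dst, int nr)
{
	if (dst + nr - 1 > st->max_slot)
		st->warned = true;
	if (src + nr - 1 > st->max_slot)
		st->max_slot = src + nr - 1;
}

/* Replay the example sequence, with or without the push_node_left move. */
bool sequence_warns(bool with_push_left_move)
{
	struct rewind_state st = { .max_slot = -1, .warned = false };
	int slot;

	for (slot = 532; slot >= 0; slot--)	/* REMOVE_WHILE_FREEING */
		rewind_remove(&st, slot);
	if (with_push_left_move)
		rewind_move(&st, 456, 0, 533);	/* MOVE 456 -> 0 */
	for (slot = 455; slot >= 0; slot--)	/* REMOVE */
		rewind_remove(&st, slot);
	for (slot = 455; slot >= 0; slot--)	/* ADD */
		rewind_add(&st, slot);
	rewind_move(&st, 0, 456, 533);		/* MOVE 0 -> 456 */
	for (slot = 455; slot >= 0; slot--)	/* REMOVE */
		rewind_remove(&st, slot);
	return st.warned;
}
```

      In this sketch only the sequence without the push_node_left move trips
      the check, because its move rewind reads 456:988 while max_slot is 532.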
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: disable allocation warnings for compression workspaces · 8ab546bb
      David Sterba authored
      The workspaces for compression are typically much larger than a page,
      and for high zstd levels they are in the range of megabytes. There's a
      fallback to vmalloc but this can still fail (see the report).
      
      Some of the workspaces are preallocated at module load time so we have a
      safe fallback. Otherwise, when a new workspace is needed it's allocated,
      and if this fails the process waits. This means the warning is only
      causing noise and we can use the GFP flag to disable it.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217466
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: open code need_full_stripe conditions · 8680e587
      Christoph Hellwig authored
      need_full_stripe is just a somewhat complicated way to say
      "op != BTRFS_MAP_READ".  Just spell that explicit check out, which makes
      a lot of the code currently using the helper easier to understand.
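
      A small sketch of the equivalence, assuming the map operation enum has
      only the three values shown and that the removed helper compared against
      the two non-read operations. Both the enum layout and the helper body
      are assumptions, and is_not_read is a hypothetical name.

```c
#include <stdbool.h>

/* Simplified stand-in for the btrfs map operation enum (assumed values). */
enum btrfs_map_op {
	BTRFS_MAP_READ,
	BTRFS_MAP_WRITE,
	BTRFS_MAP_GET_READ_MIRRORS,
};

/* Assumed shape of the removed helper: one comparison per non-read op. */
bool need_full_stripe(enum btrfs_map_op op)
{
	return op == BTRFS_MAP_WRITE || op == BTRFS_MAP_GET_READ_MIRRORS;
}

/* The open-coded check that replaces it. */
bool is_not_read(enum btrfs_map_op op)
{
	return op != BTRFS_MAP_READ;
}
```

      With only three operations the two checks agree for every value, so the
      explicit comparison can be used wherever the helper was called.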

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: open code btrfs_map_sblock · 723b8bb1
      Christoph Hellwig authored
      btrfs_map_sblock just hard codes three arguments and calls
      __btrfs_map_block.  Remove it as it doesn't provide any real value, but
      makes following the btrfs_map_block call chains harder.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename __btrfs_map_block to btrfs_map_block · cd4efd21
      Christoph Hellwig authored
      Now that the old btrfs_map_block is gone, drop the leading underscores
      from __btrfs_map_block.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused btrfs_map_block · d69d7ffc
      Christoph Hellwig authored
      There are no users of btrfs_map_block left, so remove it.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: optimize simple reads in btrfsic_map_block · 78a213a0
      Christoph Hellwig authored
      Pass a smap into __btrfs_map_block so that the usual case of a read that
      doesn't require parity raid recovery doesn't need an extra memory
      allocation for the btrfs_io_context.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused BTRFS_MAP_DISCARD · 3965a4c7
      Christoph Hellwig authored
      BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to
      btrfs_op(), and it is only checked in two ASSERTs.
      
      Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental
      REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in a4012f06
      ("btrfs: split discard handling out of btrfs_map_block").

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: add xxhash to fast checksum implementations · efcfcbc6
      David Sterba authored
      The implementation of XXHASH is now CPU only but still fast enough to be
      considered for the synchronous checksumming, like non-generic crc32c.
      
      A userspace benchmark comparing it to various implementations (patched
      hash-speedtest from btrfs-progs):
      
        Block size:     4096
        Iterations:     1000000
        Implementation: builtin
        Units:          CPU cycles
      
              NULL-NOP: cycles:     73384294, cycles/i       73
           NULL-MEMCPY: cycles:    228033868, cycles/i      228,    61664.320 MiB/s
            CRC32C-ref: cycles:  24758559416, cycles/i    24758,      567.950 MiB/s
             CRC32C-NI: cycles:   1194350470, cycles/i     1194,    11773.433 MiB/s
        CRC32C-ADLERSW: cycles:   6150186216, cycles/i     6150,     2286.372 MiB/s
        CRC32C-ADLERHW: cycles:    626979180, cycles/i      626,    22427.453 MiB/s
            CRC32C-PCL: cycles:    466746732, cycles/i      466,    30126.699 MiB/s
                XXHASH: cycles:    860656400, cycles/i      860,    16338.188 MiB/s
      
      The comparison covers the purely software implementation (ref), the
      current but outdated accelerated one using the crc32q instruction (NI),
      optimized implementations by M. Adler
      (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775),
      and the best one, taken from the kernel, which uses the PCLMULQDQ
      instruction (PCL).

      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: pass the new logical address to split_extent_map · f000bc6f
      Christoph Hellwig authored
      split_extent_map splits off the first chunk of an extent map into a new
      one.  One of the two users is the zoned I/O completion code that wants to
      rewrite the logical block start address right after this split.  Pass in
      the logical address to be set in the split off first extent_map as an
      argument to avoid an extra extent tree lookup for this case.
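
      A reduced sketch of such a split, assuming an extent map can be modelled
      by just a file offset, a length and a logical start. struct em_sketch
      and split_em_sketch are hypothetical and much simpler than the real
      extent_map code.

```c
#include <stdint.h>

/* Hypothetical, reduced extent map: file offset, length, logical start. */
struct em_sketch {
	uint64_t start;
	uint64_t len;
	uint64_t block_start;
};

/*
 * Split the first `pre` bytes of `em` off into `first` and give that piece
 * the caller-supplied logical address. `em` keeps the remainder.
 * Returns 0 on success, -1 if `pre` does not leave two non-empty pieces.
 */
int split_em_sketch(struct em_sketch *em, uint64_t pre, uint64_t new_logical,
		    struct em_sketch *first)
{
	if (pre == 0 || pre >= em->len)
		return -1;

	first->start = em->start;
	first->len = pre;
	first->block_start = new_logical;	/* set directly, no later lookup */

	em->start += pre;
	em->len -= pre;
	em->block_start += pre;
	return 0;
}
```

      Passing new_logical in means the caller does not have to find the split
      off piece again just to update its start address.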

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defer splitting of ordered extents until I/O completion · 71df088c
      Christoph Hellwig authored
      The btrfs zoned completion code currently needs an ordered_extent and
      extent_map per bio so that it can account for the non-predictable
      write location from Zone Append.  To achieve that it currently splits
      the ordered_extent and extent_map at I/O submission time, and then
      records the actual physical address in the ->physical field of the
      ordered_extent.
      
      This patch instead switches to record the "original" physical address
      that the btrfs allocator assigned in spare space in the btrfs_bio,
      and then rewrites the logical address in the btrfs_ordered_sum
      structure at I/O completion time.  This allows the ordered extent
      completion handler to simply walk the list of ordered csums and
      split the ordered extent as needed.  This removes an extra ordered
      extent and extent_map lookup and manipulation during the I/O
      submission path, and instead batches it in the I/O completion path
      where we need to touch these anyway.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle completed ordered extents in btrfs_split_ordered_extent · 52b1fdca
      Christoph Hellwig authored
      To delay splitting ordered_extents to I/O completion time we need to be
      able to handle fully completed ordered extents in
      btrfs_split_ordered_extent.  Besides a bit of accounting this primarily
      involves moving over the csums to the split bio for the range that it
      covers, which is simple enough because we always have one
      btrfs_ordered_sum per bio.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: atomically insert the new extent in btrfs_split_ordered_extent · 816f589b
      Christoph Hellwig authored
      Currently there is a small race window in btrfs_split_ordered_extent,
      where the reduced old extent can be looked up on the per-inode rbtree
      or the per-root list while the newly split out one isn't visible yet.
      
      Fix this by open coding btrfs_alloc_ordered_extent in
      btrfs_split_ordered_extent, and holding the tree lock and
      root->ordered_extent_lock over the entire tree and extent manipulation.
      
      Note that this introduces new lock ordering because previously
      ordered_extent_lock was never held over the tree lock.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpers · 53d9981c
      Christoph Hellwig authored
      Split two low-level helpers out of btrfs_alloc_ordered_extent to allocate
      and insert the ordered extent.  The pure alloc helper will be used to
      improve btrfs_split_ordered_extent.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: return the new ordered_extent from btrfs_split_ordered_extent · b0307e28
      Christoph Hellwig authored
      Return the ordered_extent split from the passed in one.  This will be
      needed to be able to store an ordered_extent in the btrfs_bio.

      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: reorder conditions in btrfs_extract_ordered_extent · ebdb44a0
      Christoph Hellwig authored
      There is no good reason for doing either the ordered extent split or the
      extent_map split before the other in terms of failure implications, but
      doing the extent_map split first will simplify some upcoming refactoring.

      Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move split_extent_map to extent_map.c · a6f3e205
      Christoph Hellwig authored
      split_extent_map doesn't have anything to do with the other code in
      inode.c, so move it to extent_map.c.
      
      This also allows marking replace_extent_mapping static.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: record orig_physical only for the original bio · 3887653c
      Christoph Hellwig authored
      btrfs_submit_dev_bio is also called for clone bios that aren't embedded
      into a btrfs_bio structure, but the previous commit "btrfs: optimize the
      logical to physical mapping for zoned writes" added code to assign
      btrfs_bio.orig_physical in it.
      
      This is harmless right now as only the single data profile can be used
      on zoned devices, but will blow up when the RAID stripe tree is added.
      Move it out into the single I/O specific branch in the caller.

      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: optimize the logical to physical mapping for zoned writes · cbfce4c7
      Christoph Hellwig authored
      The current code to store the final logical to physical mapping for a
      zone append write in the extent tree is rather inefficient.  It first has
      to split the ordered extent so that there is one ordered extent per bio,
      so that it can look up the ordered extent on I/O completion in
      btrfs_record_physical_zoned and store the physical LBA returned by the
      block driver in the ordered extent.
      
      btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
      see what physical address the logical address for this bio / ordered
      extent is mapped to, and then rewrite it in the extent tree.
      
      To optimize this process, we can store the physical address assigned in
      the chunk tree to the original logical address and a pointer to
      the btrfs_ordered_sum structure in the btrfs_bio structure, and then use
      this information to rewrite the logical address in the btrfs_ordered_sum
      structure directly at I/O completion time in btrfs_record_physical_zoned.
      btrfs_rewrite_logical_zoned then simply updates the logical address in
      the extent tree and the ordered_extent itself.
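
      The rewrite at completion time can be sketched as plain arithmetic,
      assuming the logical address shifts by the same distance that the device
      moved the write away from the physical address the allocator expected.
      The function name and that exact formula are assumptions made for
      illustration.

```c
#include <stdint.h>

/*
 * Hypothetical helper: shift the recorded logical address by the distance
 * between the physical address the allocator assigned (orig_physical) and
 * the one the zone append actually wrote to (written_physical).
 * Unsigned arithmetic handles a write landing before or after the original.
 */
uint64_t rewritten_logical(uint64_t logical, uint64_t orig_physical,
			   uint64_t written_physical)
{
	return logical + (written_physical - orig_physical);
}
```

      Because both physical addresses are already at hand when the bio
      completes, no chunk tree lookup is needed for this step.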
      
      The code in btrfs_rewrite_logical_zoned now runs for all data I/O
      completions in zoned file systems, which is fine as there is no remapping
      to do for non-append writes to conventional zones or for relocation, and
      the overhead for quickly breaking out of the loop is very low.
      
      Because zoned file systems now need the ordered_sums structure to
      record the actual write location returned by zone append, allocate dummy
      structures without the csum array for them when the I/O doesn't use
      checksums, and free them when completing the ordered_extent.
      
      Note that the btrfs_bio doesn't grow as the new fields are placed into
      a union that is so far not used for data writes and has plenty of space
      left in it.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical · 5cfe76f8
      Christoph Hellwig authored
      btrfs_ordered_sum::bytenr stores a logical address.  Make that clear by
      renaming it to ->logical.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: mark the len field in struct btrfs_ordered_sum as unsigned · 6e4b2479
      Christoph Hellwig authored
      len can't ever be negative, so mark it as a u32 instead of int.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: don't call btrfs_record_physical_zoned for failed append · e9cb93b9
      Christoph Hellwig authored
      When a zoned append command fails there is no written address reported,
      so don't try to record it.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED · dd8b7b04
      Christoph Hellwig authored
      Add an IS_ENABLED check for CONFIG_BLK_DEV_ZONED in addition to the
      run-time check for the zone size.  This allows making use of
      compiler dead code elimination for code guarded by btrfs_is_zoned, and
      for example provide just a dangling prototype for a function instead of
      adding a stub.

      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make btrfs_destroy_delayed_refs() return void · 99f09ce3
      Filipe Manana authored
      btrfs_destroy_delayed_refs() always returns 0 and its single caller does
      not check its return value, as it also returns void, and so does the
      caller's caller and so on. This is because we are in the transaction abort
      path, where we have no way to deal with errors (we are in a critical
      situation) and all cleanup of resources works in a best effort fashion.
      So make btrfs_destroy_delayed_refs() return void.

      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary prototype declarations at disk-io.c · 184533e3
      Filipe Manana authored
      We have a few static functions at disk-io.c for which we have a forward
      declaration of their prototype, but it's not needed because all those
      functions are defined before they are called, so remove them.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use a single switch statement when initializing delayed ref head · f1ed785a
      Filipe Manana authored
      At init_delayed_ref_head(), we are using two separate if statements to
      check the delayed ref head action, and initializing 'must_insert_reserved'
      to false twice, once when the variable is declared and once again in an
      else branch.
      
      Make this simpler and more straightforward by having a single switch
      statement, also moving the comment about a drop action to the
      corresponding switch case to make it more clear and eliminating the
      duplicated initialization of 'must_insert_reserved' to false.
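
      A sketch of the resulting shape, assuming three actions and a reference
      count modifier next to 'must_insert_reserved'. The enum, the struct, the
      count_mod values and the choice of which action sets the flag are
      illustrative assumptions, not taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the delayed ref head actions. */
enum ref_action_sketch {
	SKETCH_UPDATE_HEAD,
	SKETCH_DROP_REF,
	SKETCH_ADD_EXTENT,
};

struct head_init_sketch {
	int count_mod;			/* assumed reference count modifier */
	bool must_insert_reserved;
};

struct head_init_sketch init_head_sketch(enum ref_action_sketch action)
{
	/* Single initialization to false, no else branch needed later. */
	struct head_init_sketch head = {
		.count_mod = 0,
		.must_insert_reserved = false,
	};

	switch (action) {
	case SKETCH_UPDATE_HEAD:
		break;
	case SKETCH_DROP_REF:
		/* The comment about the drop action lives with its case. */
		head.count_mod = -1;
		break;
	case SKETCH_ADD_EXTENT:
		head.count_mod = 1;
		head.must_insert_reserved = true;
		break;
	}
	return head;
}
```

      Each action is handled in exactly one place, which is the readability
      gain the single switch is after.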

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use bool type for delayed ref head fields that are used as booleans · 61c681fe
      Filipe Manana authored
      There's no point in having several fields defined as 1 bit unsigned int in
      struct btrfs_delayed_ref_head, we can instead use a bool type, it makes
      the code a bit more readable and it doesn't change the structure size.
      So switch them to proper booleans.
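
      A quick way to see why the size can stay the same, using two
      hypothetical structs with a 64-bit member followed by four flags. The
      layouts and field names are assumptions, not the real
      btrfs_delayed_ref_head, and struct padding is ABI dependent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with 1 bit unsigned int fields. */
struct head_bitfields {
	uint64_t bytenr;
	unsigned int is_data:1;
	unsigned int is_system:1;
	unsigned int must_insert_reserved:1;
	unsigned int processing:1;
};

/* The same flags as proper booleans. */
struct head_bools {
	uint64_t bytenr;
	bool is_data;
	bool is_system;
	bool must_insert_reserved;
	bool processing;
};

size_t bitfield_layout_size(void)
{
	return sizeof(struct head_bitfields);
}

size_t bool_layout_size(void)
{
	return sizeof(struct head_bools);
}
```

      On common ABIs the four bit fields occupy one unsigned int and the four
      bools occupy four bytes, so both layouts pad out to the same size.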

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: assert correct lock is held at btrfs_select_ref_head() · 1e6b71c3
      Filipe Manana authored
      The function btrfs_select_ref_head() iterates over the red black tree of
      delayed reference heads, which is protected by the spinlock in the delayed
      refs root. The function doesn't take the lock, it's taken by its single
      caller, btrfs_obtain_ref_head(), because it needs to call that function
      and btrfs_delayed_ref_lock() in the same critical section (delimited by
      that spinlock). So assert at btrfs_select_ref_head() that we are holding
      the expected lock.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: get rid of label and goto at insert_delayed_ref() · 798f4d95
      Filipe Manana authored
      At insert_delayed_ref() there's no point in having a label and goto in the
      case we were able to insert the delayed ref head. We can just add the code
      under label to the if statement's body and return immediately, and also
      there is no need to track the return value in a variable, we can just
      return a literal true or false value directly. So do those changes.

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: make insert_delayed_ref() return a bool instead of an int · f38462c4
      Filipe Manana authored
      insert_delayed_ref() can only return 0 or 1, to indicate if the given
      delayed reference was added to the head reference or if it was merged
      into an existing delayed ref, respectively. So just make it return a
      boolean instead.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      f38462c4
    • Filipe Manana's avatar
      btrfs: use a bool to track qgroup record insertion when adding ref head · 293f8197
      Filipe Manana authored
      We are using an integer as a boolean to track the qgroup record insertion
      status when adding a delayed reference head. Since all we need is a
      boolean, switch the type from int to bool to make it more obvious.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      293f8197
    • Filipe Manana's avatar
      btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node · 4d34ad34
      Filipe Manana authored
      The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node,
      as we can check whether a reference is in the tree or not simply by
      checking its red black tree node member with RB_EMPTY_NODE(), as when we
      remove it from the tree we always call RB_CLEAR_NODE(). So remove that
      field and use RB_EMPTY_NODE().
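
      The idea can be sketched in userspace as below. The two macros mirror
      the kernel's definitions in spirit (a cleared node stores its own
      address in the parent word); struct ref and the ref_*() helpers are
      hypothetical, and ref_link() only fakes linking without any rebalancing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in modelled on the kernel's rb_node. */
struct rb_node {
	unsigned long __rb_parent_color;
	struct rb_node *rb_right;
	struct rb_node *rb_left;
};

/* A node whose parent word points at itself is treated as "not in a tree". */
#define RB_EMPTY_NODE(node) \
	((node)->__rb_parent_color == (unsigned long)(node))
#define RB_CLEAR_NODE(node) \
	((node)->__rb_parent_color = (unsigned long)(node))

/* Hypothetical ref object: no separate in_tree flag is needed. */
struct ref {
	struct rb_node ref_node;
};

void ref_init(struct ref *r)
{
	RB_CLEAR_NODE(&r->ref_node);
}

/* Stand-in for linking under a parent; a real insertion also rebalances. */
void ref_link(struct ref *r, struct rb_node *parent)
{
	assert(parent != &r->ref_node);
	r->ref_node.__rb_parent_color = (unsigned long)parent;
	r->ref_node.rb_left = NULL;
	r->ref_node.rb_right = NULL;
}

/* Removal always clears the node, which is what makes the check reliable. */
void ref_unlink(struct ref *r)
{
	RB_CLEAR_NODE(&r->ref_node);
}

bool ref_in_tree(const struct ref *r)
{
	return !RB_EMPTY_NODE(&r->ref_node);
}
```

      The check is only trustworthy because every removal path clears the
      node, which is the invariant the commit message relies on.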
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d34ad34
    • Filipe Manana's avatar
      btrfs: remove unused is_head field from struct btrfs_delayed_ref_node · 53499d5f
      Filipe Manana authored
      The 'is_head' field of struct btrfs_delayed_ref_node is no longer used after
      commit d278850e ("btrfs: remove delayed_ref_node from ref_head"),
      so remove it.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      53499d5f
    • Filipe Manana's avatar
      btrfs: reorder some members of struct btrfs_delayed_ref_head · 315dd5cc
      Filipe Manana authored
      Currently struct btrfs_delayed_ref_head has its 'bytenr' and 'href_node'
      members in different cache lines (even on a release, non-debug, kernel).
      This is not optimal because when iterating the red black tree of delayed
      ref heads to insert a new delayed ref head (htree_insert()) we have to
      pull in 2 cache lines for each delayed ref head we find in a path, one for
      the tree node (struct rb_node) and another one for the 'bytenr' field. The
      same applies when searching for an existing delayed ref head
      (find_ref_head()).
      On a release (non-debug) kernel, the structure also has two 4 bytes holes,
      which makes it 8 bytes longer than necessary. Its current layout is the
      following:
      
        struct btrfs_delayed_ref_head {
                u64                        bytenr;               /*     0     8 */
                u64                        num_bytes;            /*     8     8 */
                refcount_t                 refs;                 /*    16     4 */
      
                /* XXX 4 bytes hole, try to pack */
      
                struct mutex               mutex;                /*    24    32 */
                spinlock_t                 lock;                 /*    56     4 */
      
                /* XXX 4 bytes hole, try to pack */
      
                /* --- cacheline 1 boundary (64 bytes) --- */
                struct rb_root_cached      ref_tree;             /*    64    16 */
                struct list_head           ref_add_list;         /*    80    16 */
                struct rb_node             href_node __attribute__((__aligned__(8))); /*    96    24 */
                struct btrfs_delayed_extent_op * extent_op;      /*   120     8 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                int                        total_ref_mod;        /*   128     4 */
                int                        ref_mod;              /*   132     4 */
                unsigned int               must_insert_reserved:1; /*   136: 0  4 */
                unsigned int               is_data:1;            /*   136: 1  4 */
                unsigned int               is_system:1;          /*   136: 2  4 */
                unsigned int               processing:1;         /*   136: 3  4 */
      
                /* size: 144, cachelines: 3, members: 15 */
                /* sum members: 128, holes: 2, sum holes: 8 */
                /* sum bitfield members: 4 bits (0 bytes) */
                /* padding: 4 */
                /* bit_padding: 28 bits */
                /* forced alignments: 1 */
                /* last cacheline: 16 bytes */
        } __attribute__((__aligned__(8)));
      
      This change reorders the 'href_node' and 'refs' members so that we have
      the 'href_node' in the same cache line as the 'bytenr' field, while also
      eliminating the two holes and reducing the structure size from 144 bytes
      down to 136 bytes, so we can now have 30 ref heads per 4K page (on x86_64)
      instead of 28. The new structure layout after this change is now:
      
        struct btrfs_delayed_ref_head {
                u64                        bytenr;               /*     0     8 */
                u64                        num_bytes;            /*     8     8 */
                struct rb_node             href_node __attribute__((__aligned__(8))); /*    16    24 */
                struct mutex               mutex;                /*    40    32 */
                /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
                refcount_t                 refs;                 /*    72     4 */
                spinlock_t                 lock;                 /*    76     4 */
                struct rb_root_cached      ref_tree;             /*    80    16 */
                struct list_head           ref_add_list;         /*    96    16 */
                struct btrfs_delayed_extent_op * extent_op;      /*   112     8 */
                int                        total_ref_mod;        /*   120     4 */
                int                        ref_mod;              /*   124     4 */
                /* --- cacheline 2 boundary (128 bytes) --- */
                unsigned int               must_insert_reserved:1; /*   128: 0  4 */
                unsigned int               is_data:1;            /*   128: 1  4 */
                unsigned int               is_system:1;          /*   128: 2  4 */
                unsigned int               processing:1;         /*   128: 3  4 */
      
                /* size: 136, cachelines: 3, members: 15 */
                /* padding: 4 */
                /* bit_padding: 28 bits */
                /* forced alignments: 1 */
                /* last cacheline: 8 bytes */
        } __attribute__((__aligned__(8)));
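
      The offsets and sizes in both listings can be double-checked with a
      small userspace sketch. The fake_* types are hypothetical opaque
      stand-ins sized to match the pahole output above, the four 1-bit flags
      are folded into a single unsigned int, and an LP64 target (8-byte
      pointers) is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64
#define PAGE_SIZE_4K 4096

/* Opaque stand-ins sized like the pahole output (release x86_64 kernel). */
struct fake_mutex   { uint64_t w[4]; };	/* 32 bytes */
struct fake_rb_node { uint64_t w[3]; };	/* 24 bytes */
struct fake_rb_root { uint64_t w[2]; };	/* 16 bytes */
struct fake_list    { uint64_t w[2]; };	/* 16 bytes */

struct head_old {
	uint64_t bytenr;
	uint64_t num_bytes;
	uint32_t refs;			/* followed by a 4 byte hole */
	struct fake_mutex mutex;
	uint32_t lock;			/* followed by a 4 byte hole */
	struct fake_rb_root ref_tree;
	struct fake_list ref_add_list;
	struct fake_rb_node href_node;
	void *extent_op;
	int total_ref_mod;
	int ref_mod;
	unsigned int flags;		/* the four 1-bit fields share this word */
};

struct head_new {
	uint64_t bytenr;
	uint64_t num_bytes;
	struct fake_rb_node href_node;
	struct fake_mutex mutex;
	uint32_t refs;
	uint32_t lock;
	struct fake_rb_root ref_tree;
	struct fake_list ref_add_list;
	void *extent_op;
	int total_ref_mod;
	int ref_mod;
	unsigned int flags;
};

/* Index of the cache line that contains a given byte offset. */
size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE;
}

/* How many objects of a given size fit in one 4K page. */
size_t heads_per_page(size_t size)
{
	assert(size > 0);
	return PAGE_SIZE_4K / size;
}
```

      With these stand-ins, 'href_node' moves from cache line 1 to cache line
      0, next to 'bytenr', and 4096 / 144 = 28 versus 4096 / 136 = 30 heads
      fit in a 4K page.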
      
      Running the following fs_mark test shows some significant improvement.
      
        $ cat test.sh
        #!/bin/bash
      
        # 15G null block device
        DEV=/dev/nullb0
        MNT=/mnt/nullb0
        FILES=100000
        THREADS=$(nproc --all)
        FILE_SIZE=0
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k"
        for ((i = 1; i <= $THREADS; i++)); do
            OPTS="$OPTS -d $MNT/d$i"
        done
      
        fs_mark $OPTS
      
        umount $MNT
      
      Before this change:
      
      FSUse%        Count         Size    Files/sec     App Overhead
          10      1200000            0     112631.3         11928055
          16      2400000            0     189943.8         12140777
          23      3600000            0     150719.2         13178480
          50      4800000            0      99137.3         12504293
          53      6000000            0     111733.9         12670836
      
                          Total files/sec: 664165.5
      
      After this change:
      
      FSUse%        Count         Size    Files/sec     App Overhead
          10      1200000            0     148589.5         11565889
          16      2400000            0     227743.8         11561596
          23      3600000            0     191590.5         12550755
          30      4800000            0     179812.3         12629610
          53      6000000            0      92471.4         12352383
      
                          Total files/sec: 840207.5
      
      Measuring the execution times of htree_insert(), in nanoseconds, during
      those fs_mark runs:
      
      Before this change:
      
        Range:  0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231
        Percentiles:  90th: 980.000; 95th: 1208.000; 99th: 2090.000
           0.000 -    6.384:       257 |
           6.384 -   26.259:       977 |
          26.259 -   99.635:      4963 |
          99.635 -  370.526:    136800 #############
         370.526 - 1370.603:    566110 #####################################################
        1370.603 - 5062.704:     24945 ##
        5062.704 - 18693.248:      944 |
        18693.248 - 69014.670:     211 |
        69014.670 - 254791.959:     30 |
        254791.959 - 940647.000:     4 |
      
      After this change:
      
        Range:  0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422
        Percentiles:  90th: 918.000; 95th: 1113.000; 99th: 1987.000
           0.000 -    5.585:      163 |
           5.585 -   20.678:      452 |
          20.678 -   70.369:     1806 |
          70.369 -  233.965:    26268 ####
         233.965 -  772.564:   333519 #####################################################
         772.564 - 2545.771:    91820 ###############
        2545.771 - 8383.615:     2238 |
        8383.615 - 27603.280:     170 |
        27603.280 - 90879.297:     68 |
        90879.297 - 299200.000:    12 |
      
      The mean, percentiles and maximum times are all better, and the standard
      deviation is lower.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      315dd5cc
    • Qu Wenruo's avatar
      btrfs: use the same uptodate variable for end_bio_extent_readpage() · 31dd8c81
      Qu Wenruo authored
      In function end_bio_extent_readpage() we call
      endio_readpage_release_extent() to unlock the extent io tree.
      
      However we pass PageUptodate(page) as @uptodate parameter for it, while
      for previous end_page_read() call, we use a dedicated @uptodate local
      variable.
      
      This is not a big deal, as even for subpage cases where the bio only
      covers part of the page, the @uptodate passed is then always false, and
      the subpage ranges can still be merged.
      
      But for the sake of consistency, always use @uptodate variable when
      possible.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31dd8c81
    • Qu Wenruo's avatar
      btrfs: subpage: make alloc_extent_buffer() handle previously uptodate range efficiently · 5a963419
      Qu Wenruo authored
      Currently alloc_extent_buffer() would make the extent buffer uptodate if
      the corresponding pages are also uptodate.
      
      But this check is only checking PageUptodate, which is fine for regular
      cases, but not for subpage cases, as we can have multiple extent buffers
      in the same page.
      
      So use btrfs_page_test_uptodate() here instead.
      
      The old code doesn't cause any problem, but it is not efficient, as it
      would cause extra metadata reads even if the range is already uptodate.
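
      The difference between the two checks can be sketched in userspace as
      below. Everything here is a hypothetical model, not the btrfs helpers:
      it assumes a 64K page with 4K sectors and one uptodate bit per sector,
      with the page-level flag treated as true only when every bit is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTORSIZE 4096u
#define SECTORS_PER_PAGE 16u	/* 64K page, 4K sectors: assumed values */

/* Hypothetical per-page state: one uptodate bit per sector. */
struct subpage_state {
	uint16_t uptodate;
};

/* Page-level view: true only when every sector in the page is uptodate. */
bool page_uptodate(const struct subpage_state *s)
{
	return s->uptodate == (uint16_t)((1u << SECTORS_PER_PAGE) - 1);
}

/* Range view: true when all sectors in [offset, offset + len) are uptodate. */
bool range_uptodate(const struct subpage_state *s, uint32_t offset, uint32_t len)
{
	uint32_t first = offset / SECTORSIZE;
	uint32_t nbits = len / SECTORSIZE;
	uint32_t mask;

	assert(nbits > 0 && first + nbits <= SECTORS_PER_PAGE);
	mask = ((1u << nbits) - 1) << first;
	return (s->uptodate & mask) == mask;
}
```

      With only one 16K extent buffer read into the page, the page-level check
      fails while the range check for that buffer succeeds, so relying on the
      page-level check alone would trigger a read that is not needed.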
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5a963419
    • David Sterba's avatar
      btrfs: print assertion failure report and stack trace from the same line · b831306b
      David Sterba authored
      Assertion reports are split into two parts: the exact file and location
      of the condition, and then the stack trace printed from
      btrfs_assertfail(). This means all the stack traces report the same line,
      and this is what's typically reported by various tools, making it harder
      to distinguish the reports.
      
        [403.2467] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4259
        [403.2479] ------------[ cut here ]------------
        [403.2484] kernel BUG at fs/btrfs/messages.c:259!
        [403.2488] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        [403.2493] CPU: 2 PID: 23202 Comm: umount Not tainted 6.2.0-rc4-default+ #67
        [403.2499] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
        [403.2509] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
        ...
        [403.2595] Call Trace:
        [403.2598]  <TASK>
        [403.2601]  btrfs_free_block_groups.cold+0x52/0xae [btrfs]
        [403.2608]  close_ctree+0x6c2/0x761 [btrfs]
        [403.2613]  ? __wait_for_common+0x2b8/0x360
        [403.2618]  ? btrfs_cleanup_one_transaction.cold+0x7a/0x7a [btrfs]
        [403.2626]  ? mark_held_locks+0x6b/0x90
        [403.2630]  ? lockdep_hardirqs_on_prepare+0x13d/0x200
        [403.2636]  ? __call_rcu_common.constprop.0+0x1ea/0x3d0
        [403.2642]  ? trace_hardirqs_on+0x2d/0x110
        [403.2646]  ? __call_rcu_common.constprop.0+0x1ea/0x3d0
        [403.2652]  generic_shutdown_super+0xb0/0x1c0
        [403.2657]  kill_anon_super+0x1e/0x40
        [403.2662]  btrfs_kill_super+0x25/0x30 [btrfs]
        [403.2668]  deactivate_locked_super+0x4c/0xc0
      
      By making btrfs_assertfail a macro we'll get the same line number for
      the BUG output:
      
        [63.5736] assertion failed: 0, in fs/btrfs/super.c:1572
        [63.5758] ------------[ cut here ]------------
        [63.5782] kernel BUG at fs/btrfs/super.c:1572!
        [63.5807] invalid opcode: 0000 [#2] PREEMPT SMP KASAN
        [63.5831] CPU: 0 PID: 859 Comm: mount Tainted: G      D            6.3.0-rc7-default+ #2062
        [63.5868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a-rebuilt.opensuse.org 04/01/2014
        [63.5905] RIP: 0010:btrfs_mount+0x24/0x30 [btrfs]
        [63.5964] RSP: 0018:ffff88800e69fcd8 EFLAGS: 00010246
        [63.5982] RAX: 000000000000002d RBX: ffff888008fc1400 RCX: 0000000000000000
        [63.6004] RDX: 0000000000000000 RSI: ffffffffb90fd868 RDI: ffffffffbcc3ff20
        [63.6026] RBP: ffffffffc081b200 R08: 0000000000000001 R09: ffff88800e69fa27
        [63.6046] R10: ffffed1001cd3f44 R11: 0000000000000001 R12: ffff888005a3c370
        [63.6062] R13: ffffffffc058e830 R14: 0000000000000000 R15: 00000000ffffffff
        [63.6081] FS:  00007f7b3561f800(0000) GS:ffff88806c600000(0000) knlGS:0000000000000000
        [63.6105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [63.6120] CR2: 00007fff83726e10 CR3: 0000000002a9e000 CR4: 00000000000006b0
        [63.6137] Call Trace:
        [63.6143]  <TASK>
        [63.6148]  legacy_get_tree+0x80/0xd0
        [63.6158]  vfs_get_tree+0x43/0x120
        [63.6166]  do_new_mount+0x1f3/0x3d0
        [63.6176]  ? do_add_mount+0x140/0x140
        [63.6187]  ? cap_capable+0xa4/0xe0
        [63.6197]  path_mount+0x223/0xc10
      
      This comes at a cost of bloating the final btrfs.ko module due to all the
      inlining, as long as assertions are compiled in. This is a must for
      debugging builds but this is often enabled on release builds too.
      
      Release build:
      
         text    data     bss     dec     hex filename
      1251676   20317   16088 1288081  13a791 pre/btrfs.ko
      1260612   29473   16088 1306173  13ee3d post/btrfs.ko
      
      DELTA: +8936
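
      Why a macro changes the reported line can be shown with a userspace
      sketch. FAKE_BUG() is a hypothetical stand-in for BUG() that just
      records the line where it is expanded; the assertfail_* and caller_*
      names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for BUG(): records the source line where it is expanded. */
#define FAKE_BUG(out) (*(out) = __LINE__)

/* Function flavor: the expansion site is always inside this helper. */
void assertfail_function(int *bug_line)
{
	assert(bug_line != NULL);
	FAKE_BUG(bug_line);
}

/* Macro flavor: the expansion site is wherever the caller used it. */
#define assertfail_macro(bug_line) FAKE_BUG(bug_line)

int caller_a_function(void)
{
	int line;

	assertfail_function(&line);
	return line;
}

int caller_b_function(void)
{
	int line;

	assertfail_function(&line);
	return line;
}

int caller_a_macro(void)
{
	int line;

	assertfail_macro(&line);
	return line;
}

int caller_b_macro(void)
{
	int line;

	assertfail_macro(&line);
	return line;
}
```

      Both function-based callers report the helper's line, while the
      macro-based callers each report their own. The price is that the body
      is duplicated at every call site, which is the size increase measured
      above.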
      
      CC: Josh Poimboeuf <jpoimboe@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b831306b
    • Qu Wenruo's avatar
      btrfs: subpage: dump extra subpage bitmaps for debug · 75258f20
      Qu Wenruo authored
      There is a bug report that assert_eb_page_uptodate() gets triggered for
      free space tree metadata.
      
      Without a proper dump of the subpage bitmaps it's much harder to debug.
      
      Thus this patch dumps all the subpage bitmaps (split into their own
      bitmaps) for easier debugging.
      
      The output would look like this:
      (Dumped after a tree block got read from disk)
      
        page:000000006e34bf49 refcount:4 mapcount:0 mapping:0000000067661ac4 index:0x1d1 pfn:0x110e9
        memcg:ffff0000d7d62000
        aops:btree_aops [btrfs] ino:1
        flags: 0x8000000000002002(referenced|private|zone=2)
        page_type: 0xffffffff()
        raw: 8000000000002002 0000000000000000 dead000000000122 ffff00000188bed0
        raw: 00000000000001d1 ffff0000c7992700 00000004ffffffff ffff0000d7d62000
        page dumped because: btrfs subpage dump
        BTRFS warning (device dm-1): start=30490624 len=16384 page=30474240 bitmaps: uptodate=4-7 error= dirty= writeback= ordered= checked=
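
      As a reading aid for the last line, the "uptodate=4-7" range follows
      from the start, len and page values if one bit covers 4K. The sketch
      below is a hypothetical userspace helper, not the kernel's dump code;
      the 4K sectorsize is an assumption that happens to match the sample.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed 4K sectorsize; the dump itself does not state it. */
#define SECTORSIZE 4096u

/* Format the sector bit range an extent buffer occupies within its page. */
int subpage_bit_range(uint64_t page_start, uint64_t start, uint32_t len,
		      char *buf, size_t buflen)
{
	uint64_t first, last;

	assert(start >= page_start && len >= SECTORSIZE);
	first = (start - page_start) / SECTORSIZE;
	last = first + len / SECTORSIZE - 1;
	return snprintf(buf, buflen, "%llu-%llu",
			(unsigned long long)first, (unsigned long long)last);
}
```

      For the sample above, (30490624 - 30474240) / 4096 = 4 and
      16384 / 4096 = 4 sectors, giving bits 4 through 7.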
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      75258f20
    • Tejun Heo's avatar
      btrfs: use alloc_ordered_workqueue() to create ordered workqueues · 58e814fc
      Tejun Heo authored
      BACKGROUND
      ==========
      
      When multiple work items are queued to a workqueue, their execution order
      isn't guaranteed to match the queueing order. They may get executed in any order and
      simultaneously. When fully serialized execution - one by one in the queueing
      order - is needed, an ordered workqueue should be used which can be created
      with alloc_ordered_workqueue().
      
      However, alloc_ordered_workqueue() was a later addition. Before it, an
      ordered workqueue could be obtained by creating an UNBOUND workqueue with
      @max_active==1. This originally was an implementation side-effect which was
      broken by 4c16bd32 ("workqueue: implement NUMA affinity for unbound
      workqueues"). Because there were users that depended on the ordered execution,
      5c0338c6 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered")
      made the workqueue allocation path implicitly promote UNBOUND workqueues w/
      @max_active==1 to ordered workqueues.
      
      While this has worked okay, overloading the UNBOUND allocation interface
      this way creates other issues. It's difficult to tell whether a given
      workqueue actually needs to be ordered and users that legitimately want a
      min concurrency level wq unexpectedly get an ordered one instead. With
      planned UNBOUND workqueue updates to improve execution locality and more
      prevalence of chiplet designs which can benefit from such improvements, this
      isn't a state we wanna be in forever.
      
      This patch series audits all call sites that create an UNBOUND workqueue w/
      @max_active==1 and converts them to alloc_ordered_workqueue() as necessary.
      
      BTRFS
      =====
      
      * fs_info->scrub_workers initialized in scrub_workers_get() was setting
        @max_active to 1 when @is_dev_replace is set and it seems that the
        workqueue actually needs to be ordered if @is_dev_replace. Update the code
        so that alloc_ordered_workqueue() is used if @is_dev_replace.
      
      * fs_info->discard_ctl.discard_workers initialized in
        btrfs_init_workqueues() was directly using alloc_workqueue() w/
        @max_active==1. Converted to alloc_ordered_workqueue().
      
      * fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in
        btrfs_queue_work() use btrfs's workqueue wrapper, btrfs_workqueue,
        which is allocated with btrfs_alloc_workqueue().
      
        btrfs_workqueue implements automatic @max_active adjustment which is
        disabled when the specified max limit is below a certain threshold, so
        calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered
        workqueue whose @max_active won't be changed as the auto-tuning is
        disabled.
      
        This is rather brittle in that nothing clearly indicates that the two
        workqueues should be ordered or btrfs_alloc_workqueue() must disable
        auto-tuning when @limit_active==1.
      
        This patch factors out the common btrfs_workqueue init code into
        btrfs_init_workqueue() and adds an explicit
        btrfs_alloc_ordered_workqueue(). The two workqueues are converted to use
        the new ordered allocation interface.
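
      The refactoring described in the last bullet can be modelled in
      userspace as below. This is not the btrfs or workqueue API: struct
      wq_model and all wq_model_*() functions are hypothetical, and the tuning
      rule is a placeholder. The sketch only shows a shared init helper plus
      an ordered allocator that states its intent explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model of a wrapper around a workqueue. */
struct wq_model {
	int limit_active;	/* upper bound on concurrency */
	int current_active;	/* concurrency currently allowed */
	bool ordered;		/* strict one-by-one, queueing-order execution */
	bool autotune;		/* may current_active be adjusted at runtime? */
};

/* Common init shared by both allocators. */
static void wq_model_init(struct wq_model *wq, int limit_active)
{
	assert(limit_active >= 1);
	wq->limit_active = limit_active;
	wq->current_active = 1;
	wq->ordered = false;
	wq->autotune = true;
}

struct wq_model *wq_model_alloc(int limit_active)
{
	struct wq_model *wq = malloc(sizeof(*wq));

	if (!wq)
		return NULL;
	wq_model_init(wq, limit_active);
	return wq;
}

/* Ordered variant: ordering is requested explicitly rather than implied. */
struct wq_model *wq_model_alloc_ordered(void)
{
	struct wq_model *wq = malloc(sizeof(*wq));

	if (!wq)
		return NULL;
	wq_model_init(wq, 1);
	wq->ordered = true;
	wq->autotune = false;
	return wq;
}

/* Placeholder tuning rule; it never touches an ordered queue. */
void wq_model_tune(struct wq_model *wq, int pending)
{
	if (!wq->autotune)
		return;
	wq->current_active = pending < wq->limit_active ?
			     pending : wq->limit_active;
	if (wq->current_active < 1)
		wq->current_active = 1;
}
```

      The point of the split is that ordering no longer depends on a side
      effect of passing a limit of 1 to the generic allocator.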
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
      58e814fc