Commits · 2c14f0ffdd30bd3d321ad5fe76fcf701746e1df6 · Kirill Smelkov / linux

19 Jun, 2023 40 commits

btrfs: fix fsverify read error handling in end_page_read · 2c14f0ff

Christoph Hellwig authored May 31, 2023

Also clear the uptodate bit to make sure the page isn't seen as uptodate
in the page cache if fsverity verification fails.

Fixes: 14605409 ("btrfs: initial fsverity support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2c14f0ff

btrfs: factor out a btrfs_verify_page helper · ed9ee98e

Christoph Hellwig authored May 31, 2023

Split all the conditionals for the fsverity calls in end_page_read into
a btrfs_verify_page helper to keep the code readable and make additional
refactoring easier.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ed9ee98e

btrfs: fix range_end calculation in extent_write_locked_range · 36614a3b

Christoph Hellwig authored May 31, 2023

The range_end field in struct writeback_control is inclusive, just like
the end parameter passed to extent_write_locked_range.  Not doing this
could cause extra writeout, which is harmless but suboptimal.

Fixes: 771ed689 ("Btrfs: Optimize compressed writeback and reads")
CC: stable@vger.kernel.org # 5.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

36614a3b

btrfs: insert tree mod log move in push_node_left · 5cead542

Boris Burkov authored Jun 01, 2023

There is a fairly unlikely race condition in tree mod log rewind that
can result in a kernel panic which has the following trace:

  [530.569] BTRFS critical (device sda3): unable to find logical 0 length 4096
  [530.585] BTRFS critical (device sda3): unable to find logical 0 length 4096
  [530.602] BUG: kernel NULL pointer dereference, address: 0000000000000002
  [530.618] #PF: supervisor read access in kernel mode
  [530.629] #PF: error_code(0x0000) - not-present page
  [530.641] PGD 0 P4D 0
  [530.647] Oops: 0000 [#1] SMP
  [530.654] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted: G S         O  K   5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1
  [530.680] Hardware name: Quanta Mono Lake-M.2 SATA 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017
  [530.703] RIP: 0010:__btrfs_map_block+0xaa/0xd00
  [530.755] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246
  [530.767] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX: f2702d8b8be15100
  [530.784] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI: ffff88885fd973c8
  [530.800] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09: 00000000fffeffff
  [530.816] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12: 0000000000000000
  [530.832] R13: 0000000000001000 R14: 0000000000001000 R15: ffffc9002c2f76f0
  [530.848] FS:  00007f38d64af000(0000) GS:ffff88885fd80000(0000) knlGS:0000000000000000
  [530.866] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [530.880] CR2: 0000000000000002 CR3: 00000002b6770004 CR4: 00000000003706e0
  [530.896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [530.912] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [530.928] Call Trace:
  [530.934]  ? btrfs_printk+0x13b/0x18c
  [530.943]  ? btrfs_bio_counter_inc_blocked+0x3d/0x130
  [530.955]  btrfs_map_bio+0x75/0x330
  [530.963]  ? kmem_cache_alloc+0x12a/0x2d0
  [530.973]  ? btrfs_submit_metadata_bio+0x63/0x100
  [530.984]  btrfs_submit_metadata_bio+0xa4/0x100
  [530.995]  submit_extent_page+0x30f/0x360
  [531.004]  read_extent_buffer_pages+0x49e/0x6d0
  [531.015]  ? submit_extent_page+0x360/0x360
  [531.025]  btree_read_extent_buffer_pages+0x5f/0x150
  [531.037]  read_tree_block+0x37/0x60
  [531.046]  read_block_for_search+0x18b/0x410
  [531.056]  btrfs_search_old_slot+0x198/0x2f0
  [531.066]  resolve_indirect_ref+0xfe/0x6f0
  [531.076]  ? ulist_alloc+0x31/0x60
  [531.084]  ? kmem_cache_alloc_trace+0x12e/0x2b0
  [531.095]  find_parent_nodes+0x720/0x1830
  [531.105]  ? ulist_alloc+0x10/0x60
  [531.113]  iterate_extent_inodes+0xea/0x370
  [531.123]  ? btrfs_previous_extent_item+0x8f/0x110
  [531.134]  ? btrfs_search_path_in_tree+0x240/0x240
  [531.146]  iterate_inodes_from_logical+0x98/0xd0
  [531.157]  ? btrfs_search_path_in_tree+0x240/0x240
  [531.168]  btrfs_ioctl_logical_to_ino+0xd9/0x180
  [531.179]  btrfs_ioctl+0xe2/0x2eb0

This occurs when logical inode resolution takes a tree mod log sequence
number, and then while backref walking hits a rewind on a busy node
which has the following sequence of tree mod log operations (numbers
filled in from a specific example, but they are somewhat arbitrary)

  REMOVE_WHILE_FREEING slot 532
  REMOVE_WHILE_FREEING slot 531
  REMOVE_WHILE_FREEING slot 530
  ...
  REMOVE_WHILE_FREEING slot 0
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0
  ADD slot 455
  ADD slot 454
  ADD slot 453
  ...
  ADD slot 0
  MOVE src slot 0 -> dst slot 456 nritems 533
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0

When this sequence gets applied via btrfs_tree_mod_log_rewind, it
allocates a fresh rewind eb, and first inserts the correct key info for
the 533 elements, then overwrites the first 456 of them, then decrements
the count by 456 via the add ops, then rewinds the move by doing a
memmove from 456:988->0:532. We have never written anything past 532, so
that memmove writes garbage into the 0:532 range. In practice, this
results in a lot of fully 0 keys. The rewind then puts valid keys into
slots 0:455 with the last removes, but 456:532 are still invalid.

When search_old_slot uses this eb, if it uses one of those invalid
slots, it can then read the extent buffer and issue a bio for offset 0
which ultimately panics looking up extent mappings.

This bad tree mod log sequence gets generated when the node balancing
code happens to do a balance_node_right followed by a push_node_left
while logging in the tree mod log. Illustrated for ebs L and R (left and
right):

	L                 R
  start:
  [XXX|YYY|...]      [ZZZ|...|...]
  balance_node_right:
  [XXX|YYY|...]      [...|ZZZ|...] move Z to make room for Y
  [XXX|...|...]      [YYY|ZZZ|...] copy Y from L to R
  push_node_left:
  [XXX|YYY|...]      [...|ZZZ|...] copy Y from R to L
  [XXX|YYY|...]      [ZZZ|...|...] move Z into emptied space (NOT LOGGED!)

This is because balance_node_right logs a move, but push_node_left
explicitly doesn't. That is because logging the move would remove the
overwritten src < dst range in the right eb, which was already logged
when we called btrfs_tree_mod_log_eb_copy. The correct sequence would
include a move from 456:988 to 0:532 after remove 0:455 and before
removing 0:532. Reversing that sequence would entail creating keys for
0:532, then moving those keys out to 456:988, then creating more keys
for 0:455.

i.e.,

  REMOVE_WHILE_FREEING slot 532
  REMOVE_WHILE_FREEING slot 531
  REMOVE_WHILE_FREEING slot 530
  ...
  REMOVE_WHILE_FREEING slot 0
  MOVE src slot 456 -> dst slot 0 nritems 533
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0
  ADD slot 455
  ADD slot 454
  ADD slot 453
  ...
  ADD slot 0
  MOVE src slot 0 -> dst slot 456 nritems 533
  REMOVE slot 455
  REMOVE slot 454
  REMOVE slot 453
  ...
  REMOVE slot 0

Fix this to log the move but avoid the double remove by putting all the
logging logic in btrfs_tree_mod_log_eb_copy which has enough information
to detect these cases and properly log moves, removes, and adds. Leave
btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's
tree mod logging.

(Un)fortunately, this is quite difficult to reproduce, and I was only
able to reproduce it by adding sleeps in btrfs_search_old_slot that
would encourage more log rewinding during ino_to_logical ioctls. I was
able to hit the warning in the previous patch in the series without the
fix quite quickly, but not after this patch.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

5cead542

btrfs: warn on invalid slot in tree mod log rewind · 95c8e349

Boris Burkov authored Jun 01, 2023

The way that tree mod log tracks the ultimate length of the eb, the
variable 'n', eventually turns up the correct value, but at intermediate
steps during the rewind, n can be inaccurate as a representation of the
end of the eb. For example, it doesn't get updated on move rewinds, and
it does get updated for add/remove in the middle of the eb.

To detect cases with invalid moves, introduce a separate variable called
max_slot which tries to track the maximum valid slot in the rewind eb.
We can then warn if we do a move whose src range goes beyond the max
valid slot.

There is a commented caveat that it is possible to have this value be an
overestimate due to the challenge of properly handling 'add' operations
in the middle of the eb, but in practice it doesn't cause enough of a
problem to throw out the max idea in favor of tracking every valid slot.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

95c8e349

btrfs: disable allocation warnings for compression workspaces · 8ab546bb

David Sterba authored May 22, 2023

The workspaces for compression are typically much larger than a page and
for high zstd levels in the range of megabytes. There's a fallback to
vmalloc but this can still fail (see the report).

Some of the workspaces are preallocated at module load time so we have a
safe fallback, otherwise when a new workspace is needed it's allocated
but if this fails then the process waits. Which means the warning is
only causing noise and we can use the GFP flag to disable it.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217466Signed-off-by: David Sterba <dsterba@suse.com>

8ab546bb

btrfs: open code need_full_stripe conditions · 8680e587

Christoph Hellwig authored May 31, 2023

need_full_stripe is just a somewhat complicated way to say
"op != BTRFS_MAP_READ".  Just spell that explicit check out, which makes
a lot of the code currently using the helper easier to understand.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8680e587

btrfs: open code btrfs_map_sblock · 723b8bb1

Christoph Hellwig authored May 31, 2023

btrfs_map_sblock just hard codes three arguments and calls
btrfs_map_sblock.  Remove it as it doesn't provide any real value, but
makes following the btrfs_map_block call chains harder.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

723b8bb1

btrfs: rename __btrfs_map_block to btrfs_map_block · cd4efd21

Christoph Hellwig authored May 31, 2023

Now that the old btrfs_map_block is gone, drop the leading underscores
from __btrfs_map_block.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cd4efd21

btrfs: remove unused btrfs_map_block · d69d7ffc

Christoph Hellwig authored May 31, 2023

There are no users of btrfs_map_block left, so remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d69d7ffc

btrfs: optimize simple reads in btrfsic_map_block · 78a213a0

Christoph Hellwig authored May 31, 2023

Pass a smap into __btrfs_map_block so that the usual case of a read that
doesn't require parity raid recovery doesn't need an extra memory
allocation for the btrfs_io_context.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

78a213a0

btrfs: remove unused BTRFS_MAP_DISCARD · 3965a4c7

Christoph Hellwig authored May 31, 2023

BTRFS_MAP_DISCARD is never set, as REQ_OP_DISCARD is never passed to
btrfs_op() only only checked in two ASSERTS.

Remove it and let the catchall WARN_ON in btrfs_op() deal with accidental
REQ_OP_DISCARDs leaked into btrfs_op(). Last use was in a4012f06
("btrfs: split discard handling out of btrfs_map_block").
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3965a4c7

btrfs: add xxhash to fast checksum implementations · efcfcbc6

David Sterba authored Apr 04, 2023

The implementation of XXHASH is now CPU only but still fast enough to be
considered for the synchronous checksumming, like non-generic crc32c.

A userspace benchmark comparing it to various implementations (patched
hash-speedtest from btrfs-progs):

  Block size:     4096
  Iterations:     1000000
  Implementation: builtin
  Units:          CPU cycles

	NULL-NOP: cycles:     73384294, cycles/i       73
     NULL-MEMCPY: cycles:    228033868, cycles/i      228,    61664.320 MiB/s
      CRC32C-ref: cycles:  24758559416, cycles/i    24758,      567.950 MiB/s
       CRC32C-NI: cycles:   1194350470, cycles/i     1194,    11773.433 MiB/s
  CRC32C-ADLERSW: cycles:   6150186216, cycles/i     6150,     2286.372 MiB/s
  CRC32C-ADLERHW: cycles:    626979180, cycles/i      626,    22427.453 MiB/s
      CRC32C-PCL: cycles:    466746732, cycles/i      466,    30126.699 MiB/s
	  XXHASH: cycles:    860656400, cycles/i      860,    16338.188 MiB/s

Comparing purely software implementation (ref), current outdated
accelerated using crc32q instruction (NI), optimized implementations by
M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775)
and the best one that was taken from kernel using the PCLMULQDQ
instruction (PCL).
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

efcfcbc6

btrfs: pass the new logical address to split_extent_map · f000bc6f

Christoph Hellwig authored May 24, 2023

split_extent_map splits off the first chunk of an extent map into a new
one. One of the two users is the zoned I/O completion code that wants to
rewrite the logical block start address right after this split. Pass in
the logical address to be set in the split off first extent_map as an
argument to avoid an extra extent tree lookup for this case.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

f000bc6f

btrfs: defer splitting of ordered extents until I/O completion · 71df088c

Christoph Hellwig authored May 24, 2023

The btrfs zoned completion code currently needs an ordered_extent and
extent_map per bio so that it can account for the non-predictable
write location from Zone Append.  To archive that it currently splits
the ordered_extent and extent_map at I/O submission time, and then
records the actual physical address in the ->physical field of the
ordered_extent.

This patch instead switches to record the "original" physical address
that the btrfs allocator assigned in spare space in the btrfs_bio,
and then rewrites the logical address in the btrfs_ordered_sum
structure at I/O completion time.  This allows the ordered extent
completion handler to simply walk the list of ordered csums and
split the ordered extent as needed.  This removes an extra ordered
extent and extent_map lookup and manipulation during the I/O
submission path, and instead batches it in the I/O completion path
where we need to touch these anyway.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

71df088c

btrfs: handle completed ordered extents in btrfs_split_ordered_extent · 52b1fdca

Christoph Hellwig authored May 24, 2023

To delay splitting ordered_extents to I/O completion time we need to be
able to handle fully completed ordered extents in
btrfs_split_ordered_extent.  Besides a bit of accounting this primarily
involved moving over the csums to the split bio for the range that it
covers, which is simple enough because we always have one
btrfs_ordered_sum per bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

52b1fdca

btrfs: atomically insert the new extent in btrfs_split_ordered_extent · 816f589b

Christoph Hellwig authored May 24, 2023

Currently there is a small race window in btrfs_split_ordered_extent,
where the reduced old extent can be looked up on the per-inode rbtree
or the per-root list while the newly split out one isn't visible yet.

Fix this by open coding btrfs_alloc_ordered_extent in
btrfs_split_ordered_extent, and holding the tree lock and
root->ordered_extent_lock over the entire tree and extent manipulation.

Note that this introduces new lock ordering because previously
ordered_extent_lock was never held over the tree lock.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

816f589b

btrfs: split btrfs_alloc_ordered_extent to allocation and insertion helpers · 53d9981c

Christoph Hellwig authored May 24, 2023

Split two low-level helpers out of btrfs_alloc_ordered_extent to allocate
and insert the logic extent.  The pure alloc helper will be used to
improve btrfs_split_ordered_extent.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

53d9981c

btrfs: return the new ordered_extent from btrfs_split_ordered_extent · b0307e28

Christoph Hellwig authored May 24, 2023

Return the ordered_extent split from the passed in one.  This will be
needed to be able to store an ordered_extent in the btrfs_bio.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b0307e28

btrfs: reorder conditions in btrfs_extract_ordered_extent · ebdb44a0

Christoph Hellwig authored May 24, 2023

There is no good reason for doing one before the other in terms of
failure implications, but doing the extent_map split first will
simplify some upcoming refactoring.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ebdb44a0

btrfs: move split_extent_map to extent_map.c · a6f3e205

Christoph Hellwig authored May 24, 2023

split_extent_map doesn't have anything to do with the other code in
inode.c, so move it to extent_map.c.

This also allows marking replace_extent_mapping static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

a6f3e205

btrfs: record orig_physical only for the original bio · 3887653c

Christoph Hellwig authored Jun 09, 2023

btrfs_submit_dev_bio is also called for clone bios that aren't embedded
into a btrfs_bio structure, but previous commit "btrfs: optimize the
logical to physical mapping for zoned writes" added code to assign
btrfs_bio.orig_physical in it.

This is harmless right now as only the single data profile can be used
on zoned devices, but will blow up when the RAID stripe tree is added.
Move it out into the single I/O specific branch in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

3887653c

btrfs: optimize the logical to physical mapping for zoned writes · cbfce4c7

Christoph Hellwig authored May 24, 2023

The current code to store the final logical to physical mapping for a
zone append write in the extent tree is rather inefficient. It first has
to split the ordered extent so that there is one ordered extent per bio,
so that it can look up the ordered extent on I/O completion in
btrfs_record_physical_zoned and store the physical LBA returned by the
block driver in the ordered extent.

btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
see what physical address the logical address for this bio / ordered
extent is mapped to, and then rewrite it in the extent tree.

To optimize this process, we can store the physical address assigned in
the chunk tree to the original logical address and a pointer to
btrfs_ordered_sum structure the in the btrfs_bio structure, and then use
this information to rewrite the logical address in the btrfs_ordered_sum
structure directly at I/O completion time in btrfs_record_physical_zoned.
btrfs_rewrite_logical_zoned then simply updates the logical address in
the extent tree and the ordered_extent itself.

The code in btrfs_rewrite_logical_zoned now runs for all data I/O
completions in zoned file systems, which is fine as there is no remapping
to do for non-append writes to conventional zones or for relocation, and
the overhead for quickly breaking out of the loop is very low.

Because zoned file systems now need the ordered_sums structure to
record the actual write location returned by zone append, allocate dummy
structures without the csum array for them when the I/O doesn't use
checksums, and free them when completing the ordered_extent.

Note that the btrfs_bio doesn't grow as the new field are places into
a union that is so far not used for data writes and has plenty of space
left in it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cbfce4c7

btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical · 5cfe76f8

Christoph Hellwig authored May 24, 2023

btrfs_ordered_sum::bytendr stores a logical address.  Make that clear by
renaming it to ->logical.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5cfe76f8

btrfs: mark the len field in struct btrfs_ordered_sum as unsigned · 6e4b2479

Christoph Hellwig authored May 24, 2023

len can't ever be negative, so mark it as an u32 instead of int.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

6e4b2479

btrfs: don't call btrfs_record_physical_zoned for failed append · e9cb93b9

Christoph Hellwig authored May 24, 2023

When a zoned append command fails there is no written address reported,
so don't try to record it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e9cb93b9

btrfs: optimize out btrfs_is_zoned for !CONFIG_BLK_DEV_ZONED · dd8b7b04

Christoph Hellwig authored May 24, 2023

Add an IS_ENABLED check for CONFIG_BLK_DEV_ZONED in addition to the
run-time check for the zone size.  This will allow to make use of
compiler dead code elimination for code guarded by btrfs_is_zoned, and
for example provide just a dangling prototype for a function instead of
adding a stub.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

dd8b7b04

btrfs: make btrfs_destroy_delayed_refs() return void · 99f09ce3

Filipe Manana authored Jun 02, 2023

btrfs_destroy_delayed_refs() always returns 0 and its single caller does
not check its return value, as it also returns void, and so does the
callers' caller and so on. This is because we are in the transaction abort
path, where we have no way to deal with errors (we are in a critical
situation) and all cleanup of resources works in a best effort fashion.
So make btrfs_destroy_delayed_refs() return void.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

99f09ce3

btrfs: remove unnecessary prototype declarations at disk-io.c · 184533e3

Filipe Manana authored May 29, 2023

We have a few static functions at disk-io.c for which we have a forward
declaration of their prototype, but it's not needed because all those
functions are defined before they are called, so remove them.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

184533e3

btrfs: use a single switch statement when initializing delayed ref head · f1ed785a

Filipe Manana authored May 29, 2023

At init_delayed_ref_head(), we are using two separate if statements to
check the delayed ref head action, and initializing 'must_insert_reserved'
to false twice, once when the variable is declared and once again in an
else branch.

Make this simpler and more straightforward by having a single switch
statement, also moving the comment about a drop action to the
corresponding switch case to make it more clear and eliminating the
duplicated initialization of 'must_insert_reserved' to false.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f1ed785a

btrfs: use bool type for delayed ref head fields that are used as booleans · 61c681fe

Filipe Manana authored May 29, 2023

There's no point in have several fields defined as 1 bit unsigned int in
struct btrfs_delayed_ref_head, we can instead use a bool type, it makes
the code a bit more readable and it doesn't change the structure size.
So switch them to proper booleans.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

61c681fe

btrfs: assert correct lock is held at btrfs_select_ref_head() · 1e6b71c3

Filipe Manana authored May 29, 2023

The function btrfs_select_ref_head() iterates over the red black tree of
delayed reference heads, which is protected by the spinlock in the delayed
refs root. The function doesn't take the lock, it's taken by its single
caller, btrfs_obtain_ref_head(), because it needs to call that function
and btrfs_delayed_ref_lock() in the same critical section (delimited by
that spinlock). So assert at btrfs_select_ref_head() that we are holding
the expected lock.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1e6b71c3

btrfs: get rid of label and goto at insert_delayed_ref() · 798f4d95

Filipe Manana authored May 29, 2023

At insert_delayed_ref() there's no point of having a label and goto in the
case we were able to insert the delayed ref head. We can just add the code
under label to the if statement's body and return immediately, and also
there is no need to track the return value in a variable, we can just
return a literal true or false value directly. So do those changes.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

798f4d95

btrfs: make insert_delayed_ref() return a bool instead of an int · f38462c4

Filipe Manana authored May 29, 2023

insert_delayed_ref() can only return 0 or 1, to indicate if the given
delayed reference was added to the head reference or if it was merged
into an existing delayed ref, respectively. So just make it return a
boolean instead.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

f38462c4

btrfs: use a bool to track qgroup record insertion when adding ref head · 293f8197

Filipe Manana authored May 29, 2023

We are using an integer as a boolean to track the qgroup record insertion
status when adding a delayed reference head. Since all we need is a
boolean, switch the type from int to bool to make it more obvious.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

293f8197

btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node · 4d34ad34

Filipe Manana authored May 29, 2023

The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node,
as we can check whether a reference is in the tree or not simply by
checking its red black tree node member with RB_EMPTY_NODE(), as when we
remove it from the tree we always call RB_CLEAR_NODE(). So remove that
field and use RB_EMPTY_NODE().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4d34ad34

btrfs: remove unused is_head field from struct btrfs_delayed_ref_node · 53499d5f

Filipe Manana authored May 29, 2023

The 'is_head' field of struct btrfs_delayed_ref_node is no longer after
commit d278850e ("btrfs: remove delayed_ref_node from ref_head"),
so remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

53499d5f

btrfs: reorder some members of struct btrfs_delayed_ref_head · 315dd5cc

Filipe Manana authored May 29, 2023

Currently struct delayed_ref_head has its 'bytenr' and 'href_node' members
in different cache lines (even on a release, non-debug, kernel). This is
not optimal because when iterating the red black tree of delayed ref heads
for inserting a new delayed ref head (htree_insert()) we have to pull in 2
cache lines of delayed ref heads we find in a patch, one for the tree node
(struct rb_node) and another one for the 'bytenr' field. The same applies
when searching for an existing delayed ref head (find_ref_head()).
On a release (non-debug) kernel, the structure also has two 4 bytes holes,
which makes it 8 bytes longer than necessary. Its current layout is the
following:

  struct btrfs_delayed_ref_head {
          u64                        bytenr;               /*     0     8 */
          u64                        num_bytes;            /*     8     8 */
          refcount_t                 refs;                 /*    16     4 */

          /* XXX 4 bytes hole, try to pack */

          struct mutex               mutex;                /*    24    32 */
          spinlock_t                 lock;                 /*    56     4 */

          /* XXX 4 bytes hole, try to pack */

          /* --- cacheline 1 boundary (64 bytes) --- */
          struct rb_root_cached      ref_tree;             /*    64    16 */
          struct list_head           ref_add_list;         /*    80    16 */
          struct rb_node             href_node __attribute__((__aligned__(8))); /*    96    24 */
          struct btrfs_delayed_extent_op * extent_op;      /*   120     8 */
          /* --- cacheline 2 boundary (128 bytes) --- */
          int                        total_ref_mod;        /*   128     4 */
          int                        ref_mod;              /*   132     4 */
          unsigned int               must_insert_reserved:1; /*   136: 0  4 */
          unsigned int               is_data:1;            /*   136: 1  4 */
          unsigned int               is_system:1;          /*   136: 2  4 */
          unsigned int               processing:1;         /*   136: 3  4 */

          /* size: 144, cachelines: 3, members: 15 */
          /* sum members: 128, holes: 2, sum holes: 8 */
          /* sum bitfield members: 4 bits (0 bytes) */
          /* padding: 4 */
          /* bit_padding: 28 bits */
          /* forced alignments: 1 */
          /* last cacheline: 16 bytes */
  } __attribute__((__aligned__(8)));

This change reorders the 'href_node' and 'refs' members so that we have
the 'href_node' in the same cache line as the 'bytenr' field, while also
eliminating the two holes and reducing the structure size from 144 bytes
down to 136 bytes, so we can now have 30 ref heads per 4K page (on x86_64)
instead of 28. The new structure layout after this change is now:

  struct btrfs_delayed_ref_head {
          u64                        bytenr;               /*     0     8 */
          u64                        num_bytes;            /*     8     8 */
          struct rb_node             href_node __attribute__((__aligned__(8))); /*    16    24 */
          struct mutex               mutex;                /*    40    32 */
          /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
          refcount_t                 refs;                 /*    72     4 */
          spinlock_t                 lock;                 /*    76     4 */
          struct rb_root_cached      ref_tree;             /*    80    16 */
          struct list_head           ref_add_list;         /*    96    16 */
          struct btrfs_delayed_extent_op * extent_op;      /*   112     8 */
          int                        total_ref_mod;        /*   120     4 */
          int                        ref_mod;              /*   124     4 */
          /* --- cacheline 2 boundary (128 bytes) --- */
          unsigned int               must_insert_reserved:1; /*   128: 0  4 */
          unsigned int               is_data:1;            /*   128: 1  4 */
          unsigned int               is_system:1;          /*   128: 2  4 */
          unsigned int               processing:1;         /*   128: 3  4 */

          /* size: 136, cachelines: 3, members: 15 */
          /* padding: 4 */
          /* bit_padding: 28 bits */
          /* forced alignments: 1 */
          /* last cacheline: 8 bytes */
  } __attribute__((__aligned__(8)));

Running the following fs_mark test shows some significant improvement.

  $ cat test.sh
  #!/bin/bash

  # 15G null block device
  DEV=/dev/nullb0
  MNT=/mnt/nullb0
  FILES=100000
  THREADS=$(nproc --all)
  FILE_SIZE=0

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  mkfs.btrfs -f $DEV
  mount -o ssd $DEV $MNT

  OPTS="-S 0 -L 5 -n $FILES -s $FILE_SIZE -t $THREADS -k"
  for ((i = 1; i <= $THREADS; i++)); do
      OPTS="$OPTS -d $MNT/d$i"
  done

  fs_mark $OPTS

  umount $MNT

Before this change:

FSUse%        Count         Size    Files/sec     App Overhead
    10      1200000            0     112631.3         11928055
    16      2400000            0     189943.8         12140777
    23      3600000            0     150719.2         13178480
    50      4800000            0      99137.3         12504293
    53      6000000            0     111733.9         12670836

                    Total files/sec: 664165.5

After this change:

FSUse%        Count         Size    Files/sec     App Overhead
    10      1200000            0     148589.5         11565889
    16      2400000            0     227743.8         11561596
    23      3600000            0     191590.5         12550755
    30      4800000            0     179812.3         12629610
    53      6000000            0      92471.4         12352383

                    Total files/sec: 840207.5

Measuring the execution times of htree_insert(), in nanoseconds, during
those fs_mark runs:

Before this change:

  Range:  0.000 - 940647.000; Mean: 619.733; Median: 548.000; Stddev: 1834.231
  Percentiles:  90th: 980.000; 95th: 1208.000; 99th: 2090.000
     0.000 -    6.384:       257 |
     6.384 -   26.259:       977 |
    26.259 -   99.635:      4963 |
    99.635 -  370.526:    136800 #############
   370.526 - 1370.603:    566110 #####################################################
  1370.603 - 5062.704:     24945 ##
  5062.704 - 18693.248:      944 |
  18693.248 - 69014.670:     211 |
  69014.670 - 254791.959:     30 |
  254791.959 - 940647.000:     4 |

After this change:

  Range:  0.000 - 299200.000; Mean: 587.754; Median: 542.000; Stddev: 1030.422
  Percentiles:  90th: 918.000; 95th: 1113.000; 99th: 1987.000
     0.000 -    5.585:      163 |
     5.585 -   20.678:      452 |
    20.678 -   70.369:     1806 |
    70.369 -  233.965:    26268 ####
   233.965 -  772.564:   333519 #####################################################
   772.564 - 2545.771:    91820 ###############
  2545.771 - 8383.615:     2238 |
  8383.615 - 27603.280:     170 |
  27603.280 - 90879.297:     68 |
  90879.297 - 299200.000:    12 |

Mean, percentiles, maximum times are all better, as well as a lower
standard deviation.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

315dd5cc

btrfs: use the same uptodate variable for end_bio_extent_readpage() · 31dd8c81

Qu Wenruo authored May 30, 2023

In function end_bio_extent_readpage() we call
endio_readpage_release_extent() to unlock the extent io tree.

However we pass PageUptodate(page) as @uptodate parameter for it, while
for previous end_page_read() call, we use a dedicated @uptodate local
variable.

This is not a big deal, as even for subpage cases, either the bio only
covers part of the page, then the @uptodate is always false, and the
subpage ranges can still be merged.

But for the sake of consistency, always use @uptodate variable when
possible.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

31dd8c81

btrfs: subpage: make alloc_extent_buffer() handle previously uptodate range efficiently · 5a963419

Qu Wenruo authored May 30, 2023

Currently alloc_extent_buffer() would make the extent buffer uptodate if
the corresponding pages are also uptodate.

But this check is only checking PageUptodate, which is fine for regular
cases, but not for subpage cases, as we can have multiple extent buffers
in the same page.

So here we go btrfs_page_test_uptodate() instead.

The old code doesn't cause any problem, but is not efficient, as it
would cause extra metadata read even if the range is already uptodate.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5a963419