Commits · 221581e485751811db3b43148177be2a0bdb2f3c · Kirill Smelkov / linux

19 Apr, 2021 40 commits

btrfs: handle btrfs_record_root_in_trans failure in create_subvol · 221581e4

Josef Bacik authored Mar 12, 2021

btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in create_subvol.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

221581e4

btrfs: handle btrfs_record_root_in_trans failure in btrfs_recover_log_trees · 2002ae11

Josef Bacik authored Mar 12, 2021

btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_recover_log_trees.

This appears tricky, however we have a reference count on the
destination root, so if this fails we need to continue on in the loop to
make sure the proper cleanup is done.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>

2002ae11

btrfs: handle btrfs_record_root_in_trans failure in btrfs_delete_subvolume · 2731f518

Josef Bacik authored Mar 12, 2021

btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_delete_subvolume.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2731f518

btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename · b0fec6fd

Josef Bacik authored Mar 12, 2021

btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_rename.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b0fec6fd

btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename_exchange · 00aa8e87

Josef Bacik authored Mar 12, 2021

btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_rename_exchange.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

00aa8e87

btrfs: do proper error handling in record_reloc_root_in_trans · 404bccbc

Josef Bacik authored Mar 12, 2021

Generally speaking this shouldn't ever fail, the corresponding fs root
for the reloc root will already be in memory, so we won't get ENOMEM
here.

However if there is no corresponding root for the reloc root then we
could get ENOMEM when we try to allocate it or we could get ENOENT
when we look it up and see that it doesn't exist.

Convert these BUG_ON()'s into ASSERT()'s and add proper error handling
for the case of corruption.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

404bccbc

btrfs: check record_root_in_trans related failures in select_reloc_root · 92de551b

Josef Bacik authored Mar 12, 2021

We will record the fs root or the reloc root in the trans in
select_reloc_root.  These will actually return errors in the following
patches, so check their return value here and return it up the stack.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

92de551b

btrfs: convert BUG_ON()'s in select_reloc_root() to proper errors · 8ee66afe

Josef Bacik authored Mar 12, 2021

We have several BUG_ON()'s in select_reloc_root() that can be tripped if
there is an extent tree corruption.  Convert these to ASSERT()'s, because
if we hit it during testing it really is bad, or could indicate a
problem with the backref walking code.

However if users hit these problems it generally indicates corruption,
I've hit a few machines in the fleet that trip over these with clearly
corrupted extent trees, so be nice and print out an error message and
return an error instead of bringing the whole box down.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

8ee66afe

btrfs: handle errors from select_reloc_root() · cbdc2ebc

Josef Bacik authored Mar 12, 2021

Currently select_reloc_root() doesn't return an error, but followup
patches will make it possible for it to return an error.  We do have
proper error recovery in do_relocation however, so handle the
possibility of select_reloc_root() having an error properly instead of
BUG_ON(!root).

I've also adjusted select_reloc_root() to return ERR_PTR(-ENOENT) if we
don't find a root, instead of NULL, to make the error case easier to
deal with.  I've replaced the BUG_ON(!root) with an ASSERT(0) for this
case as it indicates we messed up the backref walking code, but it could
also indicate corruption.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

cbdc2ebc

btrfs: convert BUG_ON()'s in relocate_tree_block · 1c7bfa15

Josef Bacik authored Mar 12, 2021

We have a couple of BUG_ON()'s in relocate_tree_block() that can be
tripped if we have file system corruption.  Convert these to ASSERT()'s
so developers still get yelled at when they break the backref code, but
error out nicely for users so the whole box doesn't go down.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1c7bfa15

btrfs: convert some BUG_ON()'s to ASSERT()'s in do_relocation · ffe30dd8

Josef Bacik authored Mar 12, 2021

A few of these are checking for correctness, and won't be triggered by
corrupted file systems, so convert them to ASSERT() instead of BUG_ON()
and add a comment explaining their existence.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ffe30dd8

btrfs: add and use readahead_batch_length · 32c0a6bc

Matthew Wilcox (Oracle) authored Mar 21, 2021

Implement readahead_batch_length() to determine the number of bytes in
the current batch of readahead pages and use it in btrfs. Also use the
readahead_pos to get the offset.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

32c0a6bc

btrfs: move forward declarations to the beginning of extent_io.h · 183ebab7

Wan Jiabing authored Apr 01, 2021

There are two forward declarations deep in extent_io.h, move them
to the beginning and remove the duplicate one.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

183ebab7

btrfs: subpage: add overview comments · 894d1378

Qu Wenruo authored Mar 25, 2021

This patch adds an overview how btrfs subpage support works:

- limitations
- behavior
- basic implementation points
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

894d1378

btrfs: make set_btree_ioerr accept extent buffer and be subpage compatible · 5a2c6075

Qu Wenruo authored Mar 25, 2021

Current set_btree_ioerr() only accepts @page parameter and grabs extent
buffer from page::private.  This works fine for sector size == PAGE_SIZE
case, but not for subpage case.

Add an extra parameter, @eb, for callers to pass extent buffer to this
function, so that subpage code can reuse this function.

And also add subpage special handling to update
btrfs_subpage::error_bitmap.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

5a2c6075

btrfs: make set/clear_extent_buffer_dirty() subpage compatible · 0d27797e

Qu Wenruo authored Mar 25, 2021

For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.

For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.

This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.

Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.

But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.

[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:

   T1 (eb1 in the same page)	|  T2 (eb2 in the same page)
 -------------------------------+------------------------------
 set_extent_buffer_dirty()	| clear_extent_buffer_dirty()
 |- was_dirty = false;		| |- clear_subpagE_extent_buffer_dirty()
 |				|    |- btrfs_clear_and_test_dirty()
 |				|    |  Since eb2 is the last dirty page
 |				|    |  we got:
 |				|    |  last == true;
 |				|    |
 |- btrfs_page_set_dirty()	|    |
 |  We set the page dirty and   |    |
 |  subpage dirty bitmap	|    |
 |				|    |- if (last)
 |				|    |  Since we don't have subpage lock
 |				|    |  held, now @last is no longer
 |				|    |  correct
 |				|    |- btree_clear_page_dirty()
 |				|	Now PageDirty == false, even if
 |				|       we have dirty_bitmap not zero.
 |- ASSERT(PageDirty());	|
    ^^^^ CRASH

The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0d27797e

btrfs: support page uptodate assertions in subpage mode · b8f95771

Qu Wenruo authored Mar 25, 2021

There are quite some assert checks on page uptodate in extent buffer
write accessors.  They ensure the destination page is already uptodate.

This is fine for regular sector size case, but not for subpage case, as
for subpage we only mark the page uptodate if the page contains no hole
and all its extent buffers are uptodate.

So instead of checking PageUptodate(), for subpage case we check the
uptodate bitmap of btrfs_subpage structure.

To make the check more elegant, introduce a helper,
assert_eb_page_uptodate() to do the check for both subpage and regular
sector size cases.

The following functions are involved:

- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer()
- extent_buffer_test_bit()
- extent_buffer_bitmap_set()
- extent_buffer_bitmap_clear()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b8f95771

btrfs: make alloc_extent_buffer() check subpage dirty bitmap · 1e5eb3d6

Qu Wenruo authored Mar 25, 2021

In alloc_extent_buffer(), we make sure that the newly allocated page is
never dirty.

This is fine for sector size == PAGE_SIZE case, but for subpage it's
possible that one extent buffer in the page is dirty, thus the whole
page is marked dirty, and could cause false alert.

To support subpage, call btrfs_page_test_dirty() to handle both cases.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

1e5eb3d6

btrfs: subpage: support metadata checksum calculation at write time · eca0f6f6

Qu Wenruo authored Mar 25, 2021

Add a new helper, csum_dirty_subpage_buffers(), to iterate through all
dirty extent buffers in one bvec.

Also extract the code of calculating csum for one extent buffer into
csum_one_extent_buffer(), so that both the existing csum_dirty_buffer()
and the new csum_dirty_subpage_buffers() can reuse the same routine.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

eca0f6f6

btrfs: subpage: do more sanity checks on metadata page dirtying · 139e8cd3

Qu Wenruo authored Mar 25, 2021

For btree_set_page_dirty(), we should also check the extent buffer
sanity for subpage support.

Unlike the regular sector size case, since one page can contain multiple
extent buffers, we need to make sure there is at least one dirty extent
buffer in the page.

So this patch will iterate through the btrfs_subpage::dirty_bitmap
to get the extent buffers, and check if any dirty extent buffer in the page
range has EXTENT_BUFFER_DIRTY and proper refs.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

139e8cd3

btrfs: subpage: introduce helpers for writeback status · 3470da3b

Qu Wenruo authored Mar 25, 2021

Introduces the following functions to handle subpage writeback status:

- btrfs_subpage_set_writeback()
- btrfs_subpage_clear_writeback()
- btrfs_subpage_test_writeback()
  These helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_writeback()
- btrfs_page_clear_writeback()
- btrfs_page_test_writeback()
  These helpers can handle both regular sector size and subpage without
  problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3470da3b

btrfs: subpage: introduce helpers for dirty status · d8a5713e

Qu Wenruo authored Mar 25, 2021

Introduce the following functions to handle subpage dirty status:

- btrfs_subpage_set_dirty()
- btrfs_subpage_clear_dirty()
- btrfs_subpage_test_dirty()
  These helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_dirty()
- btrfs_page_clear_dirty()
- btrfs_page_test_dirty()
  These helpers can handle both regular sector size and subpage without
  problem.
  Thus they would be used to replace PageDirty() related calls in
  later patches.

There is one special point to note here, just like set_page_dirty() and
clear_page_dirty_for_io(), btrfs_*page_set_dirty() and
btrfs_*page_clear_dirty() must be called with page locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d8a5713e

btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage() · d239bcb8

Qu Wenruo authored Mar 25, 2021

In btrfs_invalidatepage() we re-declare @tree variable as
btrfs_ordered_inode_tree.

Since it's only used to do the spinlock, we can grab it from inode
directly, and remove the unnecessary declaration completely.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

d239bcb8

btrfs: use min() to replace open-code in btrfs_invalidatepage() · ac5804eb

Qu Wenruo authored Mar 25, 2021

In btrfs_invalidatepage() we introduce a temporary variable, new_len, to
update ordered->truncated_len.  But we can use min() to replace it
completely and no need for the variable.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ac5804eb

btrfs: add sysfs interface for supported sectorsize · fc57ad8d

Qu Wenruo authored Mar 25, 2021

Export supported sector sizes in /sys/fs/btrfs/features/supported_sectorsizes.

Currently all architectures have PAGE_SIZE, There's some disparity
between read-only and read-write support but that will be unified in the
future so there's only one file exporting the size.

The read-only support for systems with 64K pages also works for 4K
sector size.

This new sysfs interface would help eg. mkfs.btrfs to print more
accurate warnings about potentially incompatible option combinations.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

fc57ad8d

btrfs: improve btree readahead for full send operations · ace75066

Filipe Manana authored Mar 31, 2021

Currently a full send operation uses the standard btree readahead when
iterating over the subvolume/snapshot btree, which despite bringing good
performance benefits, it could be improved in a few aspects for use cases
such as full send operations, which are guaranteed to visit every node
and leaf of a btree, in ascending and sequential order. The limitations
of that standard btree readahead implementation are the following:

1) It only triggers readahead for leaves that are physically close
   to the leaf being read, within a 64K range;

2) It only triggers readahead for the next or previous leaves if the
   leaf being read is not currently in memory;

3) It never triggers readahead for nodes.

So add a new readahead mode that addresses all these points and use it
for full send operations.

The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj
  MKFS_OPTIONS="--nodesize 16384"     # default, just to be explicit
  MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit

  mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create files with inline data to make it easier and faster to create
  # large btrees.
  add_files()
  {
      local total=$1
      local start_offset=$2
      local number_jobs=$3
      local total_per_job=$(($total / $number_jobs))

      echo "Creating $total new files using $number_jobs jobs"
      for ((n = 0; n < $number_jobs; n++)); do
          (
              local start_num=$(($start_offset + $n * $total_per_job))
              for ((i = 1; i <= $total_per_job; i++)); do
                  local file_num=$((start_num + $i))
                  local file_path="$MNT/file_${file_num}"
                  xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
                  if [ $? -ne 0 ]; then
                      echo "Failed creating file $file_path"
                      break
                  fi
              done
          ) &
          worker_pids[$n]=$!
      done

      wait ${worker_pids[@]}

      sync
      echo
      echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
  }

  initial_file_count=500000
  add_files $initial_file_count 0 4

  echo
  echo "Creating first snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap1

  echo
  echo "Adding more files..."
  add_files $((initial_file_count / 4)) $initial_file_count 4

  echo
  echo "Updating 1/50th of the initial files..."
  for ((i = 1; i < $initial_file_count; i += 50)); do
      xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
  done

  echo
  echo "Creating second snapshot..."
  btrfs subvolume snapshot -r $MNT $MNT/snap2

  umount $MNT

  echo 3 > /proc/sys/vm/drop_caches
  blockdev --flushbufs $DEV &> /dev/null
  hdparm -F $DEV &> /dev/null

  mount $MOUNT_OPTIONS $DEV $MNT

  echo
  echo "Testing full send..."
  start=$(date +%s)
  btrfs send $MNT/snap1 > /dev/null
  end=$(date +%s)
  echo
  echo "Full send took $((end - start)) seconds"

  umount $MNT

The durations of the full send operation in seconds were the following:

Before this change:  217 seconds
After this change:   205 seconds (-5.7%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ace75066

btrfs: fix exhaustion of the system chunk array due to concurrent allocations · eafa4fd0

Filipe Manana authored Mar 31, 2021

When we are running out of space for updating the chunk tree, that is,
when we are low on available space in the system space info, if we have
many task concurrently allocating block groups, via fallocate for example,
many of them can end up all allocating new system chunks when only one is
needed. In extreme cases this can lead to exhaustion of the system chunk
array, which has a size limit of 2048 bytes, and results in a transaction
abort with errno EFBIG, producing a trace in dmesg like the following,
which was triggered on a PowerPC machine with a node/leaf size of 64K:

  [1359.518899] ------------[ cut here ]------------
  [1359.518980] BTRFS: Transaction aborted (error -27)
  [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
  [1359.519152] Modules linked in: (...)
  [1359.519239] Supported: Yes, External
  [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G               X    5.3.18-47-default #1 SLE15-SP3
  [1359.519274] NIP:  c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8
  [1359.519293] REGS: c00000056890b700 TRAP: 0700   Tainted: G               X     (5.3.18-47-default)
  [1359.519317] MSR:  800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE>  CR: 48008222  XER: 00000007
  [1359.519356] CFAR: c00000000013e170 IRQMASK: 0
  [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026
  [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007
  [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000
  [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0
  [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff
  [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000
  [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0
  [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08
  [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
  [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
  [1359.519626] Call Trace:
  [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
  [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
  [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
  [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs]
  [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
  [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
  [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
  [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170
  [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278
  [1359.520037] Instruction dump:
  [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000
  [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378
  [1359.520122] ---[ end trace d6c186e151022e20 ]---

The following steps explain how we can end up in this situation:

1) Task A is at check_system_chunk(), either because it is allocating a
   new data or metadata block group, at btrfs_chunk_alloc(), or because
   it is removing a block group or turning a block group RO. It does not
   matter why;

2) Task A sees that there is not enough free space in the system
   space_info object, that is 'left' is < 'thresh'. And at this point
   the system space_info has a value of 0 for its 'bytes_may_use'
   counter;

3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
   a new system block group (chunk) and then reserves 'thresh' bytes in
   the chunk block reserve with the call to btrfs_block_rsv_add(). This
   changes the chunk block reserve's 'reserved' and 'size' counters by an
   amount of 'thresh', and changes the 'bytes_may_use' counter of the
   system space_info object from 0 to 'thresh'.

   Also during its call to btrfs_alloc_chunk(), we end up increasing the
   value of the 'total_bytes' counter of the system space_info object by
   8MiB (the size of a system chunk stripe). This happens through the
   call chain:

   btrfs_alloc_chunk()
       create_chunk()
           btrfs_make_block_group()
               btrfs_update_space_info()

4) After it finishes the first phase of the block group allocation, at
   btrfs_chunk_alloc(), task A unlocks the chunk mutex;

5) At this point the new system block group was added to the transaction
   handle's list of new block groups, but its block group item, device
   items and chunk item were not yet inserted in the extent, device and
   chunk trees, respectively. That only happens later when we call
   btrfs_finish_chunk_alloc() through a call to
   btrfs_create_pending_block_groups();

   Note that only when we update the chunk tree, through the call to
   btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
   of the chunk block reserve as we COW/allocate extent buffers,
   through:

   btrfs_alloc_tree_block()
      btrfs_use_block_rsv()
         btrfs_block_rsv_use_bytes()

   And the system space_info's 'bytes_may_use' is decremented everytime
   we allocate an extent buffer for COW operations on the chunk tree,
   through:

   btrfs_alloc_tree_block()
      btrfs_reserve_extent()
         find_free_extent()
            btrfs_add_reserved_bytes()

   If we end up COWing less chunk btree nodes/leaves than expected, which
   is the typical case since the amount of space we reserve is always
   pessimistic to account for the worst possible case, we release the
   unused space through:

   btrfs_create_pending_block_groups()
      btrfs_trans_release_chunk_metadata()
         btrfs_block_rsv_release()
            block_rsv_release_bytes()
                btrfs_space_info_free_bytes_may_use()

   But before task A gets into btrfs_create_pending_block_groups()...

6) Many other tasks start allocating new block groups through fallocate,
   each one does the first phase of block group allocation in a
   serialized way, since btrfs_chunk_alloc() takes the chunk mutex
   before calling check_system_chunk() and btrfs_alloc_chunk().

   However before everyone enters the final phase of the block group
   allocation, that is, before calling btrfs_create_pending_block_groups(),
   new tasks keep coming to allocate new block groups and while at
   check_system_chunk(), the system space_info's 'bytes_may_use' keeps
   increasing each time a task reserves space in the chunk block reserve.
   This means that eventually some other task can end up not seeing enough
   free space in the system space_info and decide to allocate yet another
   system chunk.

   This may repeat several times if yet more new tasks keep allocating
   new block groups before task A, and all the other tasks, finish the
   creation of the pending block groups, which is when reserved space
   in excess is released. Eventually this can result in exhaustion of
   system chunk array in the superblock, with btrfs_add_system_chunk()
   returning EFBIG, resulting later in a transaction abort.

   Even when we don't reach the extreme case of exhausting the system
   array, most, if not all, unnecessarily created system block groups
   end up being unused since when finishing creation of the first
   pending system block group, the creation of the following ones end
   up not needing to COW nodes/leaves of the chunk tree, so we never
   allocate and deallocate from them, resulting in them never being
   added to the list of unused block groups - as a consequence they
   don't get deleted by the cleaner kthread - the only exceptions are
   if we unmount and mount the filesystem again, which adds any unused
   block groups to the list of unused block groups, if a scrub is
   run, which also adds unused block groups to the unused list, and
   under some circumstances when using a zoned filesystem or async
   discard, which may also add unused block groups to the unused list.

So fix this by:

*) Tracking the number of reserved bytes for the chunk tree per
   transaction, which is the sum of reserved chunk bytes by each
   transaction handle currently being used;

*) When there is not enough free space in the system space_info,
   if there are other transaction handles which reserved chunk space,
   wait for some of them to complete in order to have enough excess
   reserved space released, and then try again. Otherwise proceed with
   the creation of a new system chunk.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

eafa4fd0

btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags · b7a7a834

Filipe Manana authored Mar 23, 2021

If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
file that has the S_SYNC attribute set, we totally ignore that and do not
durably persist the reflink changes. Since a reflink can change the data
readable from a file (and mtime/ctime, or a file size), it makes sense to
durably persist (fsync) the source and destination files/ranges.

This was previously discussed at:

https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/

The recently introduced test case generic/628, from fstests, exercises
these scenarios and currently fails without this change.

So make sure we fsync the source and destination files/ranges when either
of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
just like XFS already does.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

b7a7a834

btrfs: zoned: bail out in btrfs_alloc_chunk for bad input · bb05b298

Arnd Bergmann authored Mar 23, 2021

gcc complains that the ctl->max_chunk_size member might be used
uninitialized when none of the three conditions for initializing it in
init_alloc_chunk_ctl_policy_zoned() are true:

In function ‘init_alloc_chunk_ctl_policy_zoned’,
    inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
    inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized]
 4998 |         ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
      |                               ^~~
fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
 5316 |         struct alloc_chunk_ctl ctl;
      |                                ^~~

If we ever get into this condition, something is seriously
wrong, as validity is checked in the callers

  btrfs_alloc_chunk
    init_alloc_chunk_ctl
      init_alloc_chunk_ctl_policy_zoned

so the same logic as in init_alloc_chunk_ctl_policy_regular()
and a few other places should be applied. This avoids both further
data corruption, and the compile-time warning.

Fixes: 1cd6121f ("btrfs: zoned: implement zoned chunk allocator")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bb05b298

btrfs: fix a potential hole punching failure · 3227788c

BingJing Chang authored Mar 25, 2021

In commit d7781546 ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.

In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.

We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
  while (cur_offset < end) {
	  ...
	  // update cur_offset & len
	  // advance cur_offset & len in hole-punching case if needed
  }
Reported-by: Robbie Ko <robbieko@synology.com>
Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>

3227788c

btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex · e75f9fd1

Naohiro Aota authored Mar 24, 2021

Commit 6e37d245 ("btrfs: zoned: fix deadlock on log sync") pointed out
a deadlock warning and removed mutex_{lock,unlock} of fs_info::tree_root->log_mutex.
While it looks like it always cause a deadlock, we didn't see actual
deadlock in fstests runs. The reason is log_root_tree->log_mutex !=
fs_info->tree_root->log_mutex, not taking the same lock. So, the warning
was actually a false-positive.

Since btrfs_alloc_log_tree_node() is protected only by
fs_info->tree_root->log_mutex, we can (and should) move the code out of
the lock scope of log_root_tree->log_mutex and silence the warning.

Fixes: 6e37d245 ("btrfs: zoned: fix deadlock on log sync")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e75f9fd1

btrfs: use percpu_read_positive instead of sum_positive for need_preempt · 2cdb3909

Josef Bacik authored Mar 24, 2021

Looking at perf data for a fio workload I noticed that we were spending
a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
need_preemptive_reclaim.  This is silly, as we only want to know if we
have more ordered than delalloc to see if we should be counting the
delayed items in our threshold calculation.  Change this to
percpu_read_positive() to avoid the overhead.

I ran this through fsperf to validate the changes, obviously the latency
numbers in dbench and fio are quite jittery, so take them as you wish,
but overall the improvements on throughput, iops, and bw are all
positive.  Each test was run two times, the given value is the average
of both runs for their respective column.

  btrfs ssd normal test results

  bufferedrandwrite16g results
       metric         baseline   current          diff
  ==========================================================
  write_io_kbytes     16777216   16777216     0.00%
  read_clat_ns_p99           0          0     0.00%
  write_bw_bytes      1.04e+08   1.05e+08     1.12%
  read_iops                  0          0     0.00%
  write_clat_ns_p50      13888      11840   -14.75%
  read_io_kbytes             0          0     0.00%
  read_io_bytes              0          0     0.00%
  write_clat_ns_p99      35008      29312   -16.27%
  read_bw_bytes              0          0     0.00%
  elapsed                  170        167    -1.76%
  write_lat_ns_min     4221.50    3762.50   -10.87%
  sys_cpu                39.65      35.37   -10.79%
  write_lat_ns_max    2.67e+10   2.50e+10    -6.63%
  read_lat_ns_min            0          0     0.00%
  write_iops          25270.10   25553.43     1.12%
  read_lat_ns_max            0          0     0.00%
  read_clat_ns_p50           0          0     0.00%

  dbench60 results
    metric     baseline   current         diff
  ==================================================
  qpathinfo       11.12     12.73    14.52%
  throughput     416.09    445.66     7.11%
  flush         3485.63   1887.55   -45.85%
  qfileinfo        0.70      1.92   173.86%
  ntcreatex      992.60    695.76   -29.91%
  qfsinfo          2.43      3.71    52.48%
  close            1.67      3.14    88.09%
  sfileinfo       66.54    105.20    58.10%
  rename         809.23    619.59   -23.43%
  find            16.88     15.46    -8.41%
  unlink         820.54    670.86   -18.24%
  writex        3375.20   2637.91   -21.84%
  deltree        386.33    449.98    16.48%
  readx            3.43      3.41    -0.60%
  mkdir            0.05      0.03   -38.46%
  lockx            0.26      0.26    -0.76%
  unlockx          0.81      0.32   -60.33%

  dio4kbs16threads results
       metric          baseline       current           diff
  ================================================================
  write_io_kbytes         5249676       3357150   -36.05%
  read_clat_ns_p99              0             0     0.00%
  write_bw_bytes      89583501.50   57291192.50   -36.05%
  read_iops                     0             0     0.00%
  write_clat_ns_p50        242688        263680     8.65%
  read_io_kbytes                0             0     0.00%
  read_io_bytes                 0             0     0.00%
  write_clat_ns_p99      15826944      36732928   132.09%
  read_bw_bytes                 0             0     0.00%
  elapsed                      61            61     0.00%
  write_lat_ns_min          42704         42095    -1.43%
  sys_cpu                    5.27          3.45   -34.52%
  write_lat_ns_max       7.43e+08      9.27e+08    24.71%
  read_lat_ns_min               0             0     0.00%
  write_iops             21870.97      13987.11   -36.05%
  read_lat_ns_max               0             0     0.00%
  read_clat_ns_p50              0             0     0.00%

  randwrite2xram results
       metric          baseline       current           diff
  ================================================================
  write_io_kbytes        24831972      28876262    16.29%
  read_clat_ns_p99              0             0     0.00%
  write_bw_bytes      83745273.50   92182192.50    10.07%
  read_iops                     0             0     0.00%
  write_clat_ns_p50         13952         11648   -16.51%
  read_io_kbytes                0             0     0.00%
  read_io_bytes                 0             0     0.00%
  write_clat_ns_p99         50176         52992     5.61%
  read_bw_bytes                 0             0     0.00%
  elapsed                     314           332     5.73%
  write_lat_ns_min        5920.50          5127   -13.40%
  sys_cpu                    7.82          7.35    -6.07%
  write_lat_ns_max       5.27e+10      3.88e+10   -26.44%
  read_lat_ns_min               0             0     0.00%
  write_iops             20445.62      22505.42    10.07%
  read_lat_ns_max               0             0     0.00%
  read_clat_ns_p50              0             0     0.00%

  untarfirefox results
  metric    baseline   current        diff
  ==============================================
  elapsed      47.41     47.40   -0.03%
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2cdb3909

btrfs: update outdated comment at btrfs_replace_file_extents() · e2b84217

Filipe Manana authored Mar 26, 2021

There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.

That is not true anymore since commit 5e548b32 ("btrfs: do not set
the full sync flag on the inode during page release"), so update the
comment to remove that part and rephrase it slightly to make it more
clear why the full sync flag is set at btrfs_replace_file_extents().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

e2b84217

btrfs: update outdated comment at btrfs_orphan_cleanup() · 0c0218e9

Filipe Manana authored Mar 16, 2021

btrfs_orphan_cleanup() has a comment referring to find_dead_roots, but
function does not exists since commit cb517eab ("Btrfs: cleanup the
similar code of the fs root read"). What we use now to find and load dead
roots is btrfs_find_orphan_roots(). So update the comment and make it a
bit more detailed about why we can not delete an orphan item for a root.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

0c0218e9

btrfs: update debug message when checking seq number of a delayed ref · ffbc10a1

Filipe Manana authored Mar 11, 2021

We used to encode two different numbers in the tree mod log counter used
for sequence numbers, one in the upper 32 bits and the other one in the
lower 32 bits. However that is no longer the case, we stopped doing that
since commit fcebe456 ("Btrfs: rework qgroup accounting").

So update the debug message at btrfs_check_delayed_seq to stop extracting
the two 32 bits counters and print instead the 64 bits sequence numbers.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ffbc10a1

btrfs: add and use helper to get lowest sequence number for the tree mod log · 4bae7880

Filipe Manana authored Mar 11, 2021

There are two places outside the tree mod log module that extract the
lowest sequence number of the tree mod log. These places end up
duplicating code and open coding the logic and internal implementation
details of the tree mod log. So add a helper to the tree mod log module
and header that returns the lowest sequence number or 0 if there aren't
any tree mod log users at the moment.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

4bae7880

btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb() · ffe1d039

Filipe Manana authored Mar 11, 2021

At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf,
and if so, return immediately and do nothing. However this check can be
removed, because after it we call tree_mod_need_log(), which returns
false when given an extent buffer that corresponds to a leaf.

So just remove the leaf check and pass the extent buffer to
tree_mod_need_log().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

ffe1d039

btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block() · 888dd183

Filipe Manana authored Mar 11, 2021

Instead of exposing implementation details of the tree mod log to check
if there are active tree mod log users at btrfs_free_tree_block(), use
the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This
way extent-tree.c does not need to known about any of the internals of
the tree mod log and avoids taking a lock unnecessarily as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

888dd183

btrfs: use a bit to track the existence of tree mod log users · bc03f39e

Filipe Manana authored Mar 11, 2021

The tree modification log functions are called very frequently, basically
they are called every time a btree is modified (a pointer added or removed
to a node, a new root for a btree is set, etc). Because of that, to avoid
heavy lock contention on the lock that protects the list of tree mod log
users, we have checks that test the emptiness of the list with a full
memory barrier before the checks, so that when there are no tree mod log
users we avoid taking the lock.

Replace the memory barrier and list emptiness check with a test for a new
bit set at fs_info->flags. This bit is used to indicate when there are
tree mod log users, set whenever a user is added to the list and cleared
when the last user is removed from the list. This makes the intention a
bit more obvious and possibly more efficient (assuming test_bit() may be
cheaper than a full memory barrier on some architectures).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

bc03f39e

btrfs: use booleans where appropriate for the tree mod log functions · 406808ab

Filipe Manana authored Mar 11, 2021

Several functions of the tree modification log use integers as booleans,
so change them to use booleans instead, making their use more clear.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

406808ab