Commits · 873555f04d81b49a96ea03b37dcd499c13e67742 · Kirill Smelkov / linux

22 Oct, 2023 40 commits

bcachefs: more aggressive fast path write buffer key flushing · 873555f0

Brian Foster authored Mar 17, 2023

The btree write buffer flush code is prone to causing journal
deadlock due to inefficient use and release of reservation space.
Reservation is not pre-reserved for write buffered keys (as is done
for key cache keys, for example), because the write buffer flush
side uses a fast path that attempts insertion without need for any
reservation at all.

The write buffer flush attempts to deal with this by inserting keys
using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on
journal reservations that require blocking. Upon first error, it
falls back to a slow path that inserts in journal order and supports
moving the associated journal pin forward.

The problem is that under pathological conditions (i.e. smaller log,
larger write buffer and journal reservation pressure), we've seen
instances where the fast path fails fairly quickly without having
completed many insertions, and then the slow path is unable to push
the journal pin forward enough to free up the space it needs to
completely flush the buffer. This problem is occasionally reproduced
by fstest generic/333.

To avoid this problem, update the fast path algorithm to skip key
inserts that fail due to inability to acquire needed journal
reservation without immediately breaking out of the loop. Instead,
insert as many keys as possible, zap the sequence numbers to mark
them as processed, and then fall back to the slow path to process
the remaining set in journal order. This reduces the amount of
journal reservation that might be required to flush the entire
buffer and increases the odds that the slow path is able to move the
journal pin forward and free up space as keys are processed.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
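
The difference between the old and new fast paths can be sketched in userspace C. This is a toy model, not the bcachefs code: `struct wb_key`, `try_insert()`, `wb_flush()` and the `needs_res` field are hypothetical stand-ins, and an insert "fails" only when it needs a journal reservation and may not block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a write-buffered key; not the bcachefs layout. */
struct wb_key {
	uint64_t journal_seq;	/* 0 means "already processed" */
	uint64_t pos;
	bool	 needs_res;	/* toy model: insert needs a journal reservation */
};

/* Toy insert: fails if a reservation is needed and blocking is not allowed. */
static int try_insert(const struct wb_key *k, bool may_block)
{
	return (k->needs_res && !may_block) ? -1 : 0;
}

static int cmp_seq(const void *l, const void *r)
{
	const struct wb_key *a = l, *b = r;

	return (a->journal_seq > b->journal_seq) -
	       (a->journal_seq < b->journal_seq);
}

/* Returns how many keys were left for the slow path. */
static size_t wb_flush(struct wb_key *keys, size_t nr, bool skip_failures)
{
	size_t i, slow = 0;

	/* Fast path: non-blocking inserts in buffer order. */
	for (i = 0; i < nr; i++) {
		if (!try_insert(&keys[i], false)) {
			keys[i].journal_seq = 0;	/* zap: processed */
		} else if (skip_failures) {
			slow++;				/* new: keep going */
		} else {
			slow = nr - i;			/* old: bail on first error */
			break;
		}
	}

	if (slow) {
		/* Slow path: journal order, skipping zapped keys. */
		qsort(keys, nr, sizeof(keys[0]), cmp_seq);
		for (i = 0; i < nr; i++) {
			if (!keys[i].journal_seq)
				continue;
			try_insert(&keys[i], true);
			keys[i].journal_seq = 0;
		}
	}
	return slow;
}
```

With one reservation-hungry key at the head of a four-key buffer, the old behaviour hands all four keys to the slow path, while skipping failures leaves only one.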


bcachefs: use dedicated workqueue for tasks holding write refs · 8bff9875

Brian Foster authored Mar 23, 2023

A workqueue resource deadlock has been observed when running fsck
on a filesystem with a full/stuck journal. fsck is not currently
able to repair the fs due to fairly rapid emergency shutdown, but
rather than exit gracefully the fsck process hangs during the
shutdown sequence. Fortunately this is easily recoverable from
userspace, but the root cause involves code shared between the
kernel and userspace and so should be addressed.

The deadlock scenario involves the main task in the bch2_fs_stop()
-> bch2_fs_read_only() path waiting on write references to drain
with the fs state lock held. A bch2_read_only_work() workqueue task
is scheduled on the system_long_wq, blocked on the state lock.
Finally, various other write ref holding workqueue tasks are
scheduled to run on the same workqueue and must complete in order to
release references that the initial task is waiting on.

To avoid this problem, we can split the dependent workqueue tasks
across different workqueues. It's a bit of a waste to create a
dedicated wq for the read-only worker, but there are several tasks
throughout the fs that follow the pattern of acquiring a write
reference and then scheduling to the system wq. Use a local wq
for such tasks to break the subtle dependency between these and the
read-only worker.
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: remove unused bch2_trans_log_msg() · 76c70c57

Brian Foster authored Mar 22, 2023

Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix bch2_verify_bucket_evacuated() · ffc76edb

Kent Overstreet authored Mar 27, 2023

We were going into an infinite loop when printing out backpointers, due
to never incrementing bp_offset - whoops.

Also limit the number of backpointers we print to 10; this is debug code
and we only need to print a sample, not all of them.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
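
As a rough illustration of both changes, the loop below advances its cursor on every iteration and stops after a fixed sample. It is a userspace sketch only: `bp_exists()` and `print_backpointers()` are hypothetical helpers, and the real lookup is replaced by a simple range check.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PRINTED 10	/* sample size named in the commit message */

/* Hypothetical lookup: nonzero if a backpointer exists at bp_offset. */
static int bp_exists(uint64_t bp_offset, uint64_t nr_bps)
{
	return bp_offset < nr_bps;
}

/* Prints at most MAX_PRINTED backpointers; returns how many were printed. */
static unsigned print_backpointers(uint64_t nr_bps)
{
	uint64_t bp_offset = 0;
	unsigned printed = 0;

	while (printed < MAX_PRINTED && bp_exists(bp_offset, nr_bps)) {
		printf("backpointer at offset %llu\n",
		       (unsigned long long) bp_offset);
		bp_offset++;	/* without this step the loop never terminates */
		printed++;
	}
	return printed;
}
```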


bcachefs: verify_bucket_evacuated() -> set_btree_iter_dontneed() · d59ca7e8

Kent Overstreet authored Mar 19, 2023

This should help with excessive 'would deadlock' transaction restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Make reconstruct_alloc quieter · 330970c2

Kent Overstreet authored Mar 19, 2023

We shouldn't be printing out fsck errors for expected errors - this
helps make test logs more readable, and makes it easier to see what the
actual failure was.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix an unhandled transaction restart error · 3e36e572

Kent Overstreet authored Mar 19, 2023

This is a bit awkward: we're passing around a btree_trans, but we're not
in a context where transaction restarts are handled - we should try to
come up with a better way to denote situations like this.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix nocow write path closure bug · dc6274bc

Kent Overstreet authored Mar 19, 2023

With regular waitlists, we need to ensure we always call finish_wait().
With closures, the equivalent is that we need to call closure_sync()
before returning with a stack-allocated closure.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Nocow write error path fix · ac77810c

Kent Overstreet authored Mar 19, 2023

The nocow write error path was iterating over pointers in an extent,
after we'd dropped btree locks - oops.

Fortunately we'd already stashed what we need in nocow_lock_bucket, so
use that instead.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix bch2_extent_fallocate() in nocow mode · abab7609

Kent Overstreet authored Mar 17, 2023

When we allocate disk space, we need to be incrementing the WRITE io
clock, which perhaps should be renamed to sectors allocated - copygc
uses this io clock to know when to run.

Also, we should be incrementing the same clock when allocating btree
nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Add an assert in inode_write for -ENOENT · 711bf946

Kent Overstreet authored Mar 15, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix bch2_evict_subvolume_inodes() · 9edbcc72

Kent Overstreet authored Mar 15, 2023

This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache()
doesn't handle the case where i_count is already 0; we need to grab and
put the inode in order for it to be dropped.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Improve error handling in bch2_ioctl_subvolume_destroy() · e1e7ecaf

Kent Overstreet authored Mar 16, 2023

Pure style fixes
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix for 'missing subvolume' error · 2d33036c

Kent Overstreet authored Mar 16, 2023

Subvolumes, including their root inodes, get deleted asynchronously
after an unlink. But we still need to ensure that we tell the VFS the
inode has been deleted, otherwise VFS writeback could fire after
asynchronous deletion has finished, and try to write to an
inode/subvolume that no longer exists.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Don't run transaction hooks multiple times · 56cc033d

Kent Overstreet authored Mar 15, 2023

Transaction hooks aren't supposed to run unless we know the transaction
is going to commit successfully: this fixes a bug with attempting to
delete a subvolume multiple times.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
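
The rule can be sketched as a commit loop that may restart any number of times but runs its hooks exactly once, after the last attempt has succeeded. Everything here is a hypothetical userspace model (`struct toy_trans`, `toy_try_commit()`, `toy_commit()`); the real commit path is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HOOKS 4

typedef void (*toy_hook_fn)(void *arg);

/* Hypothetical transaction; field names are illustrative only. */
struct toy_trans {
	toy_hook_fn	hooks[TOY_MAX_HOOKS];
	void		*args[TOY_MAX_HOOKS];
	size_t		nr_hooks;
	int		restarts_left;	/* toy model: attempts that fail first */
};

/* One commit attempt: returns -1 to request a restart, 0 on success. */
static int toy_try_commit(struct toy_trans *t)
{
	if (t->restarts_left > 0) {
		t->restarts_left--;
		return -1;
	}
	return 0;
}

/* Retries until the commit succeeds; returns the number of attempts. */
static int toy_commit(struct toy_trans *t)
{
	size_t i;
	int attempts = 1;

	while (toy_try_commit(t))
		attempts++;	/* restart: hooks have not run yet */

	/* Past the point of failure: now it is safe to run each hook once. */
	for (i = 0; i < t->nr_hooks; i++)
		t->hooks[i](t->args[i]);

	return attempts;
}

/* Example hook: counts how often it was invoked. */
static void toy_count_hook(void *arg)
{
	++*(int *) arg;
}
```

Running the hooks inside the retry loop instead would invoke them once per attempt, which is the shape of the double-delete bug described above.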


bcachefs: Add a fallback when journal_keys doesn't fit in ram · 26559553

Kent Overstreet authored Mar 15, 2023

We may end up in a situation where allocating the buffer for the sorted
journal_keys fails, even though it would likely succeed after compaction,
where we drop duplicates.

We've had reports of this allocation failing, so this adds a slowpath to
do the compaction incrementally.

This is only a band-aid fix; we need to look at limiting the number of
keys in the journal based on the amount of system RAM.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
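
One way to picture incremental compaction is a fixed-size buffer that is sorted and deduplicated in place whenever it fills up, so an add only fails if the buffer is still full afterwards. This is a sketch under assumptions: `struct jkey`, `jkeys_compact()` and `jkeys_add()` are hypothetical, and it assumes the newest key (highest seq) wins among duplicates at the same position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical journal key: position plus the journal seq it came from. */
struct jkey {
	uint64_t pos;
	uint64_t seq;
};

struct jkeys {
	struct jkey	*d;
	size_t		nr, size;
};

/* Sort by pos; within one pos, newest (highest seq) first. */
static int jkey_cmp(const void *l, const void *r)
{
	const struct jkey *a = l, *b = r;

	if (a->pos != b->pos)
		return a->pos < b->pos ? -1 : 1;
	return (a->seq < b->seq) - (a->seq > b->seq);
}

/* Sort in place and keep only the newest key for each pos. */
static void jkeys_compact(struct jkeys *k)
{
	size_t src, dst = 0;

	qsort(k->d, k->nr, sizeof(k->d[0]), jkey_cmp);
	for (src = 0; src < k->nr; src++)
		if (!dst || k->d[dst - 1].pos != k->d[src].pos)
			k->d[dst++] = k->d[src];
	k->nr = dst;
}

/* Slow path: compact when full; fail only if that frees nothing. */
static int jkeys_add(struct jkeys *k, struct jkey key)
{
	if (k->nr == k->size)
		jkeys_compact(k);
	if (k->nr == k->size)
		return -1;
	k->d[k->nr++] = key;
	return 0;
}
```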


bcachefs: Improve the backpointer to missing extent message · 2f081584

Kent Overstreet authored Mar 14, 2023

We now print the pos where the backpointer was found in the btree, as
well as the exact bucket:bucket_offset of the data, to aid in grepping
through logs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Add error message for failing to allocate sorted journal keys · 40a18fe2

Kent Overstreet authored Mar 14, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: New erasure coding shutdown path · b40901b0

Kent Overstreet authored Mar 13, 2023

This implements a new shutdown path for erasure coding, which is needed
for the upcoming BCH_WRITE_WAIT_FOR_EC write path.

The process is:
 - Cancel new stripes being built up
 - Close out/cancel open buckets on write points or the partial list
   that are for stripes
 - Shut down rebalance/copygc
 - Then wait for in flight new stripes to finish

With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill
up before they complete; the new ec shutdown path is needed for shutting
down copygc/rebalance without deadlocking.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: bch2_fs_moving_ctxts_to_text() · b9fa375b

Kent Overstreet authored Mar 11, 2023

This also adds bch2_write_op_to_text(): now we can see outstanding moves,
useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC
and likely for other things in the future.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Private error codes: ENOMEM · 65d48e35

Kent Overstreet authored Mar 14, 2023

This adds private error codes for most (but not all) of our ENOMEM uses,
which makes it easier to track down assorted allocation failures.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
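
The general idea of private error codes can be sketched with an x-macro table: each call site gets its own code, and every private code maps back to a standard errno class for callers that only understand errno. The `TOY_` names, the numbering and the three entries below are hypothetical and are not taken from the bcachefs error code table.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical private codes: x(standard errno class, private name). */
#define TOY_ERRCODES()					\
	x(ENOMEM, ENOMEM_journal_keys_sort)		\
	x(ENOMEM, ENOMEM_stripe_buf)			\
	x(ENOMEM, ENOMEM_btree_node)

enum toy_errcode {
	TOY_ERR_START = 2048,	/* above the standard errno range */
#define x(class, err) TOY_ERR_##err,
	TOY_ERRCODES()
#undef x
	TOY_ERR_MAX
};

/* Standard errno class of each private code. */
static const int toy_err_class[] = {
#define x(class, err) [TOY_ERR_##err - TOY_ERR_START - 1] = class,
	TOY_ERRCODES()
#undef x
};

/* Printable name of each private code. */
static const char * const toy_err_name[] = {
#define x(class, err) [TOY_ERR_##err - TOY_ERR_START - 1] = #err,
	TOY_ERRCODES()
#undef x
};

/* Map a (negative) private code to the negative errno generic code expects. */
static int toy_err_class_of(int err)
{
	err = err < 0 ? -err : err;
	if (err > TOY_ERR_START && err < TOY_ERR_MAX)
		return -toy_err_class[err - TOY_ERR_START - 1];
	return -err;
}

/* Name the exact failure site for logs; fall back to strerror(). */
static const char *toy_err_str(int err)
{
	err = err < 0 ? -err : err;
	if (err > TOY_ERR_START && err < TOY_ERR_MAX)
		return toy_err_name[err - TOY_ERR_START - 1];
	return strerror(err);
}
```

A log line built from `toy_err_str()` then identifies which allocation failed, while anything returned to generic callers still looks like plain `-ENOMEM`.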


bcachefs: Fix bch2_check_extents_to_backpointers() · 872c0311

Kent Overstreet authored Mar 14, 2023

In rare cases, bch2_check_extents_to_backpointers() would incorrectly
flag an extent as having a missing backpointer when we just needed to
flush the btree write buffer - we weren't tracking the last flushed
position correctly.

This adds a level field to the last_flushed pos, fixing a bug where we'd
sometimes fail on a new root node.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix an assert in copygc thread shutdown path · c639c29c

Kent Overstreet authored Mar 14, 2023

We're not supposed to have nested (locked) btree_trans on the stack:
this means copygc shutdown needs to exit our btree_trans before exiting
the move_ctxt, which calls bch2_write().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: bch2_bucket_is_movable() -> BTREE_ITER_CACHED · 2d004446

Kent Overstreet authored Mar 14, 2023

BTREE_ITER_CACHED should really be the default for cached btrees - this
is an easy mistake to make.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Don't use BTREE_ITER_INTENT in make_extent_indirect() · 3997989a

Kent Overstreet authored Mar 13, 2023

This is a workaround for a btree path overflow - searching with
BTREE_ITER_INTENT periodically saves the iterator position for updates,
which eventually overflows.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix stripe create error path · aebe7a67

Kent Overstreet authored Mar 13, 2023

If we errored out on a new stripe before fully allocating it, we
shouldn't be zeroing out unwritten data.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Mark new snapshots earlier in create path · ae1f5623

Kent Overstreet authored Mar 13, 2023

This fixes a null ptr deref when creating new snapshots: in the
BCH_CREATE_SUBVOL path, bch2_create_trans() will look up the subvolume
and find the _new_ snapshot that's being created in that same
transaction.

We have to call bch2_mark_snapshot() earlier so that it's properly
initialized, instead of leaving it for transaction commit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Improve bch2_new_stripes_to_text() · e6539b0a

Kent Overstreet authored Mar 11, 2023

Print out the alloc reserve, and format it a bit more nicely.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Kill bch_write_op->btree_update_ready · 751c025f

Kent Overstreet authored Mar 11, 2023

This changes the write path to not add write ops to the write_point's
list of pending work items until it's ready; this means we have to
change the lock protecting it to an irq-safe lock, but means
bch2_write_point_do_index_updates() no longer has to iterate over the
list, which is beneficial with the way the new BCH_WRITE_WAIT_FOR_EC
code works.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Simplify stripe_idx_to_delete · e28ef07e

Kent Overstreet authored Mar 10, 2023

This is not technically correct - it's subject to a race if we ever end
up with a stripe with all empty blocks (that needs to be deleted) being
held open. But the "correct" version was much too inefficient, and soon
we'll be adding a stripes LRU.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Fix next_bucket() · 46e14854

Kent Overstreet authored Mar 11, 2023

This fixes an infinite loop in bch2_get_key_or_real_bucket_hole().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Second layer of refcounting for new stripes · fba053d2

Kent Overstreet authored Mar 09, 2023

This will be used for move writes, which will be waiting until the
stripe is created to do the index update. They need to prevent the
stripe from being reclaimed until their index update is done, so we need
another refcount that just keeps the stripe open.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
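
A two-layer refcount of this kind might look roughly like the sketch below: one counter gates stripe creation, the other only keeps the stripe from being reclaimed. The names (`struct toy_stripe`, `TOY_REF_io`, `TOY_REF_stripe`) are hypothetical, and plain ints stand in for the atomic counters kernel code would use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ref types: I/O in flight vs. "just keep the stripe open". */
enum toy_ref {
	TOY_REF_io,
	TOY_REF_stripe,
	TOY_REF_NR
};

struct toy_stripe {
	int	ref[TOY_REF_NR];	/* plain ints; real code would use atomics */
	bool	create_started;
	bool	freed;
};

static void toy_stripe_get(struct toy_stripe *s, enum toy_ref r)
{
	s->ref[r]++;
}

static void toy_stripe_put(struct toy_stripe *s, enum toy_ref r)
{
	assert(s->ref[r] > 0);
	if (--s->ref[r])
		return;

	switch (r) {
	case TOY_REF_io:
		/* Last data write finished: the stripe can be created. */
		s->create_started = true;
		break;
	case TOY_REF_stripe:
		/* Last holder gone: only now may the stripe be reclaimed. */
		s->freed = true;
		break;
	default:
		break;
	}
}
```

In this model a move write takes a `TOY_REF_stripe` reference before waiting, so the stripe outlives its own creation until the move's index update is done and the reference is dropped.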



bcachefs: ec: fall back to creating new stripes for copygc · 10d9f7d2

Kent Overstreet authored Mar 10, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Rework __bch2_data_update_index_update() · 57c723de

Kent Overstreet authored Mar 04, 2023

This makes some improvements to the logic for adding/removing replicas,
as part of the larger erasure coding improvements. We now directly
consider number of replicas desired for the given inode, and
extent/pointer durability: this ensures that the extent ends up with the
desired number of replicas when we're replacing multiple pointers with
one that has higher durability (e.g. erasure coded).
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Extent helper improvements · 702ffea2

Kent Overstreet authored Mar 10, 2023

 - __bch2_bkey_drop_ptr() -> bch2_bkey_drop_ptr_noerror(), now available
   outside extents.

 - Split bch2_bkey_has_device() and bch2_bkey_has_device_c(), const and
   non const versions

 - bch2_extent_has_ptr() now returns the pointer it found
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: evacuate_bucket() no longer moves cached ptrs · 3f5d3fb4

Kent Overstreet authored Mar 10, 2023

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: evacuate_bucket() no longer calls verify_bucket_evacuated() · 5bf9db01

Kent Overstreet authored Mar 10, 2023

The copygc code itself now calls this when all moves from a given bucket
are complete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Suppress transaction restart err message · 51fe0332

Kent Overstreet authored Mar 10, 2023

This isn't a real error, and doesn't need to be printed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: Rework open bucket partial list allocation · 7635e1a6

Kent Overstreet authored Feb 25, 2023

Now, any open_bucket can go on the partial list: allocating from the
partial list has been moved to its own dedicated function,
open_bucket_add_buckets() -> bucket_alloc_set_partial().

In particular, this means that erasure coded buckets can safely go on
the partial list; the new location works with the "allocate an ec bucket
first, then the rest" logic.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>


bcachefs: don't bump key cache journal seq on nojournal commits · e53d03fe

Brian Foster authored Mar 02, 2023

fstest generic/388 occasionally reproduces corruptions where an
inode has extents beyond i_size. This is a deliberate crash and
recovery test, and the post crash+recovery characteristics are
usually the same: the inode exists on disk in an early (i.e. just
allocated) state based on the journal sequence number associated
with the inode. Subsequent inode updates exist in the journal at
higher sequence numbers, but the inode hadn't been written back
before the associated crash and the post-crash recovery processes a
set of journal sequence numbers that doesn't include updates to the
inode. In fact, the sequence with the most recent inode key update
always happens to be the sequence just before the front of the
journal processed by recovery.

This last bit is a significant hint that the problem relates to an
on-disk journal update of the front of the journal. The root cause
of this problem is basically that the inode is updated (multiple
times) in-core and in the key cache, each time bumping the key cache
sequence number used to control the cache flush. The cache flush
skips one or more times, bumping the associated key cache journal
pin to the key cache seq value. This has a side effect of holding
the inode in memory a bit longer than normal, which helps exacerbate
this problem, but is also unsafe in certain cases where the key
cache seq may have been updated by a transaction commit that didn't
journal the associated key.

For example, consider an inode that has been allocated, updated
several times in the key cache, journaled, but not yet written back.
At this stage, everything should be consistent if the fs happens to
crash because the latest update has been journaled. Now consider a key
update via bch2_extent_update_i_size_sectors() that uses the
BTREE_UPDATE_NOJOURNAL flag. While this update may not change inode
state, it can have the side effect of bumping ck->seq in
bch2_btree_insert_key_cached(). In turn, if a subsequent key cache
flush skips due to seq not matching the former, the ck->journal pin
is updated to ck->seq even though the most recent key update was not
journaled. If this pin happens to reside at the front (tail) of the
journal, this means a subsequent journal write can update last_seq
to a value beyond that which includes the most recent update to the
inode. If this occurs and the fs happens to crash before the inode
happens to flush, recovery will see the latest last_seq, fail to
recover the inode and leave the inode in the inconsistent state
described above.

To avoid this problem, skip the key cache seq update on NOJOURNAL
commits, except on initial pin add. Pass the insert entry directly
to bch2_btree_insert_key_cached() to make the associated flag
available and be consistent with btree_insert_key_leaf().
Signed-off-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
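
The rule in the last paragraph can be modeled in a few lines: the cached key's seq follows the commit's journal seq only when the key was actually journaled, or when no pin is active yet. This is a simplified userspace sketch; `struct toy_cached_key`, `toy_insert_key_cached()` and `TOY_UPDATE_NOJOURNAL` are hypothetical stand-ins for the kernel structures and flag named above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_UPDATE_NOJOURNAL	(1u << 0)	/* stand-in for the real flag */

/* Hypothetical cached key; only the fields the sketch needs. */
struct toy_cached_key {
	uint64_t seq;		/* seq a later flush would move the pin to */
	bool	 pin_active;	/* journal pin already added? */
	uint64_t pin_seq;	/* seq the pin currently holds */
};

static void toy_insert_key_cached(struct toy_cached_key *ck,
				  uint64_t journal_seq, unsigned flags)
{
	/*
	 * Only track seqs whose journal entry contains this key. The one
	 * exception is the initial pin add, where seq must start out
	 * consistent with the pin.
	 */
	if (!(flags & TOY_UPDATE_NOJOURNAL) || !ck->pin_active)
		ck->seq = journal_seq;

	if (!ck->pin_active) {
		ck->pin_active = true;
		ck->pin_seq = journal_seq;
	}
}
```

Because a NOJOURNAL commit no longer advances `seq` once the pin exists, a skipped flush cannot move the pin past the last journaled update of the key.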
