- 22 Oct, 2023 40 commits
-
Kent Overstreet authored
Most counters aren't in units of sectors, and the ones that are should just be switched to bytes, for simplicity. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We're not supposed to return our private error codes to userspace. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
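For context, a minimal sketch of the pattern this change enforces, assuming bch2_err_class() as the existing helper that maps private -BCH_ERR_* codes back to standard errnos; the surrounding function and the internal helper are hypothetical:

    static long bch2_example_ioctl(struct bch_fs *c, void __user *arg)
    {
            /* internal helpers may return private, more specific -BCH_ERR_* codes: */
            int ret = bch2_example_internal_op(c, arg);

            /* at the syscall boundary, map back to a plain errno for userspace: */
            return bch2_err_class(ret);
    }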
-
Kent Overstreet authored
We don't store backpointers in alloc keys anymore, since we gained the btree write buffer. This patch drops support for backpointers in alloc keys, and revs the on disk format version so that we know a fsck is required. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Brian Foster authored
If we block on journal reservation attempting to log journal messages during recovery, particularly for the first message(s) before we start doing actual work, chances are the filesystem ends up deadlocked. Allow logged messages to use reserved journal space to mitigate this problem. In the worst case where no space is available whatsoever, this at least allows the fs to recognize that the journal is stuck and fail the mount gracefully. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Seeing occasional test failures where we get stuck in a livelock that involves this event - this will help track it down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
It turns out that it's currently impossible to invalidate buckets containing only cached data if they're part of a stripe. The normal bucket invalidate path can't do it because we have to be able to increment the bucket's gen, which isn't correct because it's still a member of the stripe - and the bucket invalidate path makes the bucket available for reuse right away, which also isn't correct for buckets in stripes. What would work is invalidating cached data by following backpointers, except that cached replicas don't currently get backpointers - because they would be awkward for the existing bucket invalidate path to delete and they haven't been needed elsewhere. So for the time being, to prevent running out of space in stripes, switch the data update path to not leave cached replicas; we may revisit this in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Previously, copygc used a fifo for tracking buckets in flight - this had the disadvantage of being fixed size, since we pass references to elements into the move code. This restructures it to be a hash table and linked list, since with erasure coding we need to be able to pipeline across an arbitrary number of buckets. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
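A rough sketch of the new shape (struct and field names are illustrative, not the exact bcachefs definitions):

    struct copygc_bucket_in_flight {
            struct hlist_node       hash;   /* lookup: is this bucket already being evacuated? */
            struct list_head        list;   /* order buckets were started in, for waiting */
            struct bpos             bucket;
            atomic_t                count;  /* outstanding move ops holding a ref to this entry */
    };

Because entries are individually allocated and linked rather than stored in a fixed-size array, the move code can keep stable references to them while an arbitrary number of buckets are in flight.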
-
Kent Overstreet authored
This adds a flags param to bch2_backpointer_get_key() so that we can pass BTREE_ITER_INTENT, since ec_stripe_update_extent() is updating the extent immediately. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This doesn't need to be in bcachefs.h. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This fixes an off by one error, due to confusing closed vs. half open intervals. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This could return a transaction restart; we need to check for that. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
It appears freespace init can still take a while, and we've had a report or two of it getting stuck - let's have it print out where it's at every 10 seconds. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
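The progress output follows the usual periodic-print pattern; a sketch, with the surrounding loop and variable names made up for illustration:

    unsigned long last_print = jiffies;
    u64 bucket;

    for (bucket = ca->mi.first_bucket; bucket < ca->mi.nbuckets; bucket++) {
            if (time_after(jiffies, last_print + 10 * HZ)) {
                    bch_info(c, "%s: currently at %llu/%llu",
                             __func__, bucket, ca->mi.nbuckets);
                    last_print = jiffies;
            }

            /* ... initialize freespace for this bucket ... */
    }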
-
Kent Overstreet authored
As in recovery and device add, we have to check whether devices have the freespace btree initialized - this was missed in the device hot add path. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This just adds a line for how long copygc has been waiting to sysfs copygc_wait, helpful for debugging why copygc isn't running. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
bch2_path_put_nokeep() is sketchy, and we should consider removing it: it unconditionally frees btree_paths once their ref hits 0. The assumption is that we only use it for paths that have never been visible outside the core btree code; i.e. higher level code will never be making assumptions about locking based on these paths. However, there's subtle brokenness with this approach: - If we call bch2_path_put(), then bch2_path_put_nokeep(), bch2_path_put() may free the first path on the assumption that we have another path keeping a node locked - but then bch2_path_put_nokeep() just unconditionally frees it. The same bug may arise if we're calling bch2_path_put() and bch2_path_put_nokeep() on the same (refcounted) path, or two adjacent paths that point to the same btree node. This patch hacks around one of these bugs by calling bch2_path_put_nokeep() first in bch2_trans_iter_exit. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Brian Foster authored
The journal stuck check in bch2_journal_space_available() is particularly aggressive and can lead to premature shutdown in some rare cases. This is difficult to reproduce, but since it comes with a fatal error it's worth being cautious about. For example, we've seen instances where the journal is under heavy reservation pressure, the journal allocation path transitions into the final available journal bucket, the journal write path immediately consumes that bucket and calls into bch2_journal_space_available(), which then in turn flags the journal as stuck because there is no available space and shuts down the filesystem instead of submitting the journal write (that would have otherwise succeeded). To avoid this problem, simplify the journal stuck checking by just relying on the higher level logic in the journal reservation path. This produces more useful debug output and is a more reliable indicator that things have bogged down. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Brian Foster authored
bcachefs checks for journal stuck conditions both in the journal space calculation code and the journal reservation slow path. The logic in both places is rather tricky and can result in non-deterministic failure characteristics and debug output. In preparation to condense journal stuck handling to a single place, refactor the __journal_res_get() logic into a standalone helper. Since multiple callers into the reservation code can result in duplicate reports, use the ->err_seq field as a serialization mechanism for the debug dump. Finally, add some comments to help explain the logic and hopefully facilitate further improvements in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Brian Foster authored
bcachefs detects journal stuck conditions in a couple different places. If the logic in the journal reservation slow path happens to detect the problem, I've seen instances where the filesystem remains deadlocked even though it has been shut down. This is occasionally reproduced by generic/333, and usually manifests as one or more tasks stuck in the journal reservation slow path. To help avoid this problem, repeat the journal error check in __journal_res_get() once under spinlock to cover the case where the previous lock holder might have triggered shutdown. This also helps avoid spurious/duplicate stuck reports. Also, wake the journal from the halt code to make sure blocked callers of the journal res slowpath have a chance to wake up and observe the pending error. This survives an overnight looping run of generic/333 without the aforementioned lockups. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
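The relevant part of the fix is the classic recheck-under-lock pattern; a simplified sketch, assuming bch2_journal_error() as the existing journal-error predicate and with the error handling details omitted:

    spin_lock(&j->lock);

    /*
     * Recheck now that we hold j->lock: the previous lock holder may have
     * hit an error and shut the journal down while we were waiting.
     */
    if (bch2_journal_error(j)) {
            spin_unlock(&j->lock);
            return -EROFS;
    }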
-
Brian Foster authored
The btree write buffer flush code is prone to causing journal deadlock due to inefficient use and release of reservation space. Reservation is not pre-reserved for write buffered keys (as is done for key cache keys, for example), because the write buffer flush side uses a fast path that attempts insertion without need for any reservation at all. The write buffer flush attempts to deal with this by inserting keys using the BTREE_INSERT_JOURNAL_RECLAIM flag to return an error on journal reservations that require blocking. Upon first error, it falls back to a slow path that inserts in journal order and supports moving the associated journal pin forward. The problem is that under pathological conditions (i.e. smaller log, larger write buffer and journal reservation pressure), we've seen instances where the fast path fails fairly quickly without having completed many insertions, and then the slow path is unable to push the journal pin forward enough to free up the space it needs to completely flush the buffer. This problem is occasionally reproduced by fstest generic/333. To avoid this problem, update the fast path algorithm to skip key inserts that fail due to inability to acquire needed journal reservation without immediately breaking out of the loop. Instead, insert as many keys as possible, zap the sequence numbers to mark them as processed, and then fall back to the slow path to process the remaining set in journal order. This reduces the amount of journal reservation that might be required to flush the entire buffer and increases the odds that the slow path is able to move the journal pin forward and free up space as keys are processed. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
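A sketch of the revised fast-path loop; the helper names, key struct, and error code are illustrative stand-ins, but the structure matches the description above:

    int fast_path_errors = 0, ret = 0;

    for (struct wb_key *k = keys; k < keys + nr; k++) {
            /* non-blocking insert: fails rather than waiting on journal space */
            ret = wb_flush_one_nonblocking(trans, k);
            if (ret == -EAGAIN) {
                    fast_path_errors++;
                    continue;               /* leave this key for the slow path */
            }
            if (ret)
                    break;

            k->journal_seq = 0;             /* zap the seq: mark key as processed */
    }

    /* second pass, in journal order, only over keys still carrying a seq: */
    if (!ret && fast_path_errors)
            ret = wb_flush_slow_path(trans, keys, nr);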
-
Brian Foster authored
A workqueue resource deadlock has been observed when running fsck on a filesystem with a full/stuck journal. fsck is not currently able to repair the fs due to fairly rapid emergency shutdown, but rather than exit gracefully the fsck process hangs during the shutdown sequence. Fortunately this is easily recoverable from userspace, but the root cause involves code shared between the kernel and userspace and so should be addressed. The deadlock scenario involves the main task in the bch2_fs_stop() -> bch2_fs_read_only() path waiting on write references to drain with the fs state lock held. A bch2_read_only_work() workqueue task is scheduled on the system_long_wq, blocked on the state lock. Finally, various other write ref holding workqueue tasks are scheduled to run on the same workqueue and must complete in order to release references that the initial task is waiting on. To avoid this problem, we can split the dependent workqueue tasks across different workqueues. It's a bit of a waste to create a dedicated wq for the read-only worker, but there are several tasks throughout the fs that follow the pattern of acquiring a write reference and then scheduling to the system wq. Use a local wq for such tasks to break the subtle dependency between these and the read-only worker. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
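Mechanically, this amounts to a dedicated workqueue for those tasks; a sketch, with the bch_fs field and the queued work item invented for illustration:

    /* at fs init: a private workqueue for write-ref-holding deferred work */
    c->write_ref_wq = alloc_workqueue("bcachefs_write_ref", WQ_FREEZABLE, 0);
    if (!c->write_ref_wq)
            return -ENOMEM;

    /*
     * Work items that take a filesystem write ref are queued here instead
     * of on the system workqueues, so they can't end up behind (and starve)
     * the read-only worker that is waiting for their write refs to drain.
     */
    queue_work(c->write_ref_wq, &some_deferred_work);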
-
Brian Foster authored
Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We were going into an infinite loop when printing out backpointers, due to never incrementing bp_offset - whoops. Also limit the number of backpointers we print to 10; this is debug code and we only need to print a sample, not all of them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This should help with excessive 'would deadlock' transaction restarts. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We shouldn't be printing out fsck errors for expected errors - this helps make test logs more readable, and makes it easier to see what the actual failure was. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This is a bit awkward: we're passing around a btree_trans, but we're not in a context where transaction restarts are handled - we should try to come up with a better way to denote situations like this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
With regular waitlists, we need to ensure we always call finish_wait(). With closures, the equivalent is that we need to call closure_sync() before returning with a stack-allocated closure. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
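In code form, the rule being applied; a simplified sketch where start_async_op() stands in for whatever takes a ref on the closure:

    struct closure cl;
    int ret;

    closure_init_stack(&cl);

    ret = start_async_op(&cl);      /* may take refs on cl and complete later */

    /*
     * Like finish_wait() for a stack wait_queue_entry: even on the error
     * path, wait for all refs to drop before cl goes out of scope.
     */
    closure_sync(&cl);

    return ret;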
-
Kent Overstreet authored
The nocow write error path was iterating over pointers in an extent, after we'd dropped btree locks - oops. Fortunately we'd already stashed what we need in nocow_lock_bucket, so use that instead. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
When we allocate disk space, we need to be incrementing the WRITE io clock, which perhaps should be renamed to sectors allocated - copygc uses this io clock to know when to run. Also, we should be incrementing the same clock when allocating btree nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This fixes a bug in bch2_evict_subvolume_inodes(): d_mark_dontcache() doesn't handle the case where i_count is already 0, we need to grab and put the inode in order for it to be dropped. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
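A minimal sketch of the idea (the real bch2_evict_subvolume_inodes() also has to cope with the superblock inode list locking):

    d_mark_dontcache(inode);

    /*
     * If i_count was already 0, nothing else will drop this inode; take a
     * reference ourselves and put it so the final iput() evicts it.
     * igrab() returns NULL only if the inode is already being freed.
     */
    if (igrab(inode))
            iput(inode);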
-
Kent Overstreet authored
Pure style fixes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Subvolumes, including their root inodes, get deleted asynchronously after an unlink. But we still need to ensure that we tell the VFS the inode has been deleted, otherwise VFS writeback could fire after asynchronous deletion has finished, and try to write to an inode/subvolume that no longer exists. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Transaction hooks aren't supposed to run unless we know the transaction is going to commit successfully: this fixes a bug with attempting to delete a subvolume multiple times. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
We may end up in a situation where allocating the buffer for the sorted journal_keys fails - but it would likely succeed, post compaction where we drop duplicates. We've had reports of this allocation failing, so this adds a slowpath to do the compaction incrementally. This is only a band-aid fix; we need to look at limiting the number of keys in the journal based on the amount of system RAM. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
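The core of the fallback is dropping duplicates as the array is filled rather than afterwards; a sketch of that dedup step, assuming keys arrive sorted by position with newer versions later (the comparator name is illustrative):

    size_t dst = 0;

    for (size_t src = 0; src < nr; src++) {
            /* same btree/position as the previous key: the newer one wins */
            if (dst && !journal_key_cmp(&keys[dst - 1], &keys[src]))
                    dst--;
            keys[dst++] = keys[src];
    }
    nr = dst;       /* compacted count - what we actually need room for */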
-
Kent Overstreet authored
We now print the pos where the backpointer was found in the btree, as well as the exact bucket:bucket_offset of the data, to aid in grepping through logs. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This implements a new shutdown path for erasure coding, which is needed for the upcoming BCH_WRITE_WAIT_FOR_EC write path. The process is: - Cancel new stripes being built up - Close out/cancel open buckets on write points or the partial list that are for stripes - Shutdown rebalance/copygc - Then wait for in flight new stripes to finish With BCH_WRITE_WAIT_FOR_EC, move ops will be waiting on stripes to fill up before they complete; the new ec shutdown path is needed for shutting down copygc/rebalance without deadlocking. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
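In outline, the ordering looks roughly like the sketch below; the function names are illustrative, not the exact bcachefs symbols:

    static void ec_shutdown_sketch(struct bch_fs *c)
    {
            cancel_pending_stripes(c);      /* 1: stop stripes being built up */
            release_stripe_open_buckets(c); /* 2: close/cancel stripe open buckets */
            stop_copygc_and_rebalance(c);   /* 3: no new moves that would wait on stripes */
            wait_for_stripes_in_flight(c);  /* 4: now safe to wait for completion */
    }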
-
Kent Overstreet authored
This also adds bch2_write_op_to_text(): now we can see outstanding moves, useful for debugging shutdown with the upcoming BCH_WRITE_WAIT_FOR_EC and likely for other things in the future. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
This adds private error codes for most (but not all) of our ENOMEM uses, which makes it easier to track down assorted allocation failures. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-
Kent Overstreet authored
In rare cases, bch2_check_extents_to_backpointers() would incorrectly flag an extent as having a missing backpointer when we just needed to flush the btree write buffer - we weren't tracking the last flushed position correctly. This adds a level field to the last_flushed pos, fixing a bug where we'd sometimes fail on a new root node. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
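Illustrative shape of the fix (struct and field names are not the exact bcachefs definitions; cmp_int() and bpos_cmp() are assumed to be the existing comparison helpers):

    struct last_flushed_pos {
            unsigned        level;  /* new: so a fresh root/interior node isn't */
            struct bpos     pos;    /* mistaken for an already-checked position  */
    };

    static inline int last_flushed_cmp(struct last_flushed_pos l,
                                       struct last_flushed_pos r)
    {
            return cmp_int(l.level, r.level) ?: bpos_cmp(l.pos, r.pos);
    }

The missing-backpointer check then compares against both fields before deciding whether another write buffer flush is still needed.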
-