Commits · a8eefbd324cd40fab57ab8eef88347d4f745db93 · Kirill Smelkov / linux

22 Oct, 2023 40 commits

bcachefs: Add error path to btree_split() · a8eefbd3

Kent Overstreet authored Oct 01, 2022

The next patch in the series is (finally!) going to change btree splits
(and interior updates in general) to not take intent locks all the way
up to the root - instead only locking the nodes they'll need to modify.

However, this will be introducing a race since if we're not holding a
write lock on a btree node it can be written out by another thread, and
then we might not have enough space for a new bset entry.

We can handle this by retrying - we just need to introduce a new error
path.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Write new btree nodes after parent update · 8cbb0002

Kent Overstreet authored Oct 01, 2022

In order to avoid locking all btree nodes up to the root for btree node
splits, we're going to have to introduce a new error path into
bch2_btree_insert_node(); this means we can't have done any writes or
modified global state before that point.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Simplify break_cycle() · fe2de9a8

Kent Overstreet authored Oct 09, 2022

We'd like to prioritize aborting transactions that have done less work -
however, it appears breaking cycles by telling other threads to abort
may still be buggy, so disable that for now.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Print cycle on unrecoverable deadlock · 1148a97f

Kent Overstreet authored Oct 09, 2022

Some lock operations can't fail; a cycle of nofail locks is impossible
to recover from. So we want to get rid of these nofail locking
operations, but as this is tricky it'll be done incrementally.

If such a cycle happens, this patch prints out which codepaths are
involved so we know what to work on next.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Handle dropping pointers in data_update path · 1be88797

Kent Overstreet authored Oct 09, 2022

Cached pointers are generally dropped, not moved: this led to an
assertion firing in the data update path when there were no new replicas
being written.

This patch adds a data_options field for pointers to be dropped, and
tweaks move_extent() to check if we're only dropping pointers, not
writing new ones, before kicking off a data update operation.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Ratelimit ec error message · 160dff6d

Kent Overstreet authored Oct 09, 2022

We should fix the underlying problem, but for now ratelimiting the
message makes this more usable.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Use btree_type_has_ptrs() more consistently · 2da671dc

Kent Overstreet authored Oct 09, 2022

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix "multiple types of data in same bucket" with ec · 6c22eb70

Kent Overstreet authored Oct 09, 2022

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Ensure fsck error is printed before panic · 22f51621

Kent Overstreet authored Oct 09, 2022

When errors=panic, we want to make sure we print the error before
calling bch2_inconsistent_error().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix a deadlock in btree_update_nodes_written() · 8aaee94d

Kent Overstreet authored Oct 03, 2022

btree_node_lock_nopath() is prone to deadlocks if we are accidentally
holding other locks, because it doesn't mark the lock it's taking in a
path. We'll want to get rid of it in the future, but for now this patch
works around the problem by calling bch2_trans_unlock().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_trans_locked() · 13bc41a7

Kent Overstreet authored Oct 03, 2022

Useful debugging function.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve btree_deadlock debugfs output · 40a44873

Kent Overstreet authored Oct 02, 2022

This changes bch2_check_for_deadlock() to print the longest chains it
finds - when we have a deadlock because the cycle detector isn't finding
something, this will let us see what it's missing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't quash error in bch2_bucket_alloc_set_trans() · 943f9946

Kent Overstreet authored Oct 01, 2022

We were incorrectly returning -BCH_ERR_insufficient_devices when we'd
received a different error from bch2_bucket_alloc_trans(), which
(erroneously) turns into -EROFS further up the call chain.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
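
The pattern behind this fix can be sketched in userspace C. This is only an illustration, not the bcachefs code: the error codes, the alloc_fn type and every toy_* function are hypothetical names invented for the sketch. The point it shows is that only a "this device is full" result is folded into the generic insufficient-devices error, while any other error is passed up unchanged.

```c
#include <assert.h>

/* Hypothetical error codes, for this sketch only. */
enum {
	TOY_ERR_insufficient_devices	= 1,
	TOY_ERR_device_full		= 2,
	TOY_ERR_transaction_restart	= 3,
};

/* Stand-in for a per-device allocator: returns 0 or a negative error. */
typedef int (*alloc_fn)(unsigned dev);

static int toy_alloc_ok(unsigned dev)      { (void) dev; return 0; }
static int toy_alloc_full(unsigned dev)    { (void) dev; return -TOY_ERR_device_full; }
static int toy_alloc_restart(unsigned dev) { (void) dev; return -TOY_ERR_transaction_restart; }

/* Try to allocate from nr_wanted distinct devices out of nr_devs. */
static int toy_alloc_set(alloc_fn alloc, unsigned nr_devs, unsigned nr_wanted)
{
	unsigned got = 0;

	assert(alloc);

	for (unsigned dev = 0; dev < nr_devs && got < nr_wanted; dev++) {
		int ret = alloc(dev);

		if (!ret) {
			got++;
			continue;
		}

		/* A full device just means "try the next one"... */
		if (ret == -TOY_ERR_device_full)
			continue;

		/* ...but any other error is returned as-is, not quashed. */
		return ret;
	}

	return got >= nr_wanted ? 0 : -TOY_ERR_insufficient_devices;
}
```

With this shape, a restart reported by the inner allocator reaches the caller as a restart instead of being reported as a shortage of devices.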

bcachefs: Fix a trans path overflow in bch2_btree_delete_range_trans() · 685e0f0c

Kent Overstreet authored Sep 28, 2022

bch2_btree_delete_range_trans() was using btree_trans_too_many_iters()
to avoid path overflow, but this was buggy here (and also
btree_trans_too_many_iters() is suspect in general).

btree_trans_too_many_iters() only returns true when we're close to the
maximum number of paths - within 8 - but extent insert/delete assumes
that it can use more paths than that.

Instead, we need to call bch2_trans_begin() on every loop iteration.
Since we don't want to call bch2_trans_begin() (restarting the outer
transaction) if the call was a no-op - if we had no work to do - we have
to structure things a bit oddly.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bucket_alloc_state · ae10fe01

Kent Overstreet authored Nov 04, 2022

This refactoring puts our various allocation path counters into a
dedicated struct - the upcoming nocow patch is going to add another
counter.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix bch2_btree_path_up_until_good_node() · 29cea6f4

Kent Overstreet authored Sep 27, 2022

There was a rare bug when path->locks_want was nonzero, but not
BTREE_MAX_DEPTH, where we'd return on a valid node that wasn't locked -
oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Factor out bch2_write_drop_io_error_ptrs() · e0eaf862

Kent Overstreet authored Sep 27, 2022

Move slowpath code to a separate, non-inline function.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Break out bch2_btree_path_traverse_cached_slowpath() · 99e2146b

Kent Overstreet authored Sep 26, 2022

Prep work for further refactoring.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill io_in_flight semaphore · 2d848dac

Kent Overstreet authored Sep 26, 2022

This used to be needed more for buffered IO, but the block layer now has
writeback throttling, so we can delete it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve bucket_alloc tracepoint · 68b6cd19

Kent Overstreet authored Sep 26, 2022

It now includes more info - whether the bucket was for metadata or
data - and is now called in the same place as the bucket_alloc_fail
tracepoint.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Mark __bch2_trans_iter_init as inline · c298fd7d

Kent Overstreet authored Sep 26, 2022

This function is fairly small and only used in two places: one very hot,
the other cold, so it should definitely be inlined.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Inline fast path of check_pos_snapshot_overwritten() · 25b4b330

Kent Overstreet authored Sep 26, 2022

This moves the slowpath of check_pos_snapshot_overwritten() to a
separate function, and inlines the fast path - helping performance on
btrees that don't use snapshots and for users that aren't using
snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve jset_validate() · c23a9e08

Kent Overstreet authored Sep 26, 2022

Previously, jset_validate() was formatting the initial part of an error
string for every entry it validated - expensive.

This moves that code to journal_entry_err_msg(), which is now only
called if there's an actual error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
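
The idea of formatting the message only on failure can be sketched in userspace C. This is a toy, not jset_validate() itself: toy_entry_err_msg(), toy_validate_entries(), the "negative value" check and the nr_prefix_formats counter are all hypothetical, and exist only to make the cost visible.

```c
#include <assert.h>
#include <stdio.h>

/* Counts how often the (comparatively expensive) message gets built. */
static unsigned nr_prefix_formats;

/* Hypothetical error formatter: only reached once an entry has failed. */
static void toy_entry_err_msg(char *out, size_t size, unsigned idx,
			      const char *what)
{
	nr_prefix_formats++;
	snprintf(out, size, "entry %u: %s", idx, what);
}

/*
 * Validate a toy "journal" of integers; negative values are invalid.
 * Valid entries never touch the error buffer.
 */
static int toy_validate_entries(const int *entries, unsigned nr,
				char *err, size_t err_size)
{
	assert(entries && err);

	for (unsigned i = 0; i < nr; i++)
		if (entries[i] < 0) {
			toy_entry_err_msg(err, err_size, i, "negative value");
			return -1;
		}

	return 0;
}
```

In the all-valid case the loop does no string work at all, which is the saving the commit message describes.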

bcachefs: Optimize btree_path_alloc() · 3f3bc66e

Kent Overstreet authored Sep 26, 2022

 - move slowpath code to a separate function, btree_path_overflow()
 - no need to use hweight64
 - copy nr_max_paths from btree_transaction_stats to btree_trans,
   avoiding a data dependency in the fast path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
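
As a rough illustration of the first two points, here is a userspace sketch of a bitmap slot allocator with an out-of-line slow path. It assumes a 64-slot bitmap and GCC/Clang builtins; struct toy_trans and the toy_* functions are hypothetical and are not the bcachefs implementation. It shows one way to find a free slot without a population count: invert the bitmap and take the lowest set bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a transaction with up to 64 path slots. */
struct toy_trans {
	uint64_t paths_allocated;	/* bit i set: slot i is in use */
	unsigned nr_overflows;		/* bumped only on the slow path */
};

/* Slow path, kept out of line so the inlined fast path stays small. */
static __attribute__((noinline)) int toy_path_overflow(struct toy_trans *trans)
{
	trans->nr_overflows++;
	return -1;
}

/* Fast path: pick the lowest free slot straight from the inverted bitmap. */
static inline int toy_path_alloc(struct toy_trans *trans)
{
	uint64_t free_slots;
	int idx;

	assert(trans);
	free_slots = ~trans->paths_allocated;

	if (__builtin_expect(free_slots == 0, 0))
		return toy_path_overflow(trans);

	idx = __builtin_ctzll(free_slots);	/* index of lowest set bit */
	trans->paths_allocated |= UINT64_C(1) << idx;
	return idx;
}
```

The overflow handler is marked noinline so that the rarely taken branch does not bloat every call site of the fast path.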

bcachefs: Inline bch2_trans_kmalloc() fast path · 14d8f26a

Kent Overstreet authored Sep 26, 2022

Small performance optimization.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Run bch2_fs_counters_init() earlier · f3b8403e

Kent Overstreet authored Sep 25, 2022

We need counters to be initialized before initializing shrinkers - the
shrinker callbacks will update those counters. This fixes a segfault in
userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_err() now uses bch2_print_string_as_lines() · d704d623

Kent Overstreet authored Sep 25, 2022

We've seen long error messages get truncated here, so convert to the new
bch2_print_string_as_lines().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve bch2_fsck_err() · dbb9936b

Kent Overstreet authored Sep 25, 2022

 - factor out fsck_err_get()
 - if the "bcachefs (%s):" prefix has already been applied, don't
   duplicate it
 - convert to printbufs instead of static char arrays
 - tidy up control flow a bit
 - use bch2_print_string_as_lines(), to avoid messages getting truncated
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_print_string_as_lines() · a8f35428

Kent Overstreet authored Sep 25, 2022

This adds a helper for printing a large buffer one line at a time, to
avoid the 1k printk limit.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
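
A helper of this kind can be sketched in userspace C as follows. This is an assumption-laden illustration rather than the kernel function: toy_print_string_as_lines() and the emit callback are hypothetical, and toy_emit_stdout() stands in for printk().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hand 'buf' to 'emit' one line at a time, so that no single call sees
 * more than one line. Returns the number of lines emitted.
 */
static int toy_print_string_as_lines(const char *buf,
				     void (*emit)(const char *line, size_t len))
{
	int nr = 0;

	assert(buf && emit);

	while (*buf) {
		/* Length of the current line, up to a newline or the end. */
		size_t len = strcspn(buf, "\n");

		emit(buf, len);
		nr++;

		buf += len;
		if (*buf == '\n')
			buf++;
	}

	return nr;
}

/* Example sink: stands in for printk() in this userspace sketch. */
static void toy_emit_stdout(const char *line, size_t len)
{
	printf("%.*s\n", (int) len, line);
}
```

Because each call to the sink carries a single line, a long multi-line message is never cut off by a per-call length limit, as long as no individual line exceeds it.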

bcachefs: bch2_btree_node_relock_notrace() · e9174370

Kent Overstreet authored Sep 25, 2022

Most of the node_relock_fail trace events are generated from
bch2_btree_path_verify_level(), when debugcheck_iterators is enabled -
but we're not interested in these trace events, they don't indicate that
we're in a slowpath.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: bch2_btree_cache_scan() improvement · c36ff038

Kent Overstreet authored Sep 25, 2022

We're still seeing OOM issues caused by the btree node cache shrinker
not sufficiently freeing memory: thus, this patch changes the shrinker
to not exit if __GFP_FS was not supplied.

Instead, it tweaks btree node memory allocation so that we never invoke
memory reclaim while holding the btree node cache lock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix blocking with locks held · c6cf49a9

Kent Overstreet authored Sep 23, 2022

This is a major oopsy - we should always be unlocking before calling
closure_sync(), else we'll cause a deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: btree_update_nodes_written() needs BTREE_INSERT_USE_RESERVE · 01ed3359

Kent Overstreet authored Sep 23, 2022

This fixes an obvious deadlock - whoops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Fix error handling in bch2_btree_update_start() · d602657c

Kent Overstreet authored Sep 22, 2022

We were checking for -EAGAIN, but that isn't what we get back when we
didn't pass a closure to wait with - oops.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Improve bch2_btree_trans_to_text() · afbc7194

Kent Overstreet authored Sep 01, 2022

This is just a formatting/readability improvement.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Kill normalize_read_intent_locks() · 8b31e4fc

Kent Overstreet authored Aug 22, 2022

Before we had the deadlock cycle detector, we didn't want to be holding
read locks when taking intent locks, because blocking on an intent lock
while holding a read lock was a lock ordering violation that could
cause a deadlock.

With the cycle detector this is no longer an issue, so this code can be
deleted.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Ensure bch2_btree_node_lock_write_nofail() never fails · 2ec254c0

Kent Overstreet authored Mar 06, 2023

In order for bch2_btree_node_lock_write_nofail() to never produce a
deadlock, we must ensure we're never holding read locks when using it.
Fortunately, it's only used from code paths where any read locks may be
safely dropped.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Delete old deadlock avoidance code · 0d7009d7

Kent Overstreet authored Aug 22, 2022

This deletes our old lock ordering based deadlock avoidance code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Print deadlock cycle in debugfs · 96d994b3

Kent Overstreet authored Aug 22, 2022

In the event that we're not finished debugging the cycle detector, this
adds a new file to debugfs that shows what the cycle detector finds, if
anything. By comparing this with btree_transactions, which shows held
locks for every btree_transaction, we'll be able to determine if it's
the cycle detector that's buggy or something else.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Deadlock cycle detector · 33bd5d06

Kent Overstreet authored Aug 22, 2022

We've outgrown our own deadlock avoidance strategy.

The btree iterator API provides an interface where the user doesn't need
to concern themselves with lock ordering - different btree iterators can
be traversed in any order. Without special care, this will lead to
deadlocks.

Our previous strategy was to define a lock ordering internally, and
whenever we attempt to take a lock and trylock() fails, we'd check if
the current btree transaction is holding any locks that cause a lock
ordering violation. If so, we'd issue a transaction restart, and then
bch2_trans_begin() would re-traverse all previously used iterators, but
in the correct order.

That approach had some issues, though.
 - Sometimes we'd issue transaction restarts unnecessarily, when no
   deadlock would have actually occurred. Lock ordering restarts have
   become our primary cause of transaction restarts, on some workloads
   totalling 20% of actual transaction commits.

 - To avoid deadlock or livelock, we'd often have to take intent locks
   when we only wanted a read lock: with the lock ordering approach, it
   is actually illegal to hold _any_ read lock while blocking on an intent
   lock, and this has been causing us unnecessary lock contention.

 - It was getting fragile - the various lock ordering rules are not
   trivial, and we'd been seeing occasional livelock issues related to
   this machinery.

So, since bcachefs is already a relational database masquerading as a
filesystem, we're stealing the next traditional database technique and
switching to a cycle detector for avoiding deadlocks.

When we block taking a btree lock, after adding ourselves to the waitlist
but before sleeping, we do a DFS of btree transactions waiting on other
btree transactions, starting with the current transaction and walking
our held locks, and transactions blocking on our held locks.

If we find a cycle, we emit a transaction restart. Occasionally (e.g.
the btree split path) we cannot allow the lock() operation to fail, so
if necessary we'll tell another transaction that it has to fail.

Result: trans_restart_would_deadlock events are reduced by a factor of
10 to 100, and we'll be able to delete a whole bunch of grotty, fragile
code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
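
The search described above can be illustrated with a small userspace sketch. It assumes a much simpler model than the real code: a fixed number of transactions and a precomputed waits-for matrix, rather than walking held locks and their waitlists. struct toy_wait_graph, TOY_MAX_TRANS and the toy_* functions are hypothetical names for this sketch only.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_MAX_TRANS 8	/* hypothetical cap on concurrent transactions */

/* edge[a][b] means transaction a is blocked on a lock held by b. */
struct toy_wait_graph {
	bool edge[TOY_MAX_TRANS][TOY_MAX_TRANS];
};

/* Depth-first walk of the waits-for edges, looking for a way back to start. */
static bool toy_dfs(const struct toy_wait_graph *g, int cur, int start,
		    bool *visited)
{
	for (int next = 0; next < TOY_MAX_TRANS; next++) {
		if (!g->edge[cur][next])
			continue;
		if (next == start)
			return true;	/* walked back to ourselves: cycle */
		if (visited[next])
			continue;

		visited[next] = true;
		if (toy_dfs(g, next, start, visited))
			return true;
	}

	return false;
}

/* Returns true if blocking now would put 'start' on a deadlock cycle. */
static bool toy_would_deadlock(const struct toy_wait_graph *g, int start)
{
	bool visited[TOY_MAX_TRANS] = { false };

	assert(start >= 0 && start < TOY_MAX_TRANS);

	visited[start] = true;
	return toy_dfs(g, start, start, visited);
}
```

In this model, a caller would run the check after recording its own wait edge and before sleeping; on a true result it would restart its transaction instead of blocking.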
