Commits · 78c8fe20be12d0e4b6427d9149fd1eb9a69e2290 · Kirill Smelkov / linux

22 Oct, 2023 · 40 commits

bcachefs: Normal update/commit path now works before going RW · 78c8fe20

Kent Overstreet authored 3 years ago

This improves __bch2_trans_commit - early in the recovery process, when
we're running btree_gc and before we want to go RW, it now uses
bch2_journal_key_insert() to add the update to the list of updates for
journal replay to do, instead of btree_gc having to use separate
interfaces depending on whether we're running at bringup or, later,
runtime.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete some flag bits that are no longer used · 10b93677

Kent Overstreet authored 3 years ago

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Kill bch2_bkey_debugcheck · 2ce8fbd9

Kent Overstreet authored 3 years ago

The old .debugcheck methods are no more and this just calls the .invalid
method, which doesn't add much since we already check that when doing
btree updates and when reading metadata in.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch2_gc_gens() no longer uses bucket array · c45c8667

Kent Overstreet authored 3 years ago

Like the previous patches, this converts bch2_gc_gens() to use the alloc
btree directly, and private arrays of generation numbers for its own
recalculation of oldest_gen.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_gc no longer uses main in-memory bucket array · ec061b21

Kent Overstreet authored 3 years ago

This changes the btree_gc code to only use the second bucket array, the
one dedicated to GC. On completion, it compares what's in its in memory
bucket array to the allocation information in the btree and writes it
directly, instead of updating the main in-memory bucket array and
writing that.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_id_cached() · 7c8f6f98

Kent Overstreet authored 3 years ago

Add a new helper that returns true if the given btree ID uses the btree
key cache. This enables some new cleanups, since the helper can check
the options for whether caching is enabled on a given btree.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fix freeing in bch2_dev_buckets_resize() · 80bf2f34

Kent Overstreet authored 3 years ago

We were double-freeing old_buckets and not freeing old_buckets_gens:
also, the code was supposed to free buckets, not old_buckets;
old_buckets is only needed because we have to use rcu_assign_pointer()
instead of swap(), and won't be set if we hit the error path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: New data structure for buckets waiting on journal commit · 21aec962

Kent Overstreet authored 3 years ago

Implement a hash table, using cuckoo hashing, for empty buckets that are
waiting on a journal commit before they can be reused.

This replaces the journal_seq field of bucket_mark, and is part of
eventually getting rid of the in memory bucket array.

We may need to make bch2_bucket_needs_journal_commit() lockless, pending
profiling and testing.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
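
The idea of a cuckoo-hashed table of buckets waiting on a journal commit can be sketched in userspace C as follows. This is only an illustration under stated assumptions, not the code from the patch: the names `wait_table`, `wait_insert()` and `needs_journal_commit()`, the table size, the hash multipliers and the use of bucket number 0 as the "empty" marker are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HASH_BITS 6
#define NR_SLOTS  (1u << HASH_BITS)   /* slots per table (assumed size) */
#define MAX_KICKS 16                  /* evictions before giving up */

struct wait_entry {
	uint64_t bucket;       /* 0 marks an empty slot in this sketch */
	uint64_t journal_seq;  /* sequence that must be flushed before reuse */
};

struct wait_table {
	struct wait_entry t[2][NR_SLOTS];  /* two tables, one hash function each */
};

static size_t slot_of(int which, uint64_t bucket)
{
	static const uint64_t mult[2] = {
		0x9E3779B97F4A7C15ull, 0xC2B2AE3D27D4EB4Full,
	};
	/* Multiplicative hash: keep the top HASH_BITS bits. */
	return (size_t)((bucket * mult[which]) >> (64 - HASH_BITS));
}

/* A key can only ever live in one of two slots, so lookup is two probes. */
static struct wait_entry *wait_lookup(struct wait_table *w, uint64_t bucket)
{
	assert(bucket != 0);
	for (int which = 0; which < 2; which++) {
		struct wait_entry *s = &w->t[which][slot_of(which, bucket)];
		if (s->bucket == bucket)
			return s;
	}
	return NULL;
}

/*
 * Insert *e. On failure (kick limit reached) *e holds the entry that is
 * currently homeless; the caller would have to resize and reinsert it.
 */
static bool wait_insert(struct wait_table *w, struct wait_entry *e)
{
	struct wait_entry *found = wait_lookup(w, e->bucket);

	if (found) {
		if (e->journal_seq > found->journal_seq)
			found->journal_seq = e->journal_seq;
		return true;
	}
	for (int kick = 0; kick < MAX_KICKS; kick++) {
		int which = kick & 1;  /* alternate tables on every eviction */
		struct wait_entry *s = &w->t[which][slot_of(which, e->bucket)];
		struct wait_entry old = *s;

		*s = *e;
		if (old.bucket == 0)
			return true;
		*e = old;              /* evicted entry moves to the other table */
	}
	return false;
}

static bool needs_journal_commit(struct wait_table *w, uint64_t bucket,
				 uint64_t flushed_seq)
{
	struct wait_entry *e = wait_lookup(w, bucket);

	return e && e->journal_seq > flushed_seq;
}
```

Lookups touch at most two slots, which is what makes this structure attractive for a check that runs on the allocation path. The sketch takes no locks.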

Revert "bcachefs: Delete some obsolete journal_seq_blacklist code" · 9b6e2f1e

Kent Overstreet authored 3 years ago

This reverts commit f95b61228efd04c9c158123da5827c96e9773b29.

It turns out, we're seeing filesystems in the wild end up with
blacklisted btree node bsets - this should not be happening, and until
we understand why and fix it we need to keep this code around.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Log & error message improvements · 03ea3962

Kent Overstreet authored 3 years ago

 - Add a shim uuid_unparse_lower() in the kernel, since %pU doesn't work
   in userspace

 - We don't need to print the bcachefs: or the filesystem name prefix in
   userspace

 - Improve a few error messages
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add verbose log messages for journal read · 365f64f3

Kent Overstreet authored 3 years ago

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: bch_dev->dev · eacb2574

Kent Overstreet authored 3 years ago

Add a field to bch_dev for the dev_t of the underlying block device -
this fixes a null ptr deref in tracepoints.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Simplify journal replay · d8601afc

Kent Overstreet authored 3 years ago

With BTREE_ITER_WITH_JOURNAL, there are no longer any restrictions on the
order we have to replay keys from the journal in, and we can also start
up journal reclaim right away - and delete a bunch of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: BTREE_ITER_WITH_JOURNAL · 5222a460

Kent Overstreet authored 3 years ago

This adds a new btree iterator flag, BTREE_ITER_WITH_JOURNAL, that is
automatically enabled when initializing a btree iterator before journal
replay has completed - it overlays the contents of the journal with the
btree.

This lets us delete bch2_btree_and_journal_walk() and just use the
normal btree iterator interface instead - which also lets us delete a
significant amount of duplicated code.

Note that BTREE_ITER_WITH_JOURNAL is still unoptimized in this patch -
we're redoing the binary search over keys in the journal every time we
call bch2_btree_iter_peek().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
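
A rough sketch of what "overlaying the journal with the btree" means for a peek operation, assuming both sets of keys are available as sorted arrays. The `struct kv` type and the function names are hypothetical and much simpler than the real iterator code; the sketch also redoes a binary search over the journal keys on every call, mirroring the unoptimized behaviour the message describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct kv {
	uint64_t pos;  /* key position */
	int      val;  /* stand-in for the key's value */
};

/* Index of the first key with keys[i].pos >= pos, or nr if there is none. */
static size_t lower_bound(const struct kv *keys, size_t nr, uint64_t pos)
{
	size_t lo = 0, hi = nr;

	assert(keys || !nr);
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (keys[mid].pos < pos)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/*
 * Return the first key at or after pos, looking at both the btree and the
 * not-yet-replayed journal keys. On a tie the journal key wins, because it
 * is the newer version of that key.
 */
static const struct kv *peek_with_journal(const struct kv *btree, size_t nr_btree,
					  const struct kv *journal, size_t nr_journal,
					  uint64_t pos)
{
	size_t b = lower_bound(btree, nr_btree, pos);
	size_t j = lower_bound(journal, nr_journal, pos);
	const struct kv *bk = b < nr_btree ? &btree[b] : NULL;
	const struct kv *jk = j < nr_journal ? &journal[j] : NULL;

	if (!bk)
		return jk;
	if (!jk)
		return bk;
	return jk->pos <= bk->pos ? jk : bk;
}
```

Deletions recorded in the journal are not modelled here; a fuller version would have to skip over btree keys that a journal key deletes.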

bcachefs: Fix race between btree updates & journal replay · dfd41fb9

Kent Overstreet authored 3 years ago

Add a flag to indicate whether a journal replay key has been
overwritten, and set/test it with appropriate btree locks held.

This fixes a race between the allocator - invalidating buckets, and
doing btree updates - and journal replay, which before this patch could
clobber the allocator thread's update with an older version of the key
from the journal.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: New in-memory array for bucket gens · a7860877

Kent Overstreet authored 3 years ago

The main in-memory bucket array is going away, but we'll still need to
keep bucket generations in memory, at least for now - ptr_stale() needs
to be an efficient operation.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
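
To illustrate why a flat in-memory array of generation numbers keeps a staleness check cheap, here is a minimal sketch. It assumes 8-bit generations that wrap around and a pointer that records the generation it was created with; `struct extent_ptr_sketch`, `gen_after()` and `ptr_stale_sketch()` are hypothetical names, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A pointer into a bucket, remembering the bucket's generation at creation. */
struct extent_ptr_sketch {
	size_t  bucket;
	uint8_t gen;
};

/*
 * How many generations a is ahead of b, treating the 8-bit counters as
 * wrapping; 0 if a is not ahead. The signed 8-bit difference handles
 * wraparound as long as the two values are less than 128 apart.
 */
static int gen_after(uint8_t a, uint8_t b)
{
	int8_t d = (int8_t)(uint8_t)(a - b);

	return d > 0 ? d : 0;
}

/*
 * Nonzero if the bucket has been reused since the pointer was created.
 * The check is one array load and one subtraction, with no btree lookup.
 */
static int ptr_stale_sketch(const uint8_t *bucket_gens, size_t nr_buckets,
			    struct extent_ptr_sketch p)
{
	assert(p.bucket < nr_buckets);
	return gen_after(bucket_gens[p.bucket], p.gen);
}
```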

bcachefs: Put open_buckets in a hashtable · 9ddffaf8

Kent Overstreet authored 3 years ago

This is so that the copygc code doesn't have to refer to
bucket_mark.owned_by_allocator - assisting in getting rid of the in
memory bucket array.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Delete some obsolete journal_seq_blacklist code · 04f0f77d

Kent Overstreet authored 3 years ago

Since metadata version bcachefs_metadata_version_btree_ptr_sectors_written,
we haven't needed the journal seq blacklist mechanism for ignoring
blacklisted btree node writes - we now only need it for ignoring journal
entries that were written after the newest flush journal entry, and then
we only need to keep those blacklist entries around until journal replay
is finished.

That means we can delete the code for scanning btree nodes to GC
journal_seq_blacklist entries.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't start allocator threads too early · c64740ef

Kent Overstreet authored 3 years ago

If the allocator threads start before journal replay has finished
replaying alloc keys, journal replay might overwrite the allocator's
btree updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Rewrite bch2_bucket_alloc_new_fs() · 09943313

Kent Overstreet authored 3 years ago

This changes bch2_bucket_alloc_new_fs() to a simple bump allocator that
doesn't need to use the in memory bucket array, part of a larger patch
series to entirely get rid of the in memory bucket array, except for
gc/fsck.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
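
For readers unfamiliar with the term, a bump allocator only keeps a cursor and hands out the next unused item. A minimal sketch, assuming the usable buckets form one contiguous range; `struct bump_alloc` and `bump_alloc_bucket()` are hypothetical names, and the real function has more to deal with (for example buckets that are already reserved).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct bump_alloc {
	uint64_t next;  /* next bucket to hand out */
	uint64_t end;   /* one past the last usable bucket */
};

/*
 * Hand out buckets in increasing order. There is no free list and no
 * per-bucket state to consult: allocation is a compare and an increment.
 * Returns the bucket number, or -ENOSPC when the range is exhausted.
 */
static int64_t bump_alloc_bucket(struct bump_alloc *a)
{
	assert(a->next <= a->end);
	if (a->next == a->end)
		return -ENOSPC;
	return (int64_t)a->next++;
}
```

Nothing is ever freed in this scheme, which is acceptable for a brand-new filesystem where every bucket past the cursor is known to be unused.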

bcachefs: Turn encoded_extent_max into a regular option · e4099990

Kent Overstreet authored 3 years ago

It'll now be handled at format time and in sysfs like other options - it
still can only be set at format time, though.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Option improvements · 8244f320

Kent Overstreet authored 3 years ago

This adds flags for options that must be a power of two (block size and
btree node size), and options that are stored in the superblock as a
power of two (encoded extent max).

Also: options are now stored in memory in the same units they're
displayed in (bytes): we now convert when getting and setting from the
superblock.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
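
The two kinds of flag described above could look roughly like this. It is a sketch under assumptions: the flag names, `struct opt_desc` and the helper functions are invented for illustration, and options without the log2 flag are stored unchanged here, which need not match the real superblock layout.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum opt_flags_sketch {
	OPT_SKETCH_MUST_BE_POW2 = 1 << 0,  /* value itself must be a power of two */
	OPT_SKETCH_SB_ILOG2     = 1 << 1,  /* superblock stores log2 of the value */
};

struct opt_desc {
	const char *name;
	unsigned    flags;
};

static bool is_pow2(uint64_t v)
{
	return v && !(v & (v - 1));
}

static unsigned ilog2_u64(uint64_t v)
{
	unsigned bits = 0;

	assert(v);
	while (v >>= 1)
		bits++;
	return bits;
}

/* Validate a value given in bytes, the unit used in memory and for display. */
static int opt_validate(const struct opt_desc *o, uint64_t bytes)
{
	if ((o->flags & OPT_SKETCH_MUST_BE_POW2) && !is_pow2(bytes))
		return -EINVAL;
	return 0;
}

/* Convert only at the superblock boundary, in both directions. */
static uint64_t opt_to_sb(const struct opt_desc *o, uint64_t bytes)
{
	return (o->flags & OPT_SKETCH_SB_ILOG2) ? ilog2_u64(bytes) : bytes;
}

static uint64_t opt_from_sb(const struct opt_desc *o, uint64_t sb_val)
{
	if (o->flags & OPT_SKETCH_SB_ILOG2) {
		assert(sb_val < 64);
		return 1ull << sb_val;
	}
	return sb_val;
}
```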

bcachefs: Fix some shutdown path bugs · 99fafb04

Kent Overstreet authored 3 years ago

This fixes some bugs when we hit an error very early in the filesystem
startup path, before most things have been initialized.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add more time_stats · 991ba021

Kent Overstreet authored 3 years ago

This adds more latency/event measurements and breaks some apart into
more events. Journal writes are broken apart into flush writes and
noflush writes, btree compactions are broken out from btree splits,
btree mergers are added, as well as btree_interior_updates - foreground
and total.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out struct gc_stripe from struct stripe · 990d42d1

Kent Overstreet authored 3 years ago

We have two radix trees of stripes - one that mirrors some information
from the stripes btree in normal operation, and another that GC uses to
recalculate block usage counts.

The normal one is now only used for finding partially empty stripes in
order to reuse them - the normal stripes radix tree and the GC stripes
radix tree are used significantly differently, so this patch splits them
into separate types.

In an upcoming patch we'll be replacing c->stripes with a btree that
indexes stripes by the order we want to reuse them.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Convert bucket_alloc_ret to negative error codes · fc6c01e2

Kent Overstreet authored 3 years ago

Start a new header, errcode.h, for bcachefs-private error codes - more
error codes will be converted later.

This patch just converts bucket_alloc_ret so that they can be mixed with
standard error codes and passed as ERR_PTR errors - the ec.c code was
doing this already, but incorrectly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
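
Why private codes have to be negative and live in the errno range before they can travel as ERR_PTR values is easier to see in a small userspace sketch. The `err_ptr()`, `ptr_err()` and `is_err()` helpers below reimplement the idea behind the kernel's ERR_PTR/PTR_ERR/IS_ERR for illustration only; the enum names and the starting value 2048 are assumptions, not taken from errcode.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095  /* error pointers occupy the top 4095 addresses */

/* Private codes: above the standard errnos, still within MAX_ERRNO. */
enum {
	SKETCH_OPEN_BUCKETS_EMPTY = 2048,
	SKETCH_FREELIST_EMPTY,
};

static void *err_ptr(long err)
{
	/* A positive code would look like an ordinary low address. */
	assert(err < 0 && err >= -MAX_ERRNO);
	return (void *)(intptr_t)err;
}

static long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

struct bucket_sketch {
	int nr;
};

/* Returns a bucket from the pool, or an error pointer with a private code. */
static struct bucket_sketch *alloc_bucket(struct bucket_sketch *pool, size_t *nr_free)
{
	if (!*nr_free)
		return err_ptr(-SKETCH_FREELIST_EMPTY);
	return &pool[--*nr_free];
}
```

Because the private codes are negative and do not collide with standard errnos, a caller can handle `-ENOMEM` and `-SKETCH_FREELIST_EMPTY` through the same return path.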

bcachefs: Disk space accounting fix on brand-new fs · c714614b

Kent Overstreet authored 3 years ago

The filesystem initialization path first marks superblock and journal
buckets non transactionally, since the btree isn't functional yet. That
path was updating the per-journal-buf percpu counters via
bch2_dev_usage_update(), and updating the wrong set of counters so those
updates didn't get written out until journal entry 4.

The relevant code is going to get significantly rewritten in the future
as we transition away from the in memory bucket array, so this just
hacks around it for now.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Also log device name in userspace · 0a84a066

Kent Overstreet authored 3 years ago

Change log messages in userspace to be closer to what they are in kernel
space, and include the device name - it's also useful in userspace.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add BCH_SUBVOLUME_UNLINKED · 2027875b

Kent Overstreet authored 3 years ago

Snapshot deletion needs to become a multi step process, where we unlink,
then tear down the page cache, then delete the subvolume - the deleting
flag is equivalent to an inode with i_nlink = 0.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Subvolumes, snapshots · 14b393ee

Kent Overstreet authored 3 years ago

This patch adds subvolume.c - support for the subvolumes and snapshots
btrees and related data types and on disk data structures. The next
patches will start hooking up this new code to existing code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: btree_path · 67e0dd8f

Kent Overstreet authored 3 years ago

This splits btree_iter into two components: btree_iter is now the
externally visible component, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
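
The sharing scheme described above is a form of copy-on-write on a reference-counted object. A minimal userspace sketch, assuming a path is just a position plus a reference count; `struct path_sketch`, `struct iter_sketch` and the helper names are hypothetical, and locking and btree node state are left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

struct path_sketch {
	unsigned ref;  /* number of iterators pointing at this path */
	uint64_t pos;
};

struct iter_sketch {
	struct path_sketch *path;
};

static int iter_init(struct iter_sketch *it, uint64_t pos)
{
	it->path = malloc(sizeof(*it->path));
	if (!it->path)
		return -ENOMEM;
	it->path->ref = 1;
	it->path->pos = pos;
	return 0;
}

/* A second iterator at the same position shares the path: no copy up front. */
static void iter_share(struct iter_sketch *dst, const struct iter_sketch *src)
{
	dst->path = src->path;
	dst->path->ref++;
}

static void iter_exit(struct iter_sketch *it)
{
	assert(it->path->ref > 0);
	if (!--it->path->ref)
		free(it->path);
	it->path = NULL;
}

/* Mutating through a shared path clones it first, so other users are unaffected. */
static int iter_set_pos(struct iter_sketch *it, uint64_t pos)
{
	if (it->path->ref > 1) {
		struct path_sketch *clone = malloc(sizeof(*clone));

		if (!clone)
			return -ENOMEM;
		*clone = *it->path;
		clone->ref = 1;
		it->path->ref--;
		it->path = clone;
	}
	it->path->pos = pos;
	return 0;
}
```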

bcachefs: add progress stats to sysfs · 8dd6ed94

Brett Holman authored 3 years ago

This adds progress stats to sysfs for copygc, rebalance, recovery, and the
cmd_job ioctls.
Signed-off-by: Brett Holman <bholman.devel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Update btree ptrs after every write · 9f1833ca

Kent Overstreet authored 3 years ago

This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have led to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.

But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Don't loop into topology repair · d976a84e

Kent Overstreet authored 3 years ago

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Fsck for reflink refcounts · 890b74f0

Kent Overstreet authored 3 years ago

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Split out btree_error_wq · 9f2772c4

Kent Overstreet authored 3 years ago

We can't use btree_update_wq because btree updates may be waiting on
btree writes to complete.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: Don't use uuid in tracepoints · ddc7dd62

Kent Overstreet authored 3 years ago

%pU for printing out pointers to uuids doesn't work in perf trace.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a workqueue for btree io completions · 731bdd2e

Kent Overstreet authored 3 years ago

Also, clean up workqueue usage - we shouldn't be using system
workqueues, pretty much everything we do needs to be on our own
WQ_MEM_RECLAIM workqueues.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Add a debug mode that always reads from every btree replica · 1ce0cf5f

Kent Overstreet authored 3 years ago

There's a new module parameter, verify_all_btree_replicas, that enables
reading from every btree replica when reading in btree nodes and
comparing them against each other. We've been seeing some strange btree
corruption - this will hopefully aid in tracking it down and catching it
more often.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

bcachefs: Ratelimiting for writeback IOs · ef1b2092

Kent Overstreet authored 3 years ago

Writeback throttling is a kernel config option and not always enabled.
When it's not enabled we need a fallback, to avoid unbounded memory
pinning and work item backlogs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
