Commits · 330405057fae56f2f21055851a5a816626679226 · Kirill Smelkov / linux

09 Sep, 2024 35 commits

bcachefs: bch2_seek_hole() -> for_each_btree_key_in_subvolume_upto · 33040505
Kent Overstreet authored Jul 17, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
33040505
bcachefs: bch2_seek_data() -> for_each_btree_key_in_subvolume_upto · 9f9e7f50
Kent Overstreet authored Jul 17, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
9f9e7f50
bcachefs: bch2_xattr_list() -> for_each_btree_key_in_subvolume_upto · 3da106cd
Kent Overstreet authored Jul 17, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
3da106cd
bcachefs: bch2_readdir() -> for_each_btree_key_in_subvolume_upto · efdb77a2
Kent Overstreet authored Jul 17, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
efdb77a2

bcachefs: for_each_btree_key_in_subvolume_upto() · 0215b918

Kent Overstreet authored Jul 17, 2024

New helper for looping over keys in a given subvolume
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

0215b918

bcachefs: bch2_fiemap(): call trans_begin() on every loop iter · 1a3158ec
Kent Overstreet authored Jul 17, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
1a3158ec

bcachefs: bchfs_read(): call trans_begin() on every loop iter · 7e759572

Kent Overstreet authored Jul 17, 2024

Same as the recent change for __bch2_read(); also, kill now unnecessary
btree_trans_too_many_iters() calls.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

7e759572

bcachefs: kill bch2_btree_iter_peek_and_restart() · 804baca7
Kent Overstreet authored Jul 17, 2024
```
dead code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
804baca7

bcachefs: Btree path tracepoints · 32ed4a62

Kent Overstreet authored Aug 10, 2022

Fastpath tracepoints, rarely needed, only enabled with
CONFIG_BCACHEFS_PATH_TRACEPOINTS.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

32ed4a62

bcachefs: Add check for btree_path ref overflow · abbfc4db
Kent Overstreet authored Jul 16, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
abbfc4db

bcachefs: Mark bch_inode_info as SLAB_ACCOUNT · 094c6a9f

Youling Tang authored Jul 03, 2024

After commit 230e9fc2 ("slab: add SLAB_ACCOUNT flag"), we need to mark
the inode cache as SLAB_ACCOUNT, similar to commit 5d097056 ("kmemcg:
account for certain kmem allocations to memcg")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

094c6a9f

bcachefs: allocate inode by using alloc_inode_sb() · 082330c3

Youling Tang authored Jul 16, 2024

The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() to alloc_inode_sb().

It will also fix [1] to avoid the NULL pointer dereference BUG in
list_lru_add() when CONFIG_MEMCG is enabled.

Links:
[1]: https://lore.kernel.org/all/20589721-46c0-4344-b2ef-6ab48bbe2ea5@linux.dev/
[2]: https://lore.kernel.org/all/7db60e36-9c96-4938-a28d-a9745e287386@linux.dev/

Fixes: 86d81ec5 ("bcachefs: Mark bch_inode_info as SLAB_ACCOUNT")
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

082330c3

bcachefs: Opt_durability can now be set via bch2_opt_set_sb() · 9092a38a
Kent Overstreet authored Jul 15, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
9092a38a
bcachefs: bch2_opt_set_sb() can now set (some) device options · 4aedeac5
Kent Overstreet authored Jul 15, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
4aedeac5

bcachefs: data_allowed is now an opts.h option · afefc986

Kent Overstreet authored Jul 15, 2024

need this so cmd_option in userspace can handle it
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

afefc986

bcachefs: Annotate struct bucket_array with __counted_by() · 8573dd34

Thorsten Blum authored Aug 21, 2024

Add the __counted_by compiler attribute to the flexible array member
bucket to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.
Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8573dd34
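
As an illustration of the annotation described in the commit above, here is a minimal userspace sketch. The `demo_bucket` and `demo_bucket_array` types and the allocation helper are hypothetical stand-ins, not the bcachefs definitions, and the fallback macro simply makes the attribute a no-op on compilers that lack `counted_by`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fallback: expand to nothing when the compiler lacks the attribute. */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define __counted_by(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef __counted_by
# define __counted_by(member)
#endif

/* Hypothetical element type, for illustration only. */
struct demo_bucket {
	unsigned char gen;
};

/* Hypothetical container: nbuckets is the element count of bucket[]. */
struct demo_bucket_array {
	size_t nbuckets;
	struct demo_bucket bucket[] __counted_by(nbuckets);
};

/* Allocate the header plus nr elements; set the counter before any access. */
static struct demo_bucket_array *demo_bucket_array_alloc(size_t nr)
{
	struct demo_bucket_array *ba;

	assert(nr > 0);
	ba = calloc(1, sizeof(*ba) + nr * sizeof(ba->bucket[0]));
	if (ba)
		ba->nbuckets = nr;
	return ba;
}
```

The counter is assigned immediately after allocation because bounds checks derived from the annotation read `nbuckets` whenever `bucket[]` is indexed.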

bcachefs: Fix format specifier in bch2_btree_key_cache_to_text() · 5396e5af

Nathan Chancellor authored Aug 21, 2024

When building for a 32-bit architecture, for which 'size_t' is
'unsigned int', there is a compiler warning due to use of '%lu':

  In file included from fs/bcachefs/vstructs.h:5,
                   from fs/bcachefs/bcachefs_format.h:80,
                   from fs/bcachefs/bcachefs.h:207,
                   from fs/bcachefs/btree_key_cache.c:3:
  fs/bcachefs/btree_key_cache.c: In function 'bch2_btree_key_cache_to_text':
  fs/bcachefs/btree_key_cache.c:795:25: error: format '%lu' expects argument of type 'long unsigned int', but argument 3 has type 'size_t' {aka 'unsigned int'} [-Werror=format=]
    795 |         prt_printf(out, "pending:\t%lu\r\n",            per_cpu_sum(bc->nr_pending));
        |                         ^~~~~~~~~~~~~~~~~~~
  fs/bcachefs/util.h:78:63: note: in definition of macro 'prt_printf'
     78 | #define prt_printf(_out, ...)           bch2_prt_printf(_out, __VA_ARGS__)
        |                                                               ^~~~~~~~~~~
  fs/bcachefs/btree_key_cache.c:795:38: note: format string is defined here
    795 |         prt_printf(out, "pending:\t%lu\r\n",            per_cpu_sum(bc->nr_pending));
        |                                    ~~^
        |                                      |
        |                                      long unsigned int
        |                                    %u
  cc1: all warnings being treated as errors

Use the proper specifier, '%zu', to resolve the warning.

Fixes: e447e49977b8 ("bcachefs: key cache can now allocate from pending")
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5396e5af
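
The specifier fix in the commit above can be sketched in userspace as follows. `format_pending()` is a hypothetical helper that uses `snprintf()` in place of the kernel's `prt_printf()`; only the format string mirrors the one quoted in the warning.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * '%zu' matches size_t on both 32-bit and 64-bit targets, whereas '%lu'
 * is only correct where size_t happens to be unsigned long.
 */
static int format_pending(char *buf, size_t len, size_t nr_pending)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "pending:\t%zu\r\n", nr_pending);
}
```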

bcachefs: key cache can now allocate from pending · 5f1929f1

Kent Overstreet authored Jun 13, 2024

btree_trans objects can hold the btree_trans_barrier srcu read lock for
an extended amount of time (they shouldn't, but it's difficult to
guarantee).

the srcu barrier blocks memory reclaim, so to avoid too many stranded
key cache items, this uses the new pending_rcu_items to allocate from
pending items - like we did before, but now without a global lock on the
key cache.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5f1929f1

bcachefs: Rip out freelists from btree key cache · f2bfe7e8
Kent Overstreet authored Jun 08, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
f2bfe7e8
bcachefs: rcu_pending now works in userspace · d2ed0f20
Kent Overstreet authored Aug 23, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
d2ed0f20

bcachefs: rcu_pending · 8e973a4f

Kent Overstreet authored Jun 10, 2024

Generic data structure for explicitly tracking pending RCU items,
allowing items to be dequeued (i.e. allocate from items pending
freeing). Works with conventional RCU and SRCU, and possibly other RCU
flavors in the future, meaning this can serve as a more generic
replacement for SLAB_TYPESAFE_BY_RCU.

Pending items are tracked in radix trees; if memory allocation fails, we
fall back to linked lists.

A rcu_pending is initialized with a callback, which is invoked when
pending items' grace periods have expired. Two types of callback
processing are handled specially:

- RCU_PENDING_KVFREE_FN

  New backend for kvfree_rcu(). Slightly faster, and eliminates the
  synchronize_rcu() slowpath in kvfree_rcu_mightsleep() - instead, an
  rcu_head is allocated if we don't have one and can't use the radix
  tree

  TODO:
  - add a shrinker (as in the existing kvfree_rcu implementation) so that
    memory reclaim can free expired objects if callback processing isn't
    keeping up, and to expedite a grace period if we're under memory
    pressure and too much memory is stranded by RCU

  - add a counter for amount of memory pending

- RCU_PENDING_CALL_RCU_FN

  Accelerated backend for call_rcu() - pending callbacks are tracked in
  a radix tree to eliminate linked list overhead.

These are intended to serve as replacement backends for kvfree_rcu() and
call_rcu(); they may also be of interest to other users (e.g.
SLAB_TYPESAFE_BY_RCU users).

Note:

Internally, we're using a single rearming call_rcu() callback for
notifications from the core RCU subsystem when objects are ready to be
processed.

Ideally we would be getting a callback every time a grace period
completes for which we have objects, but that would require multiple
rcu_heads in flight, and since the number of gp sequence numbers with
uncompleted callbacks is not bounded, we can't do that yet.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

8e973a4f

lib/generic-radix-tree.c: add preallocation · b3f9da79
Kent Overstreet authored Aug 10, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
b3f9da79
lib/generic-radix-tree.c: genradix_ptr_inlined() · f6594633
Kent Overstreet authored Jun 17, 2024
```
Provide an inlined fast path
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
f6594633

bcachefs: Fix deadlock in __wait_on_freeing_inode() · 54f77024

Kent Overstreet authored Aug 16, 2024

We can't call __wait_on_freeing_inode() with btree locks held; we're
waiting on another thread that's in evict(), and before it clears that
bit it needs to write that inode to flush timestamps - deadlock.

Fixing this involves a fair amount of re-jiggering to plumb a new
transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

54f77024

bcachefs: switch to rhashtable for vfs inodes hash · 112d21fd

Kent Overstreet authored Jun 08, 2024

the standard vfs inode hash table suffers from painful lock contention -
this is long overdue
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

112d21fd

inode: make __iget() a static inline · 88d2ae0e

Kent Overstreet authored Aug 08, 2024

bcachefs is switching to an rhashtable for vfs inodes instead of the
standard inode.c hashtable, so we need this exported, or - a static
inline makes more sense for a single atomic_inc().
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

88d2ae0e

bcachefs: Replace div_u64 with div64_u64 where second param is u64 · 27663d77

Reed Riley authored Sep 05, 2024

Bcachefs often uses this function to divide by nanosecond times - which
can easily cause problems when cast to u32.  For example, `cat
/sys/fs/bcachefs/*/internal/rebalance_status` would return invalid data
in the `duration waited` field because dividing by the number of
nanoseconds in a minute requires the divisor parameter to be u64.
Signed-off-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

27663d77

bcachefs: Fix sysfs rebalance duration waited formatting · 36f0af4f

Feiko Nanninga authored Sep 01, 2024

cat /sys/fs/bcachefs/*/internal/rebalance_status
waiting
  io wait duration:  13.5 GiB
  io wait remaining: 627 MiB
  duration waited:   1392 m

duration waited was increasing at a rate of about 14 times the expected
rate.

div_u64 takes a u32 divisor, but u->nsecs (from time_units[]) can be
bigger than u32.
Signed-off-by: Feiko Nanninga <feiko.nanninga@fnanninga.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

36f0af4f
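
The truncation described in the two commits above can be reproduced in userspace. The helpers below are stand-ins that only mirror the parameter widths of the kernel's `div_u64()` (32-bit divisor) and `div64_u64()` (64-bit divisor); the `_sketch` names and the `minutes_waited_*` wrappers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds in a minute: 60e9, which does not fit in 32 bits. */
static const uint64_t nsec_per_minute = 60000000000ULL;

/* Same parameter widths as div_u64(): the divisor is only 32 bits wide. */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/* Same parameter widths as div64_u64(): full 64-bit divisor. */
static uint64_t div64_u64_sketch(uint64_t dividend, uint64_t divisor)
{
	assert(divisor != 0);
	return dividend / divisor;
}

/* Buggy: the 64-bit divisor is silently truncated to 4165425152. */
static uint64_t minutes_waited_buggy(uint64_t nsecs)
{
	return div_u64_sketch(nsecs, nsec_per_minute);
}

/* Fixed: the divisor keeps all 64 bits. */
static uint64_t minutes_waited_fixed(uint64_t nsecs)
{
	return div64_u64_sketch(nsecs, nsec_per_minute);
}
```

For 100 minutes of waiting (6e12 ns), the fixed helper returns 100 while the truncating one returns 1440, a factor of about 14.4, consistent with the "about 14 times" rate reported above.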

bcachefs: Fix negative timespecs · a3ed1cc4

Alyssa Ross authored Sep 07, 2024

This fixes two problems in the handling of negative times:

 • rem is signed, but the rem * c->sb.nsec_per_time_unit operation
   produced a bogus unsigned result, because s32 * u32 = u32.

 • The timespec was not normalized (it could contain more than a
   billion nanoseconds).

For example, { .tv_sec = -14245441, .tv_nsec = 750000000 }, after
being round tripped through timespec_to_bch2_time and then
bch2_time_to_timespec would come back as
{ .tv_sec = -14245440, .tv_nsec = 4044967296 } (more than 4 billion
nanoseconds).

Cc: stable@vger.kernel.org
Fixes: 595c1e9b ("bcachefs: Fix time handling")
Closes: https://github.com/koverstreet/bcachefs/issues/743
Co-developed-by: Erin Shepherd <erin.shepherd@e43.eu>
Signed-off-by: Erin Shepherd <erin.shepherd@e43.eu>
Co-developed-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Ryan Lahfa <ryan@lahfa.xyz>
Signed-off-by: Alyssa Ross <hi@alyssa.is>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

a3ed1cc4
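
The round-trip example in the commit above can be reproduced with a small userspace sketch. It assumes one time unit per nanosecond (so `nsec_per_unit` is 1 and `units_per_sec` is 1000000000) and ignores the superblock time base; the struct and function names are hypothetical and this is not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

struct ts_sketch {
	int64_t sec;
	int64_t nsec;
};

/* Buggy conversion: int32_t * uint32_t is evaluated as unsigned 32-bit. */
static struct ts_sketch time_to_ts_buggy(int64_t time, int32_t units_per_sec,
					 uint32_t nsec_per_unit)
{
	assert(units_per_sec > 0);
	int64_t sec = time / units_per_sec;              /* truncates toward zero */
	int32_t rem = (int32_t)(time % units_per_sec);   /* negative for negative times */
	struct ts_sketch t = { sec, rem * nsec_per_unit };
	return t;
}

/* Fixed conversion: widen before multiplying, then normalize. */
static struct ts_sketch time_to_ts_fixed(int64_t time, int32_t units_per_sec,
					 uint32_t nsec_per_unit)
{
	assert(units_per_sec > 0);
	int64_t sec = time / units_per_sec;
	int32_t rem = (int32_t)(time % units_per_sec);
	int64_t nsec = (int64_t)rem * nsec_per_unit;     /* signed 64-bit product */

	/* Bring nsec into [0, 1e9) by moving whole seconds into sec. */
	while (nsec >= 1000000000) {
		sec++;
		nsec -= 1000000000;
	}
	while (nsec < 0) {
		sec--;
		nsec += 1000000000;
	}

	struct ts_sketch t = { sec, nsec };
	return t;
}
```

With the encoded time -14245440250000000 (that is, -14245441 s plus 750000000 ns), the buggy version yields { -14245440, 4044967296 } as in the commit message, and the fixed version yields { -14245441, 750000000 }.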

bcachefs: Don't delete open files in online fsck · 16005147

Kent Overstreet authored Sep 08, 2024

If a file is unlinked but still open, we don't want online fsck to
delete it - or fun inconsistencies will happen.

https://github.com/koverstreet/bcachefs/issues/727
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

16005147

bcachefs: fix btree_key_cache sysfs knob · 2c377d8a
Kent Overstreet authored Sep 05, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
2c377d8a
bcachefs: More BCH_SB_MEMBER_INVALID support · 52df04f0
Kent Overstreet authored Sep 04, 2024
```
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
```
52df04f0

bcachefs: Simplify bch2_bkey_drop_ptrs() · df88febc

Kent Overstreet authored Sep 04, 2024

bch2_bkey_drop_ptrs() had some complicated machinery for avoiding
O(n^2) when dropping multiple pointers - but when n is only going to be
~4, it's not worth it.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

df88febc
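
To illustrate the trade-off in the commit above, here is a generic sketch of the simple approach: each dropped entry shifts the tail down, which is quadratic in the worst case but cheap when the list holds only a handful of entries. The array of device indices and the function name are hypothetical; this is not the bcachefs implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Drop every entry equal to dev from a small array, shifting the tail
 * down after each removal. Returns the new number of entries.
 */
static size_t drop_ptrs_to_dev(unsigned *devs, size_t nr, unsigned dev)
{
	size_t i = 0;

	assert(devs != NULL || nr == 0);

	while (i < nr) {
		if (devs[i] == dev) {
			/* Close the gap; do not advance, the next entry moved into slot i. */
			memmove(&devs[i], &devs[i + 1],
				(nr - i - 1) * sizeof(devs[0]));
			nr--;
		} else {
			i++;
		}
	}
	return nr;
}
```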

bcachefs: Add a cond_resched() to __journal_keys_sort() · ec36573d

Kent Overstreet authored Sep 05, 2024

Without this, we'd potentially sort multiple times without a
cond_resched(), leading to hung task warnings on larger systems.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

ec36573d

bcachefs: Fix ca->io_ref usage · 5a6e43af

Kent Overstreet authored Sep 04, 2024

ca->io_ref does not protect against the filesystem going away,
c->write_ref does. Much like

0b50b731 bcachefs: Fix refcounting in discard path

the other async paths need fixing.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

5a6e43af

04 Sep, 2024 1 commit

bcachefs: BCH_SB_MEMBER_INVALID · 53f66195

Kent Overstreet authored Sep 01, 2024

Create a sentinel value for "invalid device".

This is needed for removing devices that have stripes on them (force
removing, without evacuating); we need a sentinel value for the stripe
pointers to the device being removed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

53f66195

01 Sep, 2024 1 commit

bcachefs: fix rebalance accounting · 7f12a963

Kent Overstreet authored Sep 01, 2024

Fixes: 49aa7830 ("bcachefs: Fix rebalance_work accounting")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

7f12a963

31 Aug, 2024 2 commits

bcachefs: Mark more errors as autofix · 3d3020c4

Kent Overstreet authored Aug 22, 2024

errors that are known to always be safe to fix should be autofix: this
should be most errors even at this point, but that will need some
thorough review.

note that errors are still logged in the superblock, so we'll still know
that they happened.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

3d3020c4

bcachefs: Revert lockless buffered IO path · e3e69409

Kent Overstreet authored Aug 31, 2024

We had a report of data corruption on nixos when building installer
images.

https://github.com/NixOS/nixpkgs/pull/321055#issuecomment-2184131334

It seems that writes are being dropped, but only when issued by QEMU,
and possibly only in snapshot mode. It's undetermined whether write
calls or dirty folios are being dropped.

Further testing, via minimizing the original patch to just the change
that skips the inode lock on non appends/truncates, reveals that it
really is just not taking the inode lock that causes the corruption: it
has nothing to do with the other logic changes for preserving write
atomicity in corner cases.

It's also kernel config dependent: it doesn't reproduce with the minimal
kernel config that ktest uses, but it does reproduce with nixos's distro
config. Bisecting the kernel config initially pointed the finger at page
migration or compaction, but it appears that was erroneous; we haven't
yet determined what kernel config option actually triggers it.

Sadly it appears this will have to be reverted since we're getting too
close to release and my plate is full, but we'd _really_ like to fully
debug it.

My suspicion is that this patch is exposing a preexisting bug - the
inode lock actually covers very little in IO paths, and we have a
different lock (the pagecache add lock) that guards against races with
truncate here.

Fixes: 7e64c86c ("bcachefs: Buffered write path now can avoid the inode lock")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

e3e69409

27 Aug, 2024 1 commit

bcachefs: Fix bch2_extents_match() false positive · d2693569

Kent Overstreet authored Aug 26, 2024

This was caught as a very rare nonce inconsistency, on systems with
encryption and replication (and tiering, or some form of rebalance
operation running):

[Wed Jul 17 13:30:03 2024] about to insert invalid key in data update path
[Wed Jul 17 13:30:03 2024] old: u64s 10 type extent 671283510:6392:U32_MAX len 16 ver 106595503: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:104 gen 7 ptr: 4:513244:48 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] k:   u64s 10 type extent 671283510:6400:U32_MAX len 16 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 ptr: 4:513244:56 gen 6 rebalance: target hdd compression zstd
[Wed Jul 17 13:30:03 2024] new: u64s 14 type extent 671283510:6392:U32_MAX len 8 ver 106595508: durability: 2 crc: c_size 8 size 16 offset 0 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 3:355968:112 gen 7 cached ptr: 4:513244:56 gen 6 cached rebalance: target hdd compression zstd crc: c_size 8 size 16 offset 8 nonce 0 csum chacha20_poly1305_80 compress zstd ptr: 1:10860085:32 gen 0 ptr: 0:17285918:408 gen 0
[Wed Jul 17 13:30:03 2024] bcachefs (cca5bc65-fe77-409d-a9fa-465a6e7f4eae): fatal error - emergency read only

bch2_extents_match() was reporting true for extents that did not
actually point to the same data.

bch2_extents_match() iterates over pairs of pointers, looking for
pointers that point to the same location on disk (with matching
generation numbers). However one or both extents may have been trimmed
(or merged) and they might not have the same disk offset: it corrects
for this by subtracting the key offset and the checksum entry offset.

However, this failed when an extent was immediately partially
overwritten, and the new overwrite was allocated the next adjacent disk
space.

Normally, with compression off, this would never cause a bug, since the
new extent would have to be immediately after the old extent for the
pointer offsets to match, and the rebalance index update path is not
looking for an extent outside the range of the extent it moved.

However with compression enabled, extents take up less space on disk
than they do in the btree index space - and spuriously matching after
partial overwrite is possible.

To fix this, add a secondary check, that strictly checks that the
regions pointed to on disk overlap.

https://github.com/koverstreet/bcachefs/issues/717
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

d2693569
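
The secondary check described in the commit above amounts to an interval-overlap test on the on-disk regions. The sketch below is a generic version under stated assumptions: `struct disk_range` and `disk_ranges_overlap()` are hypothetical, each region is modelled as a device index plus a half-open sector range, and this is not the code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of where an extent's data lives on disk. */
struct disk_range {
	uint32_t dev;      /* device index */
	uint64_t start;    /* first sector */
	uint64_t sectors;  /* on-disk (possibly compressed) size in sectors */
};

/* True only if both ranges are on the same device and share a sector. */
static bool disk_ranges_overlap(struct disk_range a, struct disk_range b)
{
	assert(a.sectors > 0 && b.sectors > 0);
	return a.dev == b.dev &&
	       a.start < b.start + b.sectors &&
	       b.start < a.start + a.sectors;
}
```

Two regions that are merely adjacent, such as sectors [104, 112) and [112, 120) on the same device, do not overlap under this test, which is the case the offset-based comparison alone could not rule out.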