bcachefs: fix accounting corruption race between reclaim and dev add

When a device is removed from a bcachefs volume, the associated content is removed from the various btrees. The alloc tree uses the key cache, so when keys are removed the deletes exist in cache for a period of time until reclaim comes along and flushes outstanding updates. When a device is re-added to the bcachefs volume, the add process re-adds some of these previously deleted keys. When marking device superblock locations on device add, the keys will likely refer to some of the same alloc keys that were just removed. The memory triggers for these key updates are responsible for further updates, such as bch2_mark_alloc() calling into bch2_dev_usage_update() to update per-device usage accounting. When a new key is added to key cache, the trans update path also flushes the key to the backing btree for coherency reasons for tree walks. With all of this context, if a device is removed and re-added quickly enough such that some key deletes from the remove are still pending a key cache flush, the trans update path can view this as addition of a new key because the old key in the insert entry refers to a deleted key. However the deleted cached key has not been filled by absence of a btree key, but rather refers to an explicit deletion of an existing key that occurred during device removal. The trans update path adds a new update to flush the key and tags the original (cached) update to skip running the memory triggers. This results in running triggers on the non-cached update instead, which in turn will perform accounting updates based on incoherent values. For example, bch2_dev_usage_update() subtracts the the old alloc key dirty sector count in the non-cached btree key from the newly initialized (i.e. zeroed) per device counters, leading to underflow and accounting corruption. There are at least a few ways to avoid this problem, the simplest of which may be to run triggers against the cached update rather than the non-cached update. If the key only needs to be flushed when the key is not present in the tree, however, then this still performs an unnecessary update. We could potentially use the cached key dirty state to determine whether the delete is a dirty, cached update vs. a clean cache fill, but this may require transmitting key cache dirty state across layers, which adds complexity and seems to be of limited value. Instead, update flush_new_cached_update() to handle this by simply checking for the key in the btree and only perform the flush when a backing key is not present. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

bcachefs: fix accounting corruption race between reclaim and dev add
When a device is removed from a bcachefs volume, the associated content is removed from the various btrees. The alloc tree uses the key cache, so when keys are removed the deletes exist in cache for a period of time until reclaim comes along and flushes outstanding updates. When a device is re-added to the bcachefs volume, the add process re-adds some of these previously deleted keys. When marking device superblock locations on device add, the keys will likely refer to some of the same alloc keys that were just removed. The memory triggers for these key updates are responsible for further updates, such as bch2_mark_alloc() calling into bch2_dev_usage_update() to update per-device usage accounting. When a new key is added to key cache, the trans update path also flushes the key to the backing btree for coherency reasons for tree walks. With all of this context, if a device is removed and re-added quickly enough such that some key deletes from the remove are still pending a key cache flush, the trans update path can view this as addition of a new key because the old key in the insert entry refers to a deleted key. However the deleted cached key has not been filled by absence of a btree key, but rather refers to an explicit deletion of an existing key that occurred during device removal. The trans update path adds a new update to flush the key and tags the original (cached) update to skip running the memory triggers. This results in running triggers on the non-cached update instead, which in turn will perform accounting updates based on incoherent values. For example, bch2_dev_usage_update() subtracts the the old alloc key dirty sector count in the non-cached btree key from the newly initialized (i.e. zeroed) per device counters, leading to underflow and accounting corruption. There are at least a few ways to avoid this problem, the simplest of which may be to run triggers against the cached update rather than the non-cached update. If the key only needs to be flushed when the key is not present in the tree, however, then this still performs an unnecessary update. We could potentially use the cached key dirty state to determine whether the delete is a dirty, cached update vs. a clean cache fill, but this may require transmitting key cache dirty state across layers, which adds complexity and seems to be of limited value. Instead, update flush_new_cached_update() to handle this by simply checking for the key in the btree and only perform the flush when a backing key is not present. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
3c434cdf · Brian Foster · Kent Overstreet · 958c347b · 3c434cdf
Commit 3c434cdf authored May 01, 2023 by Brian Foster Committed by Kent Overstreet Oct 22, 2023
Hide whitespace changes
Inline Side-by-side

Showing with 16 additions and 6 deletions

fs/bcachefs/btree_update_leaf.c fs/bcachefs/btree_update_leaf.c +16 -6

No files found.
--- a/fs/bcachefs/btree_update_leaf.c
+++ b/fs/bcachefs/btree_update_leaf.c
@@ -1501,21 +1501,31 @@ static noinline int flush_new_cached_update(struct btree_trans *trans,
 					    unsigned long ip)
 {
 	struct btree_path *btree_path;
+	struct bkey k;
 	int ret;

-	i->key_cache_already_flushed = true;
-	i->flags |= BTREE_TRIGGER_NORUN;
-
 	btree_path = bch2_path_get(trans, path->btree_id, path->pos, 1, 0,
 				   BTREE_ITER_INTENT, _THIS_IP_);
-
 	ret = bch2_btree_path_traverse(trans, btree_path, 0);
 	if (ret)
-		goto err;
+		goto out;
+
+	/*
+	 * The old key in the insert entry might actually refer to an existing
+	 * key in the btree that has been deleted from cache and not yet
+	 * flushed. Check for this and skip the flush so we don't run triggers
+	 * against a stale key.
+	 */
+	bch2_btree_path_peek_slot_exact(btree_path, &k);
+	if (!bkey_deleted(&k))
+		goto out;
+
+	i->key_cache_already_flushed = true;
+	i->flags |= BTREE_TRIGGER_NORUN;

 	btree_path_set_should_be_locked(btree_path);
 	ret = bch2_trans_update_by_path_trace(trans, btree_path, i->k, flags, ip);
-err:
+out:
 	bch2_path_put(trans, btree_path, true);
 	return ret;
 }