• Brian Foster's avatar
    bcachefs: fix accounting corruption race between reclaim and dev add · 3c434cdf
    Brian Foster authored
    When a device is removed from a bcachefs volume, the associated
    content is removed from the various btrees. The alloc tree uses the
    key cache, so when keys are removed the deletes exist in cache for a
    period of time until reclaim comes along and flushes outstanding
    updates.
    
    When a device is re-added to the bcachefs volume, the add process
    re-adds some of these previously deleted keys. When marking device
    superblock locations on device add, the keys will likely refer to
    some of the same alloc keys that were just removed. The memory
    triggers for these key updates are responsible for further updates,
    such as bch2_mark_alloc() calling into bch2_dev_usage_update() to
    update per-device usage accounting.
    
    When a new key is added to key cache, the trans update path also
    flushes the key to the backing btree for coherency reasons for tree
    walks.
    
    With all of this context, if a device is removed and re-added
    quickly enough such that some key deletes from the remove are still
    pending a key cache flush, the trans update path can view this as
    addition of a new key because the old key in the insert entry refers
    to a deleted key. However the deleted cached key has not been filled
    by absence of a btree key, but rather refers to an explicit deletion
    of an existing key that occurred during device removal.
    
    The trans update path adds a new update to flush the key and tags
    the original (cached) update to skip running the memory triggers.
    This results in running triggers on the non-cached update instead,
    which in turn will perform accounting updates based on incoherent
    values. For example, bch2_dev_usage_update() subtracts the the old
    alloc key dirty sector count in the non-cached btree key from the
    newly initialized (i.e. zeroed) per device counters, leading to
    underflow and accounting corruption.
    
    There are at least a few ways to avoid this problem, the simplest of
    which may be to run triggers against the cached update rather than
    the non-cached update. If the key only needs to be flushed when the
    key is not present in the tree, however, then this still performs an
    unnecessary update. We could potentially use the cached key dirty
    state to determine whether the delete is a dirty, cached update vs.
    a clean cache fill, but this may require transmitting key cache
    dirty state across layers, which adds complexity and seems to be of
    limited value. Instead, update flush_new_cached_update() to handle
    this by simply checking for the key in the btree and only perform
    the flush when a backing key is not present.
    Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
    Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
    3c434cdf
btree_update_leaf.c 51.3 KB