• Brian Foster's avatar
    bcachefs: fix race between journal entry close and pin set · 3e55189b
    Brian Foster authored
    bcachefs freeze testing via fstests generic/390 occasionally
    reproduces the following BUG from bch2_fs_read_only():
    
      BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty));
    
    This indicates that one or more dirty key cache keys still exist
    after the attempt to flush and quiesce the fs. The sequence that
    leads to this problem actually occurs on unfreeze (ro->rw), and
    looks something like the following:
    
    - Task A begins a transaction commit and acquires journal_res for
      the current seq. This transaction intends to perform key cache
      insertion.
    - Task B begins a bch2_journal_flush() via bch2_sync_fs(). This ends
      up in journal_entry_want_write(), which closes the current journal
      entry and drops the reference to the pin list created on entry open.
      The pin put pops the front of the journal via fast reclaim since the
      reference count has dropped to 0.
    - Task A attempts to set the journal pin for the associated cached
      key, but bch2_journal_pin_set() skips the pin insert because the
      seq of the transaction reservation is behind the front of the pin
      list fifo.
    
    The end result is that the pin associated with the cached key is not
    added, which prevents a subsequent reclaim from processing the key
    and thus leaves it dangling at freeze time. The fundamental cause of
    this problem is that the front of the journal is allowed to pop
    before a transaction with outstanding reservation on the associated
    journal seq is able to add a pin. The count for the pin list
    associated with the seq drops to zero and is prematurely reclaimed
    as a result.
    
    The logical fix for this problem lies in how the journal buffer is
    managed in similar scenarios where the entry might have been closed
    before a transaction with outstanding reservations happens to be
    committed.
    
    When a journal entry is opened, the current sequence number is
    bumped, the associated pin list is initialized with a reference
    count of 1, and the journal buffer reference count is bumped (via
    journal_state_inc()). When a journal reservation is acquired, the
    reservation also acquires a reference on the associated buffer. If
    the journal entry is closed in the meantime, it drops both the pin
    and buffer references held by the open entry, but the buffer still
    has references held by outstanding reservation. After the associated
    transaction commits, the reservation release drops the associated
    buffer references and the buffer is written out once the reference
    count has dropped to zero.
    
    The fundamental problem here is that the lifecycle of the pin list
    reference held by an open journal entry is too short to cover the
    processing of transactions with outstanding reservations. The
    simplest way to address this is to expand the pin list reference to
    the lifecycle of the buffer vs. the shorter lifecycle of the open
    journal entry. This ensures the pin list for a seq with outstanding
    reservation cannot be popped and reclaimed before all outstanding
    reservations have been released, even if the associated journal
    entry has been closed for further reservations.
    
    Move the pin put from journal entry close to where final processing
    of the journal buffer occurs. Create a duplicate helper to cover the
    case where the caller doesn't already hold the journal lock. This
    allows generic/390 to pass reliably.
    Signed-off-by: default avatarBrian Foster <bfoster@redhat.com>
    Signed-off-by: default avatarKent Overstreet <kent.overstreet@linux.dev>
    3e55189b
journal.h 16 KB