• Filipe Manana's avatar
    btrfs: join running log transaction when logging new name · 723df2bc
    Filipe Manana authored
    When logging a new name, in case of a rename, we pin the log before
    changing it. We then either delete a directory entry from the log or
    insert a key range item to mark the old name for deletion on log replay.
    
    However when doing one of those log changes we may have another task that
    started writing out the log (at btrfs_sync_log()) and it started before
    we pinned the log root. So we may end up changing a log tree while its
    writeback is being started by another task syncing the log. This can lead
    to inconsistencies in a log tree and other unexpected results during log
    replay, because we can get some committed node pointing to a node/leaf
    that ends up not getting written to disk before the next log commit.
    
    The problem, conceptually, started to happen in commit 88d2beec
    ("btrfs: avoid logging all directory changes during renames"), because
    there we started to update the log without joining its current transaction
    first.
    
    However the problem only became visible with commit 259c4b96
    ("btrfs: stop doing unnecessary log updates during a rename"), and that is
    because we used to pin the log at btrfs_rename() and then before entering
    btrfs_log_new_name(), when unlinking the old dentry, we ended up at
    btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
    of them join the current log transaction, effectively waiting for any log
    transaction writeout (due to acquiring the root's log_mutex). This made it
    safe even after leaving the current log transaction, because we remained
    with the log pinned when we called btrfs_log_new_name().
    
    Then in commit 259c4b96 ("btrfs: stop doing unnecessary log updates
    during a rename"), we removed the log pinning from btrfs_rename() and
    stopped calling btrfs_del_inode_ref_in_log() and
    btrfs_del_dir_entries_in_log() during the rename, and started to do all
    the needed work at btrfs_log_new_name(), but without joining the current
    log transaction, only pinning the log, which is racy because another task
    may have started writeout of the log tree right before we pinned the log.
    
    Both commits landed in kernel 5.18, so it doesn't make any practical
    difference which should be blamed, but I'm blaming the second commit only
    because with the first one, by chance, the problem did not happen due to
    the fact we joined the log transaction after pinning the log and unpinned
    it only after calling btrfs_log_new_name().
    
    So make btrfs_log_new_name() join the current log transaction instead of
    pinning it, so that we never do log updates if it's writeout is starting.
    
    Fixes: 259c4b96 ("btrfs: stop doing unnecessary log updates during a rename")
    CC: stable@vger.kernel.org # 5.18+
    Reported-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
    Tested-by: default avatarZygo Blaxell <ce3g8jdj@umail.furryterror.org>
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    723df2bc
tree-log.c 196 KB