• Filipe Manana's avatar
    btrfs: fix race between page release and a fast fsync · 3d6448e6
    Filipe Manana authored
    When releasing an extent map, done through the page release callback, we
    can race with an ongoing fast fsync and cause the fsync to miss a new
    extent and not log it. The steps for this to happen are the following:
    
    1) A page is dirtied for some inode I;
    
    2) Writeback for that page is triggered by a path other than fsync, for
       example by the system due to memory pressure;
    
    3) When the ordered extent for the extent (a single 4K page) finishes,
       we unpin the corresponding extent map and set its generation to N,
       the current transaction's generation;
    
    4) The btrfs_releasepage() callback is invoked by the system due to
       memory pressure for that no longer dirty page of inode I;
    
    5) At the same time, some task calls fsync on inode I, joins transaction
       N, and at btrfs_log_inode() it sees that the inode does not have the
       full sync flag set, so we proceed with a fast fsync. But before we get
       into btrfs_log_changed_extents() and lock the inode's extent map tree:
    
    6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
       and we remove the extent map for the new 4Kb extent, because it is
       neither pinned anymore nor locked. By calling remove_extent_mapping(),
       we remove the extent map from the list of modified extents, since the
       extent map does not have the logging flag set. We unlock the inode's
       extent map tree;
    
    7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
       locks the inode's extent map tree and iterates its list of modified
       extents, which no longer has the 4Kb extent in it, so it does not log
       the extent;
    
    8) The fsync finishes;
    
    9) Before transaction N is committed, a power failure happens. After
       replaying the log, the 4K extent of inode I will be missing, since
       it was not logged due to the race with try_release_extent_mapping().
    
    So fix this by teaching try_release_extent_mapping() to not remove an
    extent map if it's still in the list of modified extents.
    
    Fixes: ff44c6e3 ("Btrfs: do not hold the write_lock on the extent tree while logging")
    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    3d6448e6
extent_io.c 157 KB