• Filipe Manana's avatar
    Btrfs: fix missing data checksums after a ranged fsync (msync) · 008c6753
    Filipe Manana authored
    Recently we got a massive simplification for fsync, where for the fast
    path we no longer log new extents while their respective ordered extents
    are still running.
    
    However that simplification introduced a subtle regression for the case
    where we use a ranged fsync (msync). Consider the following example:
    
                   CPU 0                                    CPU 1
    
                                                mmap write to range [2Mb, 4Mb[
      mmap write to range [512Kb, 1Mb[
      msync range [512K, 1Mb[
        --> triggers fast fsync
            (BTRFS_INODE_NEEDS_FULL_SYNC
             not set)
        --> creates extent map A for this
            range and adds it to list of
            modified extents
        --> starts ordered extent A for
            this range
        --> waits for it to complete
    
                                                writeback triggered for range
                                                [2Mb, 4Mb[
                                                  --> create extent map B and
                                                      adds it to the list of
                                                      modified extents
                                                  --> creates ordered extent B
    
        --> start looking for and logging
            modified extents
        --> logs extent maps A and B
        --> finds checksums for extent A
            in the csum tree, but not for
            extent B
      fsync (msync) finishes
    
                                                  --> ordered extent B
                                                      finishes and its
                                                      checksums are added
                                                      to the csum tree
    
                                    <power cut>
    
    After replaying the log, we have the extent covering the range [2Mb, 4Mb[
    but do not have the data checksum items covering that file range.
    
    This happens because at the very beginning of an fsync (btrfs_sync_file())
    we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
    wait for any ordered extents in that range to complete before we start
    logging the extents. However if right before we start logging the extent
    in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
    such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
    or msync (btrfs_sync_file() starts writeback before acquiring the inode's
    lock), an ordered extent is created for that other range and a new extent
    map is created to represent that range and added to the inode's list of
    modified extents.
    
    That means that we will see that other extent in that list when collecting
    extents for logging (done at btrfs_log_changed_extents()) and log the
    extent before the respective ordered extent finishes - namely before the
    checksum items are added to the checksums tree, which is where
    log_extent_csums() looks for the checksums, therefore making us log an
    extent without logging its checksums. Before that massive simplification
    of fsync, this wasn't a problem because besides looking for checkums in
    the checksums tree, we also looked for them in any ordered extent still
    running.
    
    The consequence of data checksums missing for a file range is that users
    attempting to read the affected file range will get -EIO errors and dmesg
    reports the following:
    
     [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
     [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
    
    So fix this by skipping extents outside of our logging range at
    btrfs_log_changed_extents() and leaving them on the list of modified
    extents so that any subsequent ranged fsync may collect them if needed.
    Also, if we find a hole extent outside of the range still log it, just
    to prevent having gaps between extent items after replaying the log,
    otherwise fsck will complain when we are not using the NO_HOLES feature
    (fstest btrfs/056 triggers such case).
    
    Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    008c6753
tree-log.c 164 KB