• Filipe Manana's avatar
    btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled · 3660d0bc
    Filipe Manana authored
    When using the NO_HOLES feature, if we clone a file range that spans only
    a hole into a range that is at or beyond the current i_size of the
    destination file, we end up not setting the full sync runtime flag on the
    inode. As a result, if we then fsync the destination file and have a power
    failure, after log replay we can end up exposing stale data instead of
    having a hole for that range.
    
    The conditions for this to happen are the following:
    
    1) We have a file with a size of, for example, 1280K;
    
    2) There is a written (non-prealloc) extent for the file range from 1024K
       to 1280K with a length of 256K;
    
    3) This particular file extent layout is durably persisted, so that the
       existing superblock persisted on disk points to a subvolume root where
       the file has that exact file extent layout and state;
    
    4) The file is truncated to a smaller size, to an offset lower than the
       start offset of its last extent, for example to 800K. The truncate sets
       the full sync runtime flag on the inode;
    
    6) Fsync the file to log it and clear the full sync runtime flag;
    
    7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
       into the file with a destination offset that starts at or beyond the
       256K file extent item we had - for example to offset 1024K;
    
    8) Since the clone operation does not find extents in the source range,
       we end up in the if branch at the bottom of btrfs_clone() where we
       punch a hole for the file range starting at offset 1024K by calling
       btrfs_replace_file_extents(). There we end up not setting the full
       sync flag on the inode, because we don't know we are being called in
       a clone context (and not fallocate's punch hole operation), and
       neither do we create an extent map to represent a hole because the
       requested range is beyond eof;
    
    9) A further fsync to the file will be a fast fsync, since the clone
       operation did not set the full sync flag, and therefore it relies on
       modified extent maps to correctly log the file layout. But since
       it does not find any extent map marking the range from 1024K (the
       previous eof) to the new eof, it does not log a file extent item
       for that range representing the hole;
    
    10) After a power failure no hole for the range starting at 1024K is
       punched and we end up exposing stale data from the old 256K extent.
    
    Turning this into exact steps:
    
      $ mkfs.btrfs -f -O no-holes /dev/sdi
      $ mount /dev/sdi /mnt
    
      # Create our test file with 3 extents of 256K and a 256K hole at offset
      # 256K. The file has a size of 1280K.
      $ xfs_io -f -s \
                  -c "pwrite -S 0xab -b 256K 0 256K" \
                  -c "pwrite -S 0xcd -b 256K 512K 256K" \
                  -c "pwrite -S 0xef -b 256K 768K 256K" \
                  -c "pwrite -S 0x73 -b 256K 1024K 256K" \
                  /mnt/sdi/foobar
    
      # Make sure it's durably persisted. We want the last committed super
      # block to point to this particular file extent layout.
      sync
    
      # Now truncate our file to a smaller size, falling within a position of
      # the second extent. This sets the full sync runtime flag on the inode.
      # Then fsync the file to log it and clear the full sync flag from the
      # inode. The third extent is no longer part of the file and therefore
      # it is not logged.
      $ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
    
      # Now do a clone operation that only clones the hole and sets back the
      # file size to match the size it had before the truncate operation
      # (1280K).
      $ xfs_io \
            -c "reflink /mnt/foobar 256K 1024K 256K" \
            -c "fsync" \
            /mnt/foobar
    
      # File data before power failure:
      $ od -A d -t x1 /mnt/foobar
      0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
      *
      0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
      *
      0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
      *
      0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      1310720
    
      <power fail>
    
      # Mount the fs again to replay the log tree.
      $ mount /dev/sdi /mnt
    
      # File data after power failure:
      $ od -A d -t x1 /mnt/foobar
      0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
      *
      0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
      *
      0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
      *
      0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
      *
      1310720
    
    The range from 1024K to 1280K should correspond to a hole but instead it
    points to stale data, to the 256K extent that should not exist after the
    truncate operation.
    
    The issue does not exists when not using NO_HOLES, because for that case
    we use file extent items to represent holes, these are found and copied
    during the loop that iterates over extents at btrfs_clone(), and that
    causes btrfs_replace_file_extents() to be called with a non-NULL
    extent_info argument and therefore set the full sync runtime flag on the
    inode.
    
    So fix this by making the code that deals with a trailing hole during
    cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
    range starts at or beyond the current i_size.
    
    A test case for fstests will follow soon.
    
    Backporting notes: for kernel 5.4 the change goes to ioctl.c into
    btrfs_clone before the last call to btrfs_punch_hole_range.
    
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    3660d0bc
reflink.c 25.6 KB