An error occurred fetching the project authors.
  1. 05 Dec, 2022 5 commits
  2. 26 Sep, 2022 25 commits
  3. 15 Mar, 2022 1 commit
  4. 08 Dec, 2020 4 commits
    • Qu Wenruo's avatar
      btrfs: use fixed width int type for extent_state::state · f97e27e9
      Qu Wenruo authored
      Currently the type is unsigned int which could change its width
      depending on the architecture. We need up to 32 bits so make it
      explicit.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f97e27e9
    • Nikolay Borisov's avatar
      btrfs: merge __set_extent_bit and set_extent_bit · 1cab5e72
      Nikolay Borisov authored
      There are only 2 direct calls to set_extent_bit outside of extent-io -
      in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are
      thin wrappers around __set_extent_bit. This adds unnecessary indirection
      and just makes it more annoying when looking at the various extent bit
      manipulation functions.  This patch renames __set_extent_bit to
      set_extent_bit effectively removing a level of indirection. No
      functional changes.
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ reformat and remove __must_check ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cab5e72
    • Filipe Manana's avatar
      btrfs: update the number of bytes used by an inode atomically · 2766ff61
      Filipe Manana authored
      There are several occasions where we do not update the inode's number of
      used bytes atomically, resulting in a concurrent stat(2) syscall to report
      a value of used blocks that does not correspond to a valid value, that is,
      a value that does not match neither what we had before the operation nor
      what we get after the operation completes.
      
      In extreme cases it can result in stat(2) reporting zero used blocks, which
      can cause problems for some userspace tools where they can consider a file
      with a non-zero size and zero used blocks as completely sparse and skip
      reading data, as reported/discussed a long time ago in some threads like
      the following:
      
        https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html
      
      The cases where this can happen are the following:
      
      -> Case 1
      
      If we do a write (buffered or direct IO) against a file region for which
      there is already an allocated extent (or multiple extents), then we have a
      short time window where we can report a number of used blocks to stat(2)
      that does not take into account the file region being overwritten. This
      short time window happens when completing the ordered extent(s).
      
      This happens because when we drop the extents in the write range we
      decrement the inode's number of bytes and later on when we insert the new
      extent(s) we increment the number of bytes in the inode, resulting in a
      short time window where a stat(2) syscall can get an incorrect number of
      used blocks.
      
      If we do writes that overwrite an entire file, then we have a short time
      window where we report 0 used blocks to stat(2).
      
      Example reproducer:
      
        $ cat reproducer-1.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                   echo -n "ERROR: unexpected used blocks"
                   echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        expected=$(stat -c %b $MNT/foobar)
      
        # Create a process to keep calling stat(2) on the file and see if the
        # reported number of blocks used (disk space used) changes, it should
        # not because we are not increasing the file size nor punching holes.
        stat_loop $MNT/foobar $expected &
        loop_pid=$!
      
        for ((i = 0; i < 50000; i++)); do
            xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null
        done
      
        kill $loop_pid &> /dev/null
        wait
      
        umount $DEV
      
        $ ./reproducer-1.sh
        ERROR: unexpected used blocks (got: 0 expected: 128)
        ERROR: unexpected used blocks (got: 0 expected: 128)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 2
      
      If we do a buffered write against a file region that does not have any
      allocated extents, like a hole or beyond EOF, then during ordered extent
      completion we have a short time window where a concurrent stat(2) syscall
      can report a number of used blocks that does not correspond to the value
      before or after the write operation, a value that is actually larger than
      the value after the write completes.
      
      This happens because once we start a buffered write into an unallocated
      file range we increment the inode's 'new_delalloc_bytes', to make sure
      any stat(2) call gets a correct used blocks value before delalloc is
      flushed and completes. However at ordered extent completion, after we
      inserted the new extent, we increment the inode's number of bytes used
      with the size of the new extent, and only later, when clearing the range
      in the inode's iotree, we decrement the inode's 'new_delalloc_bytes'
      counter with the size of the extent. So this results in a short time
      window where a concurrent stat(2) syscall can report a number of used
      blocks that accounts for the new extent twice.
      
      Example reproducer:
      
        $ cat reproducer-2.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        stat_loop()
        {
            trap "wait; exit" SIGTERM
            local filepath=$1
            local expected=$2
            local got
      
            while :; do
                got=$(stat -c %b $filepath)
                if [ $got -ne $expected ]; then
                    echo -n "ERROR: unexpected used blocks"
                    echo " (got: $got expected: $expected)"
                fi
            done
        }
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f $DEV > /dev/null
        # mkfs.ext4 -F $DEV > /dev/null
        # mkfs.f2fs -f $DEV > /dev/null
        # mkfs.reiserfs -f $DEV > /dev/null
        mount $DEV $MNT
      
        touch $MNT/foobar
        write_size=$((64 * 1024))
        for ((i = 0; i < 16384; i++)); do
           offset=$(($i * $write_size))
           xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null
           blocks_used=$(stat -c %b $MNT/foobar)
      
           # Fsync the file to trigger writeback and keep calling stat(2) on it
           # to see if the number of blocks used changes.
           stat_loop $MNT/foobar $blocks_used &
           loop_pid=$!
           xfs_io -c "fsync" $MNT/foobar
      
           kill $loop_pid &> /dev/null
           wait $loop_pid
        done
      
        umount $DEV
      
        $ ./reproducer-2.sh
        ERROR: unexpected used blocks (got: 265472 expected: 265344)
        ERROR: unexpected used blocks (got: 284032 expected: 283904)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      -> Case 3
      
      Another case where such problems happen is during other operations that
      replace extents in a file range with other extents. Those operations are
      extent cloning, deduplication and fallocate's zero range operation.
      
      The cause of the problem is similar to the first case. When we drop the
      extents from a range, we decrement the inode's number of bytes, and later
      on, after inserting the new extents we increment it. Since this is not
      done atomically, a concurrent stat(2) call can see and return a number of
      used blocks that is smaller than it should be, does not match the number
      of used blocks before or after the clone/deduplication/zero operation.
      
      Like for the first case, when doing a clone, deduplication or zero range
      operation against an entire file, we end up having a time window where we
      can report 0 used blocks to a stat(2) call.
      
      Example reproducer:
      
        $ cat reproducer-3.sh
        #!/bin/bash
      
        MNT=/mnt/sdi
        DEV=/dev/sdi
      
        mkfs.btrfs -f $DEV > /dev/null
        # mkfs.xfs -f -m reflink=1 $DEV > /dev/null
        mount $DEV $MNT
      
        extent_size=$((64 * 1024))
        num_extents=16384
        file_size=$(($extent_size * $num_extents))
      
        # File foo has many small extents.
        xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \
            > /dev/null
        # File bar has much less extents and has exactly the same data as foo.
        xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null
      
        expected=$(stat -c %b $MNT/foo)
      
        # Now deduplicate bar into foo. While the deduplication is in progres,
        # the number of used blocks/file size reported by stat should not change
        xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null  &
        dedupe_pid=$!
        while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do
            used=$(stat -c %b $MNT/foo)
            if [ $used -ne $expected ]; then
                echo "Unexpected blocks used: $used (expected: $expected)"
            fi
        done
      
        umount $DEV
      
        $ ./reproducer-3.sh
        Unexpected blocks used: 2076800 (expected: 2097152)
        Unexpected blocks used: 2097024 (expected: 2097152)
        Unexpected blocks used: 2079872 (expected: 2097152)
        (...)
      
      Note that since this is a short time window where the race can happen, the
      reproducer may not be able to always trigger the bug in one run, or it may
      trigger it multiple times.
      
      So fix this by:
      
      1) Making btrfs_drop_extents() not decrement the VFS inode's number of
         bytes, and instead return the number of bytes;
      
      2) Making any code that drops extents and adds new extents update the
         inode's number of bytes atomically, while holding the btrfs inode's
         spinlock, which is also used by the stat(2) callback to get the inode's
         number of bytes;
      
      3) For ranges in the inode's iotree that are marked as 'delalloc new',
         corresponding to previously unallocated ranges, increment the inode's
         number of bytes when clearing the 'delalloc new' bit from the range,
         in the same critical section that decrements the inode's
         'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock.
      
      An alternative would be to have btrfs_getattr() wait for any IO (ordered
      extents in progress) and locking the whole range (0 to (u64)-1) while it
      it computes the number of blocks used. But that would mean blocking
      stat(2), which is a very used syscall and expected to be fast, waiting
      for writes, clone/dedupe, fallocate, page reads, fiemap, etc.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2766ff61
    • Qu Wenruo's avatar
      btrfs: sink the failed_start parameter to set_extent_bit · 3f6bb4ae
      Qu Wenruo authored
      The @failed_start parameter is only paired with @exclusive_bits, and
      those parameters are only used for EXTENT_LOCKED bit, which have their
      own wrappers lock_extent_bits().
      
      Thus for regular set_extent_bit() calls, the failed_start makes no
      sense, just sink the parameter.
      
      Also, since @failed_start and @exclusive_bits are used in pairs, add
      an assert to make it obvious.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3f6bb4ae
  5. 07 Oct, 2020 3 commits
  6. 12 Aug, 2020 1 commit
    • Qu Wenruo's avatar
      btrfs: trim: fix underflow in trim length to prevent access beyond device boundary · c57dd1f2
      Qu Wenruo authored
      [BUG]
      The following script can lead to tons of beyond device boundary access:
      
        mkfs.btrfs -f $dev -b 10G
        mount $dev $mnt
        trimfs $mnt
        btrfs filesystem resize 1:-1G $mnt
        trimfs $mnt
      
      [CAUSE]
      Since commit 929be17a ("btrfs: Switch btrfs_trim_free_extents to
      find_first_clear_extent_bit"), we try to avoid trimming ranges that's
      already trimmed.
      
      So we check device->alloc_state by finding the first range which doesn't
      have CHUNK_TRIMMED and CHUNK_ALLOCATED not set.
      
      But if we shrunk the device, that bits are not cleared, thus we could
      easily got a range starts beyond the shrunk device size.
      
      This results the returned @start and @end are all beyond device size,
      then we call "end = min(end, device->total_bytes -1);" making @end
      smaller than device size.
      
      Then finally we goes "len = end - start + 1", totally underflow the
      result, and lead to the beyond-device-boundary access.
      
      [FIX]
      This patch will fix the problem in two ways:
      
      - Clear CHUNK_TRIMMED | CHUNK_ALLOCATED bits when shrinking device
        This is the root fix
      
      - Add extra safety check when trimming free device extents
        We check and warn if the returned range is already beyond current
        device.
      
      Link: https://github.com/kdave/btrfs-progs/issues/282
      Fixes: 929be17a ("btrfs: Switch btrfs_trim_free_extents to find_first_clear_extent_bit")
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c57dd1f2
  7. 27 Jul, 2020 1 commit