1. 16 May, 2022 40 commits
    • Christoph Hellwig's avatar
      btrfs: check-integrity: split submit_bio from btrfsic checking · 58ff51f1
      Christoph Hellwig authored
      Require a separate call to the integrity checking helpers from the
      actual bio submission.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      58ff51f1
    • Christoph Hellwig's avatar
      btrfs: factor check and flush helpers from __btrfsic_submit_bio · 57906d58
      Christoph Hellwig authored
      Split out two helpers to make __btrfsic_submit_bio more readable.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      57906d58
    • Johannes Thumshirn's avatar
      btrfs: zoned: make auto-reclaim less aggressive · 3687fcb0
      Johannes Thumshirn authored
      The current auto-reclaim algorithm starts reclaiming all block groups
      with a zone_unusable value above a configured threshold. This is causing
      a lot of reclaim IO even if there would be enough free zones on the
      device.
      
      Instead of only accounting a block groups zone_unusable value, also take
      the ratio of free and not usable (written as well as zone_unusable)
      bytes a device has into account.
      Tested-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3687fcb0
    • Josef Bacik's avatar
      btrfs: change the bg_reclaim_threshold valid region from 0 to 100 · ef972e7b
      Josef Bacik authored
      For the non-zoned case we may want to set the threshold for reclaim to
      something below 50%.  Change the acceptable threshold from 50-100 to
      0-100.
      Tested-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ef972e7b
    • Josef Bacik's avatar
      btrfs: allow block group background reclaim for non-zoned filesystems · ac2f1e63
      Josef Bacik authored
      This will allow us to set a threshold for block groups to be
      automatically relocated even if we don't have zoned devices.
      
      We have found this feature invaluable at Facebook due to how our
      workload interacts with the allocator.  We have been using this in
      production for months with only a single problem that has already been
      fixed.
      Tested-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac2f1e63
    • Josef Bacik's avatar
      btrfs: make the bg_reclaim_threshold per-space info · bb5a098d
      Josef Bacik authored
      For non-zoned file systems it's useful to have the auto reclaim feature,
      however there are different use cases for non-zoned, for example we may
      not want to reclaim metadata chunks ever, only data chunks.  Move this
      sysfs flag to per-space_info.  This won't affect current users because
      this tunable only ever did anything for zoned, and that is currently
      hidden behind BTRFS_CONFIG_DEBUG.
      Tested-by: default avatarPankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      [ jth restore global bg_reclaim_threshold ]
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bb5a098d
    • Filipe Manana's avatar
      btrfs: do not test for free space inode during NOCOW check against file extent · a7bb6bd4
      Filipe Manana authored
      When checking if we can do a NOCOW write against a range covered by a file
      extent item, we do a quick a check to determine if the inode's root was
      snapshotted in a generation older than the generation of the file extent
      item or not. This is to quickly determine if the extent is likely shared
      and avoid the expensive check for cross references (this was added in
      commit 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in
      nocow path").
      
      We restrict that check to the case where the inode is not a free space
      inode (since commit 27a7ff55 ("btrfs: skip file_extent generation
      check for free_space_inode in run_delalloc_nocow")). That is because when
      we had the inode cache feature, inode caches were backed by a free space
      inode that belonged to the inode's root.
      
      However we don't have support for the inode cache feature since kernel
      5.11, so we don't need this check anymore since free space inodes are
      now always related to free space caches, which are always associated to
      the root tree (which can't be snapshotted, and its last_snapshot field
      is always 0).
      
      So remove that condition.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a7bb6bd4
    • Filipe Manana's avatar
      btrfs: move common NOCOW checks against a file extent into a helper · 619104ba
      Filipe Manana authored
      Verifying if we can do a NOCOW write against a range fully or partially
      covered by a file extent item requires verifying several constraints, and
      these are currently duplicated at two different places: can_nocow_extent()
      and run_delalloc_nocow().
      
      This change moves those checks into a common helper function to avoid
      duplication. It adds some comments and also preserves all existing
      behaviour like for example can_nocow_extent() treating errors from the
      calls to btrfs_cross_ref_exist() and csum_exist_in_range() as meaning
      we can not NOCOW, instead of propagating the error back to the caller.
      That specific behaviour is questionable but also reasonable to some
      degree.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      619104ba
    • Sweet Tea Dorminy's avatar
      btrfs: wait between incomplete batch memory allocations · 395cb57e
      Sweet Tea Dorminy authored
      When allocating memory in a loop, each iteration should call
      memalloc_retry_wait() in order to prevent starving memory-freeing
      processes (and to mark where allocation loops are). Other filesystems do
      that as well.
      
      The bulk page allocation is the only place in btrfs with an allocation
      retry loop, so add an appropriate call to it.
      Signed-off-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      395cb57e
    • Sweet Tea Dorminy's avatar
      btrfs: allocate page arrays using bulk page allocator · 91d6ac1d
      Sweet Tea Dorminy authored
      While calling alloc_page() in a loop is an effective way to populate an
      array of pages, the MM subsystem provides a method to allocate pages in
      bulk.  alloc_pages_bulk_array() populates the NULL slots in a page
      array, trying to grab more than one page at a time.
      
      Unfortunately, it doesn't guarantee allocating all slots in the array,
      but it's easy to call it in a loop and return an error if no progress
      occurs.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      91d6ac1d
    • Sweet Tea Dorminy's avatar
      btrfs: factor out allocating an array of pages · dd137dd1
      Sweet Tea Dorminy authored
      Several functions currently populate an array of page pointers one
      allocated page at a time. Factor out the common code so as to allow
      improvements to all of the sites at once.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd137dd1
    • Yu Zhe's avatar
      btrfs: remove unnecessary type casts · 0d031dc4
      Yu Zhe authored
      Explicit type casts are not necessary when it's void* to another pointer
      type.
      Signed-off-by: default avatarYu Zhe <yuzhe@nfschina.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0d031dc4
    • Qu Wenruo's avatar
      btrfs: expand subpage support to any PAGE_SIZE > 4K · 1a42daab
      Qu Wenruo authored
      With the recent change in metadata handling, we can handle metadata in
      the following cases:
      
      - nodesize < PAGE_SIZE and sectorsize < PAGE_SIZE
        Go subpage routine for both metadata and data.
      
      - nodesize < PAGE_SIZE and sectorsize >= PAGE_SIZE
        Invalid case for now. As we require nodesize >= sectorsize.
      
      - nodesize >= PAGE_SIZE and sectorsize < PAGE_SIZE
        Go subpage routine for data, but regular page routine for metadata.
      
      - nodesize >= PAGE_SIZE and sectorsize >= PAGE_SIZE
        Go regular page routine for both metadata and data.
      
      Now we can handle any sectorsize < PAGE_SIZE, plus the existing
      sectorsize == PAGE_SIZE support.
      
      But here we introduce an artificial limit, any PAGE_SIZE > 4K case, we
      will only support 4K and PAGE_SIZE as sector size.
      
      The idea here is to reduce the test combinations, and push 4K as the
      default standard in the future.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a42daab
    • Qu Wenruo's avatar
      btrfs: make nodesize >= PAGE_SIZE case to reuse the non-subpage routine · fbca46eb
      Qu Wenruo authored
      The reason why we only support 64K page size for subpage is, for 64K
      page size we can ensure no matter what the nodesize is, we can fit it
      into one page.
      
      When other page size come, especially like 16K, the limitation is a bit
      limiting.
      
      To remove such limitation, we allow nodesize >= PAGE_SIZE case to go the
      non-subpage routine.  By this, we can allow 4K sectorsize on 16K page
      size.
      
      Although this introduces another smaller limitation, the metadata can
      not cross page boundary, which is already met by most recent mkfs.
      
      Another small improvement is, we can avoid the overhead for metadata if
      nodesize >= PAGE_SIZE.
      For 4K sector size and 64K page size/node size, or 4K sector size and
      16K page size/node size, we don't need to allocate extra memory for the
      metadata pages.
      
      Please note that, this patch will not yet enable other page size support
      yet.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fbca46eb
    • Qu Wenruo's avatar
      btrfs: use dummy extent buffer for super block sys chunk array read · e959d3c1
      Qu Wenruo authored
      In function btrfs_read_sys_array(), we allocate a real extent buffer
      using btrfs_find_create_tree_block().
      
      Such extent buffer will be even cached into buffer_radix tree, and using
      btree inode address space.
      
      However we only use such extent buffer to enable the accessors, thus we
      don't even need to bother using real extent buffer, a dummy one is
      what we really need.
      
      And for dummy extent buffer, we no longer need to do any special
      handling for the first page, as subpage helper is already doing it
      properly.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e959d3c1
    • Naohiro Aota's avatar
      btrfs: assert that relocation is protected with sb_start_write() · 0320b353
      Naohiro Aota authored
      Relocation of a data block group creates ordered extents. They can cause
      a hang when a process is trying to thaw the filesystem.
      
      We should have called sb_start_write(), so the filesystem is not being
      frozen. Add an ASSERT to check it is protected.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0320b353
    • Naohiro Aota's avatar
      fs: add a lockdep check function for sb_start_write() · 7f8d12ea
      Naohiro Aota authored
      Add a function sb_write_started() to allow callers to verify if
      sb_start_write() is properly called. It will be used for assertion in
      btrfs.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7f8d12ea
    • Nikolay Borisov's avatar
      btrfs: simplify code flow in btrfs_ioctl_balance · d8645462
      Nikolay Borisov authored
      Move code in btrfs_ioctl_balance to simplify its flow. This is
      possible thanks to the removal of balance v1 ioctl and ensuring 'arg'
      argument is always present. First move the code duplicating the
      userspace arg to the kernel 'barg'. This makes the out_unlock label
      redundant.  Secondly, check the validity of bargs::flags before copying
      to the dynamically allocated 'bctl'. This removes the need for the
      out_bctl label.
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d8645462
    • Nikolay Borisov's avatar
      btrfs: remove checks for arg argument in btrfs_ioctl_balance · 39864601
      Nikolay Borisov authored
      With the removal of balance v1 ioctl the 'arg' argument is guaranteed to
      be present so simply remove all conditional code which checks for its
      presence.
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      39864601
    • Qu Wenruo's avatar
      btrfs: replace memset with memzero_page in data checksum verification · b06660b5
      Qu Wenruo authored
      The original code resets the page to 0x1 for not apparent reason, it's
      been like that since the initial 2007 code added in commit 07157aac
      ("Btrfs: Add file data csums back in via hooks in the extent map code").
      
      It could mean that a failed buffer can be detected from the data but
      that's just a guess and any value is good.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b06660b5
    • Filipe Manana's avatar
      btrfs: avoid blocking on space revervation when doing nowait dio writes · d4135134
      Filipe Manana authored
      When doing a NOWAIT direct IO write, if we can NOCOW then it means we can
      proceed with the non-blocking, NOWAIT path. However reserving the metadata
      space and qgroup meta space can often result in blocking - flushing
      delalloc, wait for ordered extents to complete, trigger transaction
      commits, etc, going against the semantics of a NOWAIT write.
      
      So make the NOWAIT write path to try to reserve all the metadata it needs
      without resulting in a blocking behaviour - if we get -ENOSPC or -EDQUOT
      then return -EAGAIN to make the caller fallback to a blocking direct IO
      write.
      
      This is part of a patchset comprised of the following patches:
      
        btrfs: avoid blocking on page locks with nowait dio on compressed range
        btrfs: avoid blocking nowait dio when locking file range
        btrfs: avoid double nocow check when doing nowait dio writes
        btrfs: stop allocating a path when checking if cross reference exists
        btrfs: free path at can_nocow_extent() before checking for checksum items
        btrfs: release path earlier at can_nocow_extent()
        btrfs: avoid blocking when allocating context for nowait dio read/write
        btrfs: avoid blocking on space revervation when doing nowait dio writes
      
      The following test was run before and after applying this patchset:
      
        $ cat io-uring-nodatacow-test.sh
        #!/bin/bash
      
        DEV=/dev/sdc
        MNT=/mnt/sdc
      
        MOUNT_OPTIONS="-o ssd -o nodatacow"
        MKFS_OPTIONS="-R free-space-tree -O no-holes"
      
        NUM_JOBS=4
        FILE_SIZE=8G
        RUN_TIME=300
      
        cat <<EOF > /tmp/fio-job.ini
        [io_uring_rw]
        rw=randrw
        fsync=0
        fallocate=posix
        group_reporting=1
        direct=1
        ioengine=io_uring
        iodepth=64
        bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
        filesize=$FILE_SIZE
        runtime=$RUN_TIME
        time_based
        filename=foobar
        directory=$MNT
        numjobs=$NUM_JOBS
        thread
        EOF
      
        echo performance | \
           tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        umount $MNT &> /dev/null
        mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
        mount $MOUNT_OPTIONS $DEV $MNT
      
        fio /tmp/fio-job.ini
      
        umount $MNT
      
      The test was run a 12 cores box with 64G of ram, using a non-debug kernel
      config (Debian's default config) and a spinning disk.
      
      Result before the patchset:
      
       READ: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      WRITE: bw=407MiB/s (427MB/s), 407MiB/s-407MiB/s (427MB/s-427MB/s), io=119GiB (128GB), run=300175-300175msec
      
      Result after the patchset:
      
       READ: bw=436MiB/s (457MB/s), 436MiB/s-436MiB/s (457MB/s-457MB/s), io=128GiB (137GB), run=300044-300044msec
      WRITE: bw=435MiB/s (456MB/s), 435MiB/s-435MiB/s (456MB/s-456MB/s), io=128GiB (137GB), run=300044-300044msec
      
      That's about +7.2% throughput for reads and +6.9% for writes.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d4135134
    • Filipe Manana's avatar
      btrfs: avoid blocking when allocating context for nowait dio read/write · 4f208dcc
      Filipe Manana authored
      When doing a NOWAIT direct IO read/write, we allocate a context object
      (struct btrfs_dio_data) with GFP_NOFS, which can result in blocking
      waiting for memory allocation (GFP_NOFS is __GFP_RECLAIM | __GFP_IO).
      This is undesirable for the NOWAIT semantics, so do the allocation with
      GFP_NOWAIT if we are serving a NOWAIT request and if the allocation fails
      return -EAGAIN, so that the caller can fallback to a blocking context and
      retry with a non-blocking write.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4f208dcc
    • Filipe Manana's avatar
      btrfs: release path earlier at can_nocow_extent() · 59d35c51
      Filipe Manana authored
      At can_nocow_extent(), we are releasing the path only after checking if
      the block group that has the target extent is read only, and after
      checking if there's delalloc in the range in case our extent is a
      preallocated extent. The read only extent check can be expensive if we
      have a very large filesystem with many block groups, as well as the
      check for delalloc in the inode's io_tree in case the io_tree is big
      due to IO on other file ranges.
      
      Our path is holding a read lock on a leaf and there's no need to keep
      the lock while doing those two checks, so release the path before doing
      them, immediately after the last use of the leaf.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59d35c51
    • Filipe Manana's avatar
      btrfs: free path at can_nocow_extent() before checking for checksum items · c1a548db
      Filipe Manana authored
      When we look for checksum items, through csum_exist_in_range(), at
      can_nocow_extent(), we no longer need the path that we have previously
      allocated. Through csum_exist_in_range() -> btrfs_lookup_csums_range(),
      we also end up allocating a path, so we are adding unnecessary extra
      memory usage. So free the path before calling csum_exist_in_range().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c1a548db
    • Filipe Manana's avatar
      btrfs: stop allocating a path when checking if cross reference exists · 1a89f173
      Filipe Manana authored
      At btrfs_cross_ref_exist() we always allocate a path, but we really don't
      need to because all its callers (only 2) already have an allocated path
      that is not being used when they call btrfs_cross_ref_exist(). So change
      btrfs_cross_ref_exist() to take a path as an argument and update both
      its callers to pass in the unused path they have when they call
      btrfs_cross_ref_exist().
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1a89f173
    • Filipe Manana's avatar
      btrfs: avoid double nocow check when doing nowait dio writes · d7a8ab4e
      Filipe Manana authored
      When doing a NOWAIT direct IO write we are checking twice if we can COW
      into the target file range using can_nocow_extent() - once at the very
      beginning of the write path, at btrfs_write_check() via
      check_nocow_nolock(), and later again at btrfs_get_blocks_direct_write().
      
      The can_nocow_extent() function does a lot of expensive things - searching
      for the file extent item in the inode's subvolume tree, searching for the
      extent item in the extent tree, checking delayed references, etc, so it
      isn't a very cheap call.
      
      We can remove the first check at btrfs_write_check(), and add there a
      quick check to verify if the inode has the NODATACOW or PREALLOC flags,
      and quickly bail out if it doesn't have neither of those flags, as that
      means we have to COW and therefore can't comply with the NOWAIT semantics.
      
      After this we do only one call to can_nocow_extent(), while we are at
      btrfs_get_blocks_direct_write(), where we have already locked the file
      range and we did a try lock on the range before, at
      btrfs_dio_iomap_begin() (since the previous patch in the series).
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d7a8ab4e
    • Filipe Manana's avatar
      btrfs: avoid blocking nowait dio when locking file range · 59094403
      Filipe Manana authored
      If we are doing a NOWAIT direct IO read/write, we can block when locking
      the file range at btrfs_dio_iomap_begin(), as it's possible the range (or
      a part of it) is already locked by another task (mmap writes, another
      direct IO read/write racing with us, fiemap, etc). We are also waiting for
      completion of any ordered extent we find in the range, which also can
      block us for a significant amount of time.
      
      There's also the incorrect fallback to buffered IO (returning -ENOTBLK)
      when we are dealing with a NOWAIT request and we can't proceed. In this
      case we should be returning -EAGAIN, as falling back to buffered IO can
      result in blocking for many different reasons, so that the caller can
      delegate a retry to a context where blocking is more acceptable.
      
      Fix these cases by:
      
      1) Doing a try lock on the file range and failing with -EAGAIN if we
         can not lock right away;
      
      2) Fail with -EAGAIN if we find an ordered extent;
      
      3) Return -EAGAIN instead of -ENOTBLK when we need to fallback to
         buffered IO and we have a NOWAIT request.
      
      This will also allow us to avoid a duplicated check that verifies if we
      are able to do a NOCOW write for NOWAIT direct IO writes, done in the
      next patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      59094403
    • Filipe Manana's avatar
      btrfs: avoid blocking on page locks with nowait dio on compressed range · b023e675
      Filipe Manana authored
      If we are doing NOWAIT direct IO read/write and our inode has compressed
      extents, we call filemap_fdatawrite_range() against the range in order
      to wait for compressed writeback to complete, since the generic code at
      iomap_dio_rw() calls filemap_write_and_wait_range() once, which is not
      enough to wait for compressed writeback to complete.
      
      This call to filemap_fdatawrite_range() can block on page locks, since
      the first writepages() on a range that we will try to compress results
      only in queuing a work to compress the data while holding the pages
      locked.
      
      Even though the generic code at iomap_dio_rw() will do the right thing
      and return -EAGAIN for NOWAIT requests in case there are pages in the
      range, we can still end up at btrfs_dio_iomap_begin() with pages in the
      range because either of the following can happen:
      
      1) Memory mapped writes, as we haven't locked the range yet;
      
      2) Buffered reads might have started, which lock the pages, and we do
         the filemap_fdatawrite_range() call before locking the file range.
      
      So don't call filemap_fdatawrite_range() at btrfs_dio_iomap_begin() if we
      are doing a NOWAIT read/write. Instead call filemap_range_needs_writeback()
      to check if there are any locked, dirty, or under writeback pages, and
      return -EAGAIN if that's the case.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b023e675
    • Jonathan Lassoff's avatar
      btrfs: add messages to printk index · b0a66a31
      Jonathan Lassoff authored
      In order for end users to quickly react to new issues that come up in
      production, it is proving useful to leverage this printk indexing
      system. This printk index enables kernel developers to use calls to
      printk() with changeable ad-hoc format strings, while still enabling end
      users to detect changes and develop a semi-stable interface for
      detecting and parsing these messages.
      
      So that detailed Btrfs messages are captured by this printk index, this
      patch wraps btrfs_printk and btrfs_handle_fs_error with macros.
      
      Example of the generated list:
      https://lore.kernel.org/lkml/12588e13d51a9c3bf59467d3fc1ac2162f1275c1.1647539056.git.jof@thejof.comSigned-off-by: default avatarJonathan Lassoff <jof@thejof.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b0a66a31
    • Qu Wenruo's avatar
      btrfs: tree-checker: check extent buffer owner against owner rootid · 88c602ab
      Qu Wenruo authored
      Btrfs doesn't check whether the tree block respects the root owner.
      This means, if a tree block referred by a parent in extent tree, but has
      owner of 5, btrfs can still continue reading the tree block, as long as
      it doesn't trigger other sanity checks.
      
      Normally this is fine, but combined with the empty tree check in
      check_leaf(), if we hit an empty extent tree, but the root node has
      csum tree owner, we can let such extent buffer to sneak in.
      
      Shrink the hole by:
      
      - Do extra eb owner check at tree read time
      
      - Make sure the root owner extent buffer exactly matches the root id.
      
      Unfortunately we can't yet completely patch the hole, there are several
      call sites can't pass all info we need:
      
      - For reloc/log trees
        Their owner is key::offset, not key::objectid.
        We need the full root key to do that accurate check.
      
        For now, we just skip the ownership check for those trees.
      
      - For add_data_references() of relocation
        That call site doesn't have any parent/ownership info, as all the
        bytenrs are all from btrfs_find_all_leafs().
      
      - For direct backref items walk
        Direct backref items records the parent bytenr directly, thus unlike
        indirect backref item, we don't do a full tree search.
      
        Thus in that case, we don't have full parent owner to check.
      
      For the later two cases, they all pass 0 as @owner_root, thus we can
      skip those cases if @owner_root is 0.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      88c602ab
    • Filipe Manana's avatar
      btrfs: add and use helper to assert an inode range is clean · 63c34cb4
      Filipe Manana authored
      We have four different scenarios where we don't expect to find ordered
      extents after locking a file range:
      
      1) During plain fallocate;
      2) During hole punching;
      3) During zero range;
      4) During reflinks (both cloning and deduplication).
      
      This is because in all these cases we follow the pattern:
      
      1) Lock the inode's VFS lock in exclusive mode;
      
      2) Lock the inode's i_mmap_lock in exclusive node, to serialize with
         mmap writes;
      
      3) Flush delalloc in a file range and wait for all ordered extents
         to complete - both done through btrfs_wait_ordered_range();
      
      4) Lock the file range in the inode's io_tree.
      
      So add a helper that asserts that we don't have ordered extents for a
      given range. Make the four scenarios listed above use this helper after
      locking the respective file range.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63c34cb4
    • Filipe Manana's avatar
      btrfs: remove ordered extent check and wait during hole punching and zero range · 55961c8a
      Filipe Manana authored
      For hole punching and zero range we have this loop that checks if we have
      ordered extents after locking the file range, and if so unlock the range,
      wait for ordered extents, and retry until we don't find more ordered
      extents.
      
      This logic was needed in the past because:
      
      1) Direct IO writes within the i_size boundary did not take the inode's
         VFS lock. This was because that lock used to be a mutex, then some
         years ago it was switched to a rw semaphore (commit 9902af79
         ("parallel lookups: actual switch to rwsem")), and then btrfs was
         changed to take the VFS inode's lock in shared mode for writes that
         don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
         shared lock for direct writes within EOF"));
      
      2) We could race with memory mapped writes, because memory mapped writes
         don't acquire the inode's VFS lock. We don't have that race anymore,
         as we have a rw semaphore to synchronize memory mapped writes with
         fallocate (and reflinking too). That change happened with commit
         8d9b4a16 ("btrfs: exclude mmap from happening during all
         fallocate operations").
      
      So stop looking for ordered extents after locking the file range when
      doing hole punching and zero range operations.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      55961c8a
    • Filipe Manana's avatar
      btrfs: lock the inode first before flushing range when punching hole · bd6526d0
      Filipe Manana authored
      When doing hole punching we are flushing delalloc and waiting for ordered
      extents to complete before locking the inode (VFS lock and the btrfs
      specific i_mmap_lock). This is fine because even if a write happens after
      we call btrfs_wait_ordered_range() and before we lock the inode (call
      btrfs_inode_lock()), we will notice the write at
      btrfs_punch_hole_lock_range() and flush delalloc and wait for its ordered
      extent.
      
      We can however make this simpler by locking first the inode an then call
      btrfs_wait_ordered_range(), which will allow us to remove the ordered
      extent lookup logic from btrfs_punch_hole_lock_range() in the next patch.
      It also makes the behaviour the same as plain fallocate, hole punching
      and reflinks.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bd6526d0
    • Filipe Manana's avatar
      btrfs: remove ordered extent check and wait during fallocate · ffa8fc60
      Filipe Manana authored
      For fallocate() we have this loop that checks if we have ordered extents
      after locking the file range, and if so unlock the range, wait for ordered
      extents, and retry until we don't find more ordered extents.
      
      This logic was needed in the past because:
      
      1) Direct IO writes within the i_size boundary did not take the inode's
         VFS lock. This was because that lock used to be a mutex, then some
         years ago it was switched to a rw semaphore (commit 9902af79
         ("parallel lookups: actual switch to rwsem")), and then btrfs was
         changed to take the VFS inode's lock in shared mode for writes that
         don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
         shared lock for direct writes within EOF"));
      
      2) We could race with memory mapped writes, because memory mapped writes
         don't acquire the inode's VFS lock. We don't have that race anymore,
         as we have a rw semaphore to synchronize memory mapped writes with
         fallocate (and reflinking too). That change happened with commit
         8d9b4a16 ("btrfs: exclude mmap from happening during all
         fallocate operations").
      
      So stop looking for ordered extents after locking the file range when
      doing a plain fallocate.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffa8fc60
    • Filipe Manana's avatar
      btrfs: remove inode_dio_wait() calls when starting reflink operations · 1c6cbbbe
      Filipe Manana authored
      When starting a reflink operation we have these calls to inode_dio_wait()
      which used to be needed because direct IO writes that don't cross the
      i_size boundary did not take the inode's VFS lock, so we could race with
      them and end up with ordered extents in target range after calling
      btrfs_wait_ordered_range().
      
      However that is not the case anymore, because the inode's VFS lock was
      changed from a mutex to a rw semaphore, by commit 9902af79
      ("parallel lookups: actual switch to rwsem"), and several years later we
      started to lock the inode's VFS lock in shared mode for direct IO writes
      that don't cross the i_size boundary (commit e9adabb9 ("btrfs: use
      shared lock for direct writes within EOF")).
      
      So remove those inode_dio_wait() calls.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1c6cbbbe
    • Filipe Manana's avatar
      btrfs: remove useless dio wait call when doing fallocate zero range · 831e1ee6
      Filipe Manana authored
      When starting a fallocate zero range operation, before getting the first
      extent map for the range, we make a call to inode_dio_wait().
      
      This logic was needed in the past because direct IO writes within the
      i_size boundary did not take the inode's VFS lock. This was because that
      lock used to be a mutex, then some years ago it was switched to a rw
      semaphore (by commit 9902af79 ("parallel lookups: actual switch to
      rwsem")), and then btrfs was changed to take the VFS inode's lock in
      shared mode for writes that don't cross the i_size boundary (done in
      commit e9adabb9 ("btrfs: use shared lock for direct writes within
      EOF")). The lockless direct IO writes could result in a race with the
      zero range operation, resulting in the later getting a stale extent
      map for the range.
      
      So remove this no longer needed call to inode_dio_wait(), as fallocate
      takes the inode's VFS lock in exclusive mode and direct IO writes within
      i_size take that same lock in shared mode.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      831e1ee6
    • Filipe Manana's avatar
      btrfs: only reserve the needed data space amount during fallocate · 47e1d1c7
      Filipe Manana authored
      During a plain fallocate, we always start by reserving an amount of data
      space that matches the length of the range passed to fallocate. When we
      already have extents allocated in that range, we may end up trying to
      reserve a lot more data space then we need, which can result in several
      undesired behaviours:
      
      1) We fail with -ENOSPC. For example the passed range has a length
         of 1G, but there's only one hole with a size of 1M in that range;
      
      2) We temporarily reserve excessive data space that could be used by
         other operations happening concurrently;
      
      3) By reserving much more data space then we need, we can end up
         doing expensive things like triggering dellaloc for other inodes,
         waiting for the ordered extents to complete, trigger transaction
         commits, allocate new block groups, etc.
      
      Example:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/sdj
        MNT=/mnt/sdj
      
        mkfs.btrfs -f -b 1g $DEV
        mount $DEV $MNT
      
        # Create a file with a size of 600M and two holes, one at [200M, 201M[
        # and another at [401M, 402M[
        xfs_io -f -c "pwrite -S 0xab 0 200M" \
                  -c "pwrite -S 0xcd 201M 200M" \
                  -c "pwrite -S 0xef 402M 198M" \
                  $MNT/foobar
      
        # Now call fallocate against the whole file range, see if it fails
        # with -ENOSPC or not - it shouldn't since we only need to allocate
        # 2M of data space.
        xfs_io -c "falloc 0 600M" $MNT/foobar
      
        umount $MNT
      
        $ ./test.sh
        (...)
        wrote 209715200/209715200 bytes at offset 0
        200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
        wrote 209715200/209715200 bytes at offset 210763776
        200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
        wrote 207618048/207618048 bytes at offset 421527552
        198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
        fallocate: No space left on device
        $
      
      So fix this by not allocating an amount of data space that matches the
      length of the range passed to fallocate. Instead allocate an amount of
      data space that corresponds to the sum of the sizes of each hole found
      in the range. This reservation now happens after we have locked the file
      range, which is safe since we know at this point there's no delalloc
      in the range because we've taken the inode's VFS lock in exclusive mode,
      we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
      delalloc and waited for all ordered extents in the range to complete.
      
      This type of failure actually seems to happen in practice with systemd,
      and we had at least one report about this in a very long thread which
      is referenced by the Link tag below.
      
      Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47e1d1c7
    • Sweet Tea Dorminy's avatar
      btrfs: restore inode creation before xattr setting · 6c3636eb
      Sweet Tea Dorminy authored
      According to the tree checker, "all xattrs with a given objectid follow
      the inode with that objectid in the tree" is an invariant. This was
      broken by the recent change "btrfs: move common inode creation code into
      btrfs_create_new_inode()", which moved acl creation and property
      inheritance (stored in xattrs) to before inode insertion into the tree.
      As a result, under certain timings, the xattrs could be written to the
      tree before the inode, causing the tree checker to report violation of
      the invariant.
      
      Move property inheritance and acl creation back to their old ordering
      after the inode insertion.
      Suggested-by: default avatarOmar Sandoval <osandov@osandov.com>
      Reported-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6c3636eb
    • Omar Sandoval's avatar
      btrfs: move common inode creation code into btrfs_create_new_inode() · caae78e0
      Omar Sandoval authored
      All of our inode creation code paths duplicate the calls to
      btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
      additionally duplicates property inheritance and the call to
      btrfs_set_inode_index(). Fix this by moving the common code into
      btrfs_create_new_inode(). This accomplishes a few things at once:
      
      1. It reduces code duplication.
      
      2. It allows us to set up the inode completely before inserting the
         inode item, removing calls to btrfs_update_inode().
      
      3. It fixes a leak of an inode on disk in some error cases. For example,
         in btrfs_create(), if btrfs_new_inode() succeeds, then we have
         inserted an inode item and its inode ref. However, if something after
         that fails (e.g., btrfs_init_inode_security()), then we end the
         transaction and then decrement the link count on the inode. If the
         transaction is committed and the system crashes before the failed
         inode is deleted, then we leak that inode on disk. Instead, this
         refactoring aborts the transaction when we can't recover more
         gracefully.
      
      4. It exposes various ways that subvolume creation diverges from mkdir
         in terms of inheriting flags, properties, permissions, and POSIX
         ACLs, a lot of which appears to be accidental. This patch explicitly
         does _not_ change the existing non-standard behavior, but it makes
         those differences more clear in the code and documents them so that
         we can discuss whether they should be changed.
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      caae78e0
    • Omar Sandoval's avatar
      btrfs: reserve correct number of items for inode creation · 3538d68d
      Omar Sandoval authored
      The various inode creation code paths do not account for the compression
      property, POSIX ACLs, or the parent inode item when starting a
      transaction. Fix it by refactoring all of these code paths to use a new
      function, btrfs_new_inode_prepare(), which computes the correct number
      of items. To do so, it needs to know whether POSIX ACLs will be created,
      so move the ACL creation into that function. To reduce the number of
      arguments that need to be passed around for inode creation, define
      struct btrfs_new_inode_args containing all of the relevant information.
      
      btrfs_new_inode_prepare() will also be a good place to set up the
      fscrypt context and encrypted filename in the future.
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3538d68d