1. 08 Jul, 2022 3 commits
    • btrfs: zoned: drop optimization of zone finish · b3a3b025
      Naohiro Aota authored
      We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only
      when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we
      wrote fully into the zone.
      
      The assumption is determined by "alloc_offset == capacity". This condition
      won't work if the last ordered extent is canceled due to some errors. In
      that case, we consider the zone deactivated without sending the finish
      command while it's still active.
      
      This inconsistency results in activating another block group while we cannot
      really activate the underlying zone, which causes "active zones exceeded"
      errors like the ones below.
      
          BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0
          nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
          active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
          nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
          active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
      
      Fix the issue by removing the optimization for now.
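
      The failure mode described above can be illustrated with a toy model.
      This is only a sketch under stated assumptions, not the kernel code:
      the struct, its fields and all toy_* helpers are hypothetical stand-ins
      for the block group state and for sending REQ_OP_ZONE_FINISH.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of a zoned block group (not the kernel struct). */
struct toy_block_group {
	uint64_t alloc_offset; /* bytes reserved by the allocator */
	uint64_t capacity;     /* usable bytes in the zone */
	uint64_t written;      /* bytes that actually reached the device */
	bool zone_active;      /* device-side view: zone still counts as active */
};

/* Reserve space; the write itself may still be canceled later. */
static void toy_alloc(struct toy_block_group *bg, uint64_t len)
{
	assert(bg->alloc_offset + len <= bg->capacity);
	bg->alloc_offset += len;
}

/* A zone that is written up to its capacity stops being active on its own. */
static void toy_write(struct toy_block_group *bg, uint64_t len)
{
	bg->written += len;
	if (bg->written == bg->capacity)
		bg->zone_active = false;
}

/* Old behaviour: skip the finish command when the zone is assumed to be full. */
static void toy_zone_finish_optimized(struct toy_block_group *bg)
{
	if (bg->alloc_offset == bg->capacity)
		return; /* wrong if the last write never happened */
	bg->zone_active = false;
}

/* Behaviour after the fix: always send the finish command. */
static void toy_zone_finish_always(struct toy_block_group *bg)
{
	bg->zone_active = false;
}
```

      In this model, allocating the full capacity but writing only part of it
      (the canceled ordered extent) leaves zone_active set after
      toy_zone_finish_optimized(), while toy_zone_finish_always() clears it.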
      
      Fixes: 8376d9e1 ("btrfs: zoned: finish superblock zone once no space left for new SB")
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      b3a3b025
    • btrfs: zoned: fix a leaked bioc in read_zone_info · 29634578
      Christoph Hellwig authored
      The bioc would leak on the normal completion path and also on the RAID56
      check (but that one won't happen in practice due to the invalid
      combination with zoned mode).
      
      Fixes: 7db1c5d1 ("btrfs: zoned: support dev-replace in zoned filesystems")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      [ update changelog ]
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      29634578
    • btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents · a4527e18
      Filipe Manana authored
      When doing a direct IO read or write, we always return -ENOTBLK when we
      find a compressed extent (or an inline extent) so that we fall back to
      buffered IO. This however is not ideal in case we are in a NOWAIT context
      (io_uring for example), because buffered IO can block and we currently
      have no support for NOWAIT semantics for buffered IO. So if we need to
      fall back to buffered IO, we should first signal the caller that we may
      need to block by returning -EAGAIN instead.
      
      This behaviour can also result in short reads being returned to user
      space, which although it's not incorrect and user space should be able
      to deal with partial reads, it's somewhat surprising and even some popular
      applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
      deal with short reads properly (or at all).
      
      The short read case happens when we try to read from a range that has a
      non-compressed and non-inline extent followed by a compressed extent.
      After having read the first extent, when we find the compressed extent we
      return -ENOTBLK from btrfs_dio_iomap_begin(), which makes iomap
      treat the request as a short read, returning 0 (success) and waiting for
      previously submitted bios to complete (this happens at
      fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
      btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
      read the remaining data, and pass it the number of bytes we were able to
      read with direct IO. Then at filemap_read() if we get a page fault error
      when accessing the read buffer, we return a partial read instead of an
      -EFAULT error, because the number of bytes previously read is greater
      than zero.
      
      So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
      compressed or an inline extent.
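
      The decision described above can be sketched as a small pure function.
      This is a minimal illustration, not the btrfs implementation: the enum
      and toy_dio_extent_check() are hypothetical names, and only the choice
      between -ENOTBLK and -EAGAIN is taken from the text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of the extent found under the IO range. */
enum toy_extent_kind {
	TOY_EXTENT_REGULAR,
	TOY_EXTENT_COMPRESSED,
	TOY_EXTENT_INLINE,
};

/*
 * Returns 0 when direct IO can proceed. For compressed and inline extents
 * the caller must use buffered IO: -ENOTBLK requests the fallback, while
 * -EAGAIN tells a NOWAIT caller that the operation may block.
 */
static int toy_dio_extent_check(enum toy_extent_kind kind, bool nowait)
{
	switch (kind) {
	case TOY_EXTENT_REGULAR:
		return 0;
	case TOY_EXTENT_COMPRESSED:
	case TOY_EXTENT_INLINE:
		return nowait ? -EAGAIN : -ENOTBLK;
	}
	assert(!"unknown extent kind");
	return -EINVAL;
}
```

      A NOWAIT caller that receives -EAGAIN is expected to retry from a
      context where blocking is allowed, so the partial-read path described
      above is not entered.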
      Reported-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
      Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
      Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582
      Tested-by: Dominique MARTINET <dominique.martinet@atmark-techno.com>
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      a4527e18
  2. 21 Jun, 2022 8 commits
    • Documentation: update btrfs list of features and link to readthedocs.io · 037e1274
      David Sterba authored
      The btrfs documentation in kernel is only meant as a starting point, so
      update the list of features and add link to btrfs.readthedocs.io page
      that is most up-to-date. The wiki is still used but information is
      migrated from there.
      Signed-off-by: David Sterba <dsterba@suse.com>
      037e1274
    • btrfs: fix deadlock with fsync+fiemap+transaction commit · bf7ba8ee
      Josef Bacik authored
      We are hitting the following deadlock in production occasionally
      
      Task 1		Task 2		Task 3		Task 4		Task 5
      		fsync(A)
      		 start trans
      						start commit
      				falloc(A)
      				 lock 5m-10m
      				 start trans
      				  wait for commit
      fiemap(A)
       lock 0-10m
        wait for 5m-10m
         (have 0-5m locked)
      
      		 have btrfs_need_log_full_commit
      		  !full_sync
      		  wait_ordered_extents
      								finish_ordered_io(A)
      								lock 0-5m
      								DEADLOCK
      
      We have an existing dependency of file extent lock -> transaction.
      However in fsync if we tried to do the fast logging, but then had to
      fall back to committing the transaction, we will be forced to call
      btrfs_wait_ordered_range() to make sure all of our extents are updated.
      
      This creates a dependency of transaction -> file extent lock, because
      btrfs_finish_ordered_io() will need to take the file extent lock in
      order to run the ordered extents.
      
      Fix this by stopping the transaction if we have to do the full commit
      and we attempted to do the fast logging.  Then attach to the transaction
      and commit it if we need to.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      bf7ba8ee
    • btrfs: don't set lock_owner when locking extent buffer for reading · 97e86631
      Zygo Blaxell authored
      In 196d59ab "btrfs: switch extent buffer tree lock to rw_semaphore"
      the functions for tree read locking were rewritten, and in the process
      the read lock functions started setting eb->lock_owner = current->pid.
      Previously lock_owner was only set in tree write lock functions.
      
      Read locks are shared, so they don't have exclusive ownership of the
      underlying object, so setting lock_owner to any single value for a
      read lock makes no sense.  It's mostly harmless because write locks
      and read locks are mutually exclusive, and none of the existing code
      in btrfs (btrfs_init_new_buffer and print_eb_refs_lock) cares what
      nonsense is written in lock_owner when no writer is holding the lock.
      
      KCSAN does care, and will complain about the data race incessantly.
      Remove the assignments in the read lock functions because they're
      useless noise.
      
      Fixes: 196d59ab ("btrfs: switch extent buffer tree lock to rw_semaphore")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
      Signed-off-by: David Sterba <dsterba@suse.com>
      97e86631
    • btrfs: zoned: fix critical section of relocation inode writeback · 19ab78ca
      Naohiro Aota authored
      We use btrfs_zoned_data_reloc_{lock,unlock} to allow only one process to
      write out to the relocation inode. That critical section must include all
      the IO submission for the inode. However, flush_write_bio() in
      extent_writepages() is out of the critical section, causing an IO
      submission outside of the lock. This leads to out-of-order IO
      submission and makes the relocation process fail.
      
      Fix it by extending the critical section.
      
      Fixes: 35156d85 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      19ab78ca
    • btrfs: zoned: prevent allocation from previous data relocation BG · 343d8a30
      Naohiro Aota authored
      After commit 5f0addf7 ("btrfs: zoned: use dedicated lock for data
      relocation"), we observe IO errors on e.g. btrfs/232 like below.
      
        [09.0][T4038707] WARNING: CPU: 3 PID: 4038707 at fs/btrfs/extent-tree.c:2381 btrfs_cross_ref_exist+0xfc/0x120 [btrfs]
        <snip>
        [09.9][T4038707] Call Trace:
        [09.5][T4038707]  <TASK>
        [09.3][T4038707]  run_delalloc_nocow+0x7f1/0x11a0 [btrfs]
        [09.6][T4038707]  ? test_range_bit+0x174/0x320 [btrfs]
        [09.2][T4038707]  ? fallback_to_cow+0x980/0x980 [btrfs]
        [09.3][T4038707]  ? find_lock_delalloc_range+0x33e/0x3e0 [btrfs]
        [09.5][T4038707]  btrfs_run_delalloc_range+0x445/0x1320 [btrfs]
        [09.2][T4038707]  ? test_range_bit+0x320/0x320 [btrfs]
        [09.4][T4038707]  ? lock_downgrade+0x6a0/0x6a0
        [09.2][T4038707]  ? orc_find.part.0+0x1ed/0x300
        [09.5][T4038707]  ? __module_address.part.0+0x25/0x300
        [09.0][T4038707]  writepage_delalloc+0x159/0x310 [btrfs]
        <snip>
        [09.4][    C3] sd 10:0:1:0: [sde] tag#2620 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 Sense Key : Illegal Request [current]
        [09.9][    C3] sd 10:0:1:0: [sde] tag#2620 Add. Sense: Unaligned write command
        [09.5][    C3] sd 10:0:1:0: [sde] tag#2620 CDB: Write(16) 8a 00 00 00 00 00 02 f3 63 87 00 00 00 2c 00 00
        [09.4][    C3] critical target error, dev sde, sector 396041272 op 0x1:(WRITE) flags 0x800 phys_seg 3 prio class 0
        [09.9][    C3] BTRFS error (device dm-1): bdev /dev/mapper/dml_102_2 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
      
      The IO errors occur when we allocate a regular extent in previous data
      relocation block group.
      
      On zoned btrfs, we use a dedicated block group to relocate a data
      extent. Thus, we allocate relocating data extents (pre-alloc) only from
      the dedicated block group and vice versa. Once the free space in the
      dedicated block group gets tight, a relocating extent may not fit into
      the block group. In that case, we need to switch the dedicated block
      group to the next one. Then, the previous one is now freed up for
      allocating a regular extent. The BG no longer has enough room for
      the relocating extent, but there is still room to allocate a smaller
      extent. This is where the problem happens. By allocating a regular extent
      while nocow IOs for the relocation are still ongoing, we will issue WRITE
      IOs (for relocation) and ZONE APPEND IOs (for the regular writes) at the
      same time. These mixed IOs confuse the write pointer and give rise to the
      unaligned write errors.
      
      This commit introduces a new bit 'zoned_data_reloc_ongoing' to the
      btrfs_block_group. We set this bit before releasing the dedicated block
      group, and no extents are allocated from a block group having this bit
      set. This bit is similar to setting block_group->ro, but is different from
      it by allowing nocow writes to start.
      
      Once all the nocow IO for relocation is done (hooked from
      btrfs_finish_ordered_io), we reset the bit to release the block group for
      further allocation.
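
      The lifecycle of the new bit can be sketched with a toy allocator
      check. This is an illustration under stated assumptions only: the
      struct and the toy_* functions are hypothetical, and the real
      allocator performs many more checks than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified block group with the two flags discussed above. */
struct toy_block_group {
	bool ro;                       /* read-only: no allocation at all */
	bool zoned_data_reloc_ongoing; /* relocation nocow IO still in flight */
	uint64_t free_bytes;
};

/* New extents are refused while either flag is set. */
static bool toy_can_allocate(const struct toy_block_group *bg, uint64_t len)
{
	if (bg->ro || bg->zoned_data_reloc_ongoing)
		return false;
	return bg->free_bytes >= len;
}

/* Called when the block group stops being the dedicated relocation BG. */
static void toy_release_reloc_bg(struct toy_block_group *bg)
{
	bg->zoned_data_reloc_ongoing = true;
}

/* Called once all nocow IO for the relocation has finished. */
static void toy_reloc_io_done(struct toy_block_group *bg)
{
	assert(bg->zoned_data_reloc_ongoing);
	bg->zoned_data_reloc_ongoing = false;
}
```

      Between toy_release_reloc_bg() and toy_reloc_io_done() the block group
      accepts no new extents, so regular ZONE APPEND writes cannot be mixed
      with the relocation WRITEs in the same zone.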
      
      Fixes: c2707a25 ("btrfs: zoned: add a dedicated data relocation block group")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      343d8a30
    • btrfs: do not BUG_ON() on failure to migrate space when replacing extents · 650c9cab
      Filipe Manana authored
      At btrfs_replace_file_extents(), if we fail to migrate reserved metadata
      space from the transaction block reserve into the local block reserve,
      we trigger a BUG_ON(). This is because it should not be possible to have
      a failure here, as we reserved more space when we started the transaction
      than the space we want to migrate. However, a BUG_ON() is way too
      drastic; we can perfectly well handle the failure and return the error to
      the caller. So just do that instead, and add a WARN_ON() to make it easier
      to notice the failure if it ever happens (which is particularly useful
      for fstests, and the warning will trigger a failure of a test case).
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      650c9cab
    • btrfs: add missing inode updates on each iteration when replacing extents · 983d8209
      Filipe Manana authored
      When replacing file extents, called during fallocate, hole punching,
      clone and deduplication, we may not be able to replace/drop all the
      target file extent items with a single transaction handle. We may get
      -ENOSPC while doing it, in which case we release the transaction handle,
      balance the dirty pages of the btree inode, flush delayed items and get
      a new transaction handle to operate on what's left of the target range.
      
      By dropping and replacing file extent items we have effectively modified
      the inode, so we should bump its iversion and update its mtime/ctime
      before we update the inode item. This is because if the transaction
      we used for partially modifying the inode gets committed by someone after
      we release it and before we finish the rest of the range, and a power
      failure then happens, after mounting the filesystem our inode has an
      outdated iversion and mtime/ctime, corresponding to the values it had
      before we changed it.
      
      So add the missing iversion and mtime/ctime updates.
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      983d8209
    • btrfs: fix race between reflinking and ordered extent completion · d4597898
      Filipe Manana authored
      While doing a reflink operation, if an ordered extent completes for a
      file range that does not overlap with the source and destination ranges
      of the reflink operation, we can end up having a failure in the reflink
      operation and return -EINVAL to user space.
      
      The following sequence of steps explains how this can happen:
      
      1) We have the page at file offset 315392 dirty (under delalloc);
      
      2) A reflink operation for this file starts, using the same file as both
         source and destination, the source range is [372736, 409600) (length of
         36864 bytes) and the destination range is [208896, 245760);
      
      3) At btrfs_remap_file_range_prep(), we flush all delalloc in the source
         and destination ranges, and wait for any ordered extents in those ranges
         to complete;
      
      4) Still at btrfs_remap_file_range_prep(), we then flush all delalloc in
         the inode, but we neither wait for it to complete nor any ordered
         extents to complete. This results in starting delalloc for the page at
         file offset 315392 and creating an ordered extent for that single page
         range;
      
      5) We then move to btrfs_clone() and enter the loop to find file extent
         items to copy from the source range to destination range;
      
      6) In the first iteration we end up at the last file extent item stored in
         leaf A:
      
         (...)
         item 131 key (143616 108 315392) itemoff 5101 itemsize 53
                  extent data disk bytenr 1903988736 nr 73728
                  extent data offset 12288 nr 61440 ram 73728
      
         This represents the file range [315392, 376832), which overlaps with
         the source range to clone.
      
         @datal is set to 61440, key.offset is 315392 and @next_key_min_offset
         is therefore set to 376832 (315392 + 61440).
      
         @off (372736) is > key.offset (315392), so @new_key.offset is set to
         the value of @destoff (208896).
      
         @new_key.offset == @last_dest_end (208896) so @drop_start is set to
         208896 (@new_key.offset).
      
         @datal is adjusted to 4096, as @off is > @key.offset.
      
         So in this iteration we call btrfs_replace_file_extents() for the range
         [208896, 212991] (a single page, which is
         [@drop_start, @new_key.offset + @datal - 1]).
      
         @last_dest_end is set to 212992 (@new_key.offset + @datal =
         208896 + 4096 = 212992).
      
         Before the next iteration of the loop, @key.offset is set to the value
         376832, which is @next_key_min_offset;
      
      7) On the second iteration btrfs_search_slot() leaves us again at leaf A,
         but this time pointing beyond the last slot of leaf A, as that's where
         a key with offset 376832 should be at if it existed. So we end up
         calling btrfs_next_leaf();
      
      8) btrfs_next_leaf() releases the path, but before it searches again the
         tree for the next key/leaf, the ordered extent for the single page
         range at file offset 315392 completes. That results in trimming the
         file extent item we processed before, adjusting its key offset from
         315392 to 319488, reducing its length from 61440 to 57344 and inserting
         a new file extent item for that single page range, with a key offset of
         315392 and a length of 4096.
      
         Leaf A now looks like:
      
           (...)
           item 132 key (143616 108 315392) itemoff 4995 itemsize 53
                    extent data disk bytenr 1801666560 nr 4096
                    extent data offset 0 nr 4096 ram 4096
           item 133 key (143616 108 319488) itemoff 4942 itemsize 53
                    extent data disk bytenr 1903988736 nr 73728
                    extent data offset 16384 nr 57344 ram 73728
      
      9) When btrfs_next_leaf() returns, it gives us a path pointing to leaf A
         at slot 133, since it's the first key that follows what was the last
         key we saw (143616 108 315392). In fact it's the same item we processed
         before, but its key offset was changed, so it counts as a new key;
      
      10) So now we have:
      
          @key.offset == 319488
          @datal == 57344
      
          @off (372736) is > key.offset (319488), so @new_key.offset is set to
          208896 (@destoff value).
      
          @new_key.offset (208896) != @last_dest_end (212992), so @drop_start
          is set to 212992 (@last_dest_end value).
      
          @datal is adjusted to 4096 because @off > @key.offset.
      
          So in this iteration we call btrfs_replace_file_extents() for the
          invalid range of [212992, 212991] (which is
          [@drop_start, @new_key.offset + @datal - 1]).
      
          This range is empty, the end offset is smaller than the start offset
          so btrfs_replace_file_extents() returns -EINVAL, which we end up
          returning to user space and fail the reflink operation.
      
          This all happens because the range of this file extent item was
          already processed in the previous iteration.
      
      This scenario can be triggered very sporadically by fsx from fstests, for
      example with test case generic/522.
      
      So fix this by having btrfs_clone() skip file extent items that cover a
      file range that we have already processed.
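
      The arithmetic from steps 6) and 10) can be reproduced with a small
      model of one loop iteration. This is a sketch under stated
      assumptions, not the code of btrfs_clone(): the structs and
      toy_clone_step() are hypothetical, the prev_extent_end field is an
      assumed way of remembering what was already processed, and clamping
      to the end of the source range is left out because it does not
      trigger with these numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy state for one clone operation; field names mirror the text above. */
struct toy_clone {
	uint64_t off;             /* start of the source range */
	uint64_t destoff;         /* start of the destination range */
	uint64_t last_dest_end;   /* end of the last destination range replaced */
	uint64_t prev_extent_end; /* hypothetical: end of the last source extent handled */
};

struct toy_result {
	bool skipped;   /* item covered only already-processed bytes */
	uint64_t start; /* first byte passed to the replace step */
	uint64_t end;   /* last byte passed to the replace step (inclusive) */
};

/* Process one file extent item with the given key offset and length. */
static struct toy_result toy_clone_step(struct toy_clone *c, uint64_t key_offset,
					uint64_t datal, bool skip_processed)
{
	struct toy_result r = { false, 0, 0 };
	uint64_t new_key_offset, drop_start;

	/* The item must reach into the source range, or datal would underflow. */
	assert(key_offset + datal > c->off);

	/* The fix: ignore items whose range was fully handled before. */
	if (skip_processed && key_offset + datal <= c->prev_extent_end) {
		r.skipped = true;
		return r;
	}
	c->prev_extent_end = key_offset + datal;

	if (c->off > key_offset) {
		new_key_offset = c->destoff;
		datal -= c->off - key_offset;
	} else {
		new_key_offset = key_offset - c->off + c->destoff;
	}

	drop_start = (new_key_offset != c->last_dest_end) ?
		     c->last_dest_end : new_key_offset;

	r.start = drop_start;
	r.end = new_key_offset + datal - 1;
	c->last_dest_end = new_key_offset + datal;
	return r;
}
```

      With off = 372736 and destoff = 208896, feeding the item at key offset
      315392 (length 61440) and then the trimmed item at 319488 (length
      57344) yields [208896, 212991] followed by the empty range
      [212992, 212991] when skip_processed is false. With skip_processed
      set, the second item is skipped instead.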
      
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: Boris Burkov <boris@bur.io>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d4597898
  3. 07 Jun, 2022 1 commit
  4. 06 Jun, 2022 2 commits
    • btrfs: prevent remounting to v1 space cache for subpage mount · 0591f040
      Qu Wenruo authored
      Upstream commit 9f73f1ae ("btrfs: force v2 space cache usage for
      subpage mount") forces subpage mount to use v2 cache, to avoid
      deprecated v1 cache which doesn't support subpage properly.
      
      But there is a loophole: a user can still remount to v1 cache.
      
      The existing check only gives users a warning, but does not actually
      prevent the remount.
      
      Although remounting to v1 will not cause any problems since the v1 cache
      will always be marked invalid when mounted with a different page size,
      it's still better to prevent the v1 cache entirely for subpage mounts.
      
      Fixes: 9f73f1ae ("btrfs: force v2 space cache usage for subpage mount")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0591f040
    • btrfs: fix hang during unmount when block group reclaim task is running · 31e70e52
      Filipe Manana authored
      When we start an unmount, at close_ctree(), if we have the reclaim task
      running and in the middle of a data block group relocation, we can trigger
      a deadlock when stopping an async reclaim task, producing a trace like the
      following:
      
      [629724.498185] task:kworker/u16:7   state:D stack:    0 pid:681170 ppid:     2 flags:0x00004000
      [629724.499760] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
      [629724.501267] Call Trace:
      [629724.501759]  <TASK>
      [629724.502174]  __schedule+0x3cb/0xed0
      [629724.502842]  schedule+0x4e/0xb0
      [629724.503447]  btrfs_wait_on_delayed_iputs+0x7c/0xc0 [btrfs]
      [629724.504534]  ? prepare_to_wait_exclusive+0xc0/0xc0
      [629724.505442]  flush_space+0x423/0x630 [btrfs]
      [629724.506296]  ? rcu_read_unlock_trace_special+0x20/0x50
      [629724.507259]  ? lock_release+0x220/0x4a0
      [629724.507932]  ? btrfs_get_alloc_profile+0xb3/0x290 [btrfs]
      [629724.508940]  ? do_raw_spin_unlock+0x4b/0xa0
      [629724.509688]  btrfs_async_reclaim_metadata_space+0x139/0x320 [btrfs]
      [629724.510922]  process_one_work+0x252/0x5a0
      [629724.511694]  ? process_one_work+0x5a0/0x5a0
      [629724.512508]  worker_thread+0x52/0x3b0
      [629724.513220]  ? process_one_work+0x5a0/0x5a0
      [629724.514021]  kthread+0xf2/0x120
      [629724.514627]  ? kthread_complete_and_exit+0x20/0x20
      [629724.515526]  ret_from_fork+0x22/0x30
      [629724.516236]  </TASK>
      [629724.516694] task:umount          state:D stack:    0 pid:719055 ppid:695412 flags:0x00004000
      [629724.518269] Call Trace:
      [629724.518746]  <TASK>
      [629724.519160]  __schedule+0x3cb/0xed0
      [629724.519835]  schedule+0x4e/0xb0
      [629724.520467]  schedule_timeout+0xed/0x130
      [629724.521221]  ? lock_release+0x220/0x4a0
      [629724.521946]  ? lock_acquired+0x19c/0x420
      [629724.522662]  ? trace_hardirqs_on+0x1b/0xe0
      [629724.523411]  __wait_for_common+0xaf/0x1f0
      [629724.524189]  ? usleep_range_state+0xb0/0xb0
      [629724.524997]  __flush_work+0x26d/0x530
      [629724.525698]  ? flush_workqueue_prep_pwqs+0x140/0x140
      [629724.526580]  ? lock_acquire+0x1a0/0x310
      [629724.527324]  __cancel_work_timer+0x137/0x1c0
      [629724.528190]  close_ctree+0xfd/0x531 [btrfs]
      [629724.529000]  ? evict_inodes+0x166/0x1c0
      [629724.529510]  generic_shutdown_super+0x74/0x120
      [629724.530103]  kill_anon_super+0x14/0x30
      [629724.530611]  btrfs_kill_super+0x12/0x20 [btrfs]
      [629724.531246]  deactivate_locked_super+0x31/0xa0
      [629724.531817]  cleanup_mnt+0x147/0x1c0
      [629724.532319]  task_work_run+0x5c/0xa0
      [629724.532984]  exit_to_user_mode_prepare+0x1a6/0x1b0
      [629724.533598]  syscall_exit_to_user_mode+0x16/0x40
      [629724.534200]  do_syscall_64+0x48/0x90
      [629724.534667]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [629724.535318] RIP: 0033:0x7fa2b90437a7
      [629724.535804] RSP: 002b:00007ffe0b7e4458 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [629724.536912] RAX: 0000000000000000 RBX: 00007fa2b9182264 RCX: 00007fa2b90437a7
      [629724.538156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000555d6cf20dd0
      [629724.539053] RBP: 0000555d6cf20ba0 R08: 0000000000000000 R09: 00007ffe0b7e3200
      [629724.539956] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
      [629724.540883] R13: 0000555d6cf20dd0 R14: 0000555d6cf20cb0 R15: 0000000000000000
      [629724.541796]  </TASK>
      
      This happens because:
      
      1) Before entering close_ctree() we have the async block group reclaim
         task running and relocating a data block group;
      
      2) There's an async metadata (or data) space reclaim task running;
      
      3) We enter close_ctree() and park the cleaner kthread;
      
      4) The async space reclaim task is at flush_space() and runs all the
         existing delayed iputs;
      
      5) Before the async space reclaim task calls
         btrfs_wait_on_delayed_iputs(), the block group reclaim task which is
         doing the data block group relocation, creates a delayed iput at
         replace_file_extents() (called when COWing leaves that have file extent
         items pointing to relocated data extents, during the merging phase
         of relocation roots);
      
      6) The async space reclaim task blocks at
         btrfs_wait_on_delayed_iputs(), since we have a new delayed iput;
      
      7) The task at close_ctree() then calls cancel_work_sync() to stop the
         async space reclaim task, but it blocks since that task is waiting for
         the delayed iput to be run;
      
      8) The delayed iput is never run because the cleaner kthread is parked,
         and no one else runs delayed iputs, resulting in a hang.
      
      So fix this by stopping the async block group reclaim task before we
      park the cleaner kthread.
      
      Fixes: 18bb8bbf ("btrfs: zoned: automatically reclaim zones")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      31e70e52
  5. 17 May, 2022 6 commits
    • btrfs: zoned: introduce a minimal zone size 4M and reject mount · 0a05fafe
      Johannes Thumshirn authored
      Zoned devices are expected to have zone sizes in the range of 1-2GB for
      ZNS SSDs, and SMR HDDs have zone sizes of 256MB, so there is no need to
      allow arbitrarily small zone sizes on btrfs.
      
      But for testing purposes with emulated devices it is sometimes desirable
      to create devices with zone sizes as small as 4MB to uncover errors.
      
      So use 4MB as the smallest possible zone size and reject mounts of devices
      with a smaller zone size.
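
      The rule above amounts to a single comparison at mount time. The
      sketch below is illustrative only: TOY_MIN_ZONE_SIZE and
      toy_check_zone_size() are hypothetical names, and returning -EINVAL is
      an assumption, since the text only says the mount is rejected.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Smallest accepted zone size: 4MB, as stated in the commit message. */
#define TOY_MIN_ZONE_SIZE (4ULL * 1024 * 1024)

/* Returns 0 if the zone size is acceptable, a negative errno otherwise. */
static int toy_check_zone_size(uint64_t zone_size)
{
	if (zone_size < TOY_MIN_ZONE_SIZE)
		return -EINVAL; /* assumed error code */
	return 0;
}
```

      Typical sizes from the text (256MB for SMR HDDs, 1-2GB for ZNS SSDs)
      pass this check, as does an emulated device with exactly 4MB zones.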
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0a05fafe
    • btrfs: allow defrag to convert inline extents to regular extents · d8101a0c
      Qu Wenruo authored
      Btrfs defaults to max_inline=2K to make small writes inlined into
      metadata.
      
      The default value is always a win: even though DUP/RAID1/RAID10 doubles
      the metadata usage, inlining should still use less physical space than
      a 4K regular extent.
      
      But since the introduction of RAID1C3 and RAID1C4 this is no longer the
      case. Users may find inlined extents wasting too much space, and want
      to convert those inlined extents back to regular extents.
      
      Unfortunately defrag will unconditionally skip all inline extents, even
      if the user is trying to convert them back to regular extents.
      
      So this patch will add a small exception for defrag_collect_targets() to
      allow defragging inline extents, if and only if the inlined extents are
      larger than max_inline, allowing users to convert them to regular ones.
      
      This also allows us to defrag extents like the following:
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15794 itemsize 69
      		generation 7 type 0 (inline)
      		inline extent data size 48 ram_bytes 4096 compression 1 (zlib)
      	item 7 key (257 EXTENT_DATA 4096) itemoff 15741 itemsize 53
      		generation 7 type 1 (regular)
      		extent data disk byte 13631488 nr 4096
      		extent data offset 0 nr 16384 ram 16384
      		extent compression 1 (zlib)
      
      Previously we were unable to do any defrag, since the first extent is
      inlined, and the second one has no extent to merge with.
      
      Now we can defrag it to just one single extent, saving 48 bytes metadata
      space.
      
      	item 6 key (257 EXTENT_DATA 0) itemoff 15810 itemsize 53
      		generation 8 type 1 (regular)
      		extent data disk byte 13635584 nr 4096
      		extent data offset 0 nr 20480 ram 20480
      		extent compression 1 (zlib)
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d8101a0c
    • btrfs: add "0x" prefix for unsupported optional features · d5321a0f
      Qu Wenruo authored
      The following error message obviously lacks the "0x" prefix:
      
        cannot mount because of unsupported optional features (4000)
      
      Add the prefix to make it less confusing. This can happen on older
      kernels that try to mount a filesystem with newer features so it makes
      sense to backport to older trees.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      d5321a0f
    • Filipe Manana's avatar
      btrfs: do not account twice for inode ref when reserving metadata units · 97bdf1a9
      Filipe Manana authored
      When reserving metadata units for creating an inode, we don't need to
      reserve one extra unit for the inode ref item because when creating the
      inode, at btrfs_create_new_inode(), we always insert the inode item and
      the inode ref item in a single batch (a single btree insert operation,
      and both ending up in the same leaf).
      
      As we have already accounted one unit for the inode item, the extra unit
      for the inode ref item is superfluous: it only makes us reserve more
      metadata than necessary, often adding more reclaim pressure if we are
      low on available metadata space.
      
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      97bdf1a9
    • Naohiro Aota's avatar
      btrfs: zoned: fix comparison of alloc_offset vs meta_write_pointer · aa9ffadf
      Naohiro Aota authored
      The block_group->alloc_offset is an offset from the start of the block
      group. OTOH, the ->meta_write_pointer is an address in the logical
      space. So, we should compare the ->meta_write_pointer against the
      alloc_offset shifted by block_group->start.
      
      Fixes: afba2bc0 ("btrfs: zoned: implement active zone tracking")
      CC: stable@vger.kernel.org # 5.16+
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      aa9ffadf
    • Filipe Manana's avatar
      btrfs: send: avoid trashing the page cache · 152555b3
      Filipe Manana authored
      A send operation reads extent data using the buffered IO path to get
      the data to send in write commands, both because it's simple and to
      make use of the generic readahead infrastructure, which results in
      a massive speedup.
      
      However this fills the page cache with data that, most of the time, is
      really only used by the send operation - once the write commands are sent,
      it's not useful to have the data in the page cache anymore. For large
      snapshots, bringing all data into the page cache eventually leads to the
      need to evict other data from the page cache that may be more useful for
      applications (and kernel subsystems).
      
      Even if extents are shared with the subvolume on which a snapshot is
      based and the data is currently in the page cache due to being read through
      the subvolume, attempting to read the data through the snapshot will
      always result in bringing a new copy of the data into another location in
      the page cache (there's currently no shared memory for shared extents).
      
      So make send evict the data it has read if, when it first opened the
      inode, its mapping had no pages loaded, that is, when
      inode->i_mapping->nr_pages has a value of 0. Do this instead of deciding
      based on the return value of filemap_range_has_page() before reading an
      extent because the generic readahead mechanism may read pages beyond the
      range we request (and it very often does it), which means a call to
      filemap_range_has_page() will return true due to the readahead that was
      triggered when processing a previous extent - we don't have a simple way
      to distinguish this case from the case where the data was brought into
      the page cache through someone else. So checking for the mapping number
      of pages being 0 when we first open the inode is simple, cheap and it
      generally accomplishes the goal of not trashing the page cache - the
      only exception is if part of the data was previously loaded into the
      page cache through the snapshot by some other process; in that case we
      end up not evicting any data send brings into the page cache, just like
      before this change, but that is not the common case.
      
      Example scenario, on a box with 32G of RAM:
      
        $ btrfs subvolume create /mnt/sv1
        $ xfs_io -f -c "pwrite 0 4G" /mnt/sv1/file1
      
        $ btrfs subvolume snapshot -r /mnt/sv1 /mnt/snap1
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       26866           0        4883       31297
        Swap:           8188           0        8188
      
        # After this we have about 4G less free memory.
        $ btrfs send /mnt/snap1 >/dev/null
      
        $ free -m
                       total        used        free      shared  buff/cache   available
        Mem:           31937         186       22814           0        8935       31297
        Swap:           8188           0        8188
      
      The same, obviously, applies to an incremental send.
      
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      152555b3
  6. 16 May, 2022 20 commits