1. 30 Apr, 2024 1 commit
    • Qu Wenruo's avatar
      btrfs: set correct ram_bytes when splitting ordered extent · 63a6ce5a
      Qu Wenruo authored
      [BUG]
      When running generic/287, the following file extent items can be
      generated:
      
              item 16 key (258 EXTENT_DATA 2682880) itemoff 15305 itemsize 53
                      generation 9 type 1 (regular)
                      extent data disk byte 1378414592 nr 462848
                      extent data offset 0 nr 462848 ram 2097152
                      extent compression 0 (none)
      
      Note that file extent item is not a compressed one, but its ram_bytes is
      way larger than its disk_num_bytes.
      
      According to btrfs on-disk scheme, ram_bytes should match disk_num_bytes
      if it's not a compressed one.
      
      [CAUSE]
      Since commit b73a6fd1 ("btrfs: split partial dio bios before
      submit"), for partial dio writes, we would split the ordered extent.
      
      However the function btrfs_split_ordered_extent() doesn't update the
      ram_bytes even it has already shrunk the disk_num_bytes.
      
      Originally the function btrfs_split_ordered_extent() is only introduced
      for zoned devices in commit d22002fd ("btrfs: zoned: split ordered
      extent when bio is sent"), but later commit b73a6fd1 ("btrfs: split
      partial dio bios before submit") makes non-zoned btrfs affected.
      
      Thankfully for un-compressed file extent, we do not really utilize the
      ram_bytes member, thus it won't cause any real problem.
      
      [FIX]
      Also update btrfs_ordered_extent::ram_bytes inside
      btrfs_split_ordered_extent().
      
      Fixes: d22002fd ("btrfs: zoned: split ordered extent when bio is sent")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63a6ce5a
  2. 25 Apr, 2024 2 commits
    • Josef Bacik's avatar
      btrfs: take the cleaner_mutex earlier in qgroup disable · 0f2b8098
      Josef Bacik authored
      One of my CI runs popped the following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.9.0-rc4+ #1 Not tainted
      ------------------------------------------------------
      btrfs/471533 is trying to acquire lock:
      ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0
      
      but task is already holding lock:
      ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #2 (&fs_info->subvol_sem){++++}-{3:3}:
             down_read+0x42/0x170
             btrfs_rename+0x607/0xb00
             btrfs_rename2+0x2e/0x70
             vfs_rename+0xaf8/0xfc0
             do_renameat2+0x586/0x600
             __x64_sys_rename+0x43/0x50
             do_syscall_64+0x95/0x180
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      -> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}:
             down_write+0x3f/0xc0
             btrfs_inode_lock+0x40/0x70
             prealloc_file_extent_cluster+0x1b0/0x370
             relocate_file_extent_cluster+0xb2/0x720
             relocate_data_extent+0x107/0x160
             relocate_block_group+0x442/0x550
             btrfs_relocate_block_group+0x2cb/0x4b0
             btrfs_relocate_chunk+0x50/0x1b0
             btrfs_balance+0x92f/0x13d0
             btrfs_ioctl+0x1abf/0x2600
             __x64_sys_ioctl+0x97/0xd0
             do_syscall_64+0x95/0x180
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      -> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}:
             __lock_acquire+0x13e7/0x2180
             lock_acquire+0xcb/0x2e0
             __mutex_lock+0xbe/0xc00
             btrfs_quota_disable+0x54/0x4c0
             btrfs_ioctl+0x206b/0x2600
             __x64_sys_ioctl+0x97/0xd0
             do_syscall_64+0x95/0x180
             entry_SYSCALL_64_after_hwframe+0x76/0x7e
      
      other info that might help us debug this:
      
      Chain exists of:
        &fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&fs_info->subvol_sem);
                                     lock(&sb->s_type->i_mutex_key#16);
                                     lock(&fs_info->subvol_sem);
        lock(&fs_info->cleaner_mutex);
      
       *** DEADLOCK ***
      
      2 locks held by btrfs/471533:
       #0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600
       #1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
      
      stack backtrace:
      CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1
      Call Trace:
       <TASK>
       dump_stack_lvl+0x77/0xb0
       check_noncircular+0x148/0x160
       ? lock_acquire+0xcb/0x2e0
       __lock_acquire+0x13e7/0x2180
       lock_acquire+0xcb/0x2e0
       ? btrfs_quota_disable+0x54/0x4c0
       ? lock_is_held_type+0x9a/0x110
       __mutex_lock+0xbe/0xc00
       ? btrfs_quota_disable+0x54/0x4c0
       ? srso_return_thunk+0x5/0x5f
       ? lock_acquire+0xcb/0x2e0
       ? btrfs_quota_disable+0x54/0x4c0
       ? btrfs_quota_disable+0x54/0x4c0
       btrfs_quota_disable+0x54/0x4c0
       btrfs_ioctl+0x206b/0x2600
       ? srso_return_thunk+0x5/0x5f
       ? __do_sys_statfs+0x61/0x70
       __x64_sys_ioctl+0x97/0xd0
       do_syscall_64+0x95/0x180
       ? srso_return_thunk+0x5/0x5f
       ? reacquire_held_locks+0xd1/0x1f0
       ? do_user_addr_fault+0x307/0x8a0
       ? srso_return_thunk+0x5/0x5f
       ? lock_acquire+0xcb/0x2e0
       ? srso_return_thunk+0x5/0x5f
       ? srso_return_thunk+0x5/0x5f
       ? find_held_lock+0x2b/0x80
       ? srso_return_thunk+0x5/0x5f
       ? lock_release+0xca/0x2a0
       ? srso_return_thunk+0x5/0x5f
       ? do_user_addr_fault+0x35c/0x8a0
       ? srso_return_thunk+0x5/0x5f
       ? trace_hardirqs_off+0x4b/0xc0
       ? srso_return_thunk+0x5/0x5f
       ? lockdep_hardirqs_on_prepare+0xde/0x190
       ? srso_return_thunk+0x5/0x5f
      
      This happens because when we call rename we already have the inode mutex
      held, and then we acquire the subvol_sem if we are a subvolume.  This
      makes the dependency
      
      inode lock -> subvol sem
      
      When we're running data relocation we will preallocate space for the
      data relocation inode, and we always run the relocation under the
      ->cleaner_mutex.  This now creates the dependency of
      
      cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem
      
      Qgroup delete is doing this in the opposite order, it is acquiring the
      subvol_sem and then it is acquiring the cleaner_mutex, which results in
      this lockdep splat.  This deadlock can't happen in reality, because we
      won't ever rename the data reloc inode, nor is the data reloc inode a
      subvolume.
      
      However this is fairly easy to fix, simply take the cleaner mutex in the
      case where we are disabling qgroups before we take the subvol_sem.  This
      resolves the lockdep splat.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0f2b8098
    • Dominique Martinet's avatar
      btrfs: add missing mutex_unlock in btrfs_relocate_sys_chunks() · 9af503d9
      Dominique Martinet authored
      The previous patch that replaced BUG_ON by error handling forgot to
      unlock the mutex in the error path.
      
      Link: https://lore.kernel.org/all/Zh%2fHpAGFqa7YAFuM@duo.ucw.czReported-by: default avatarPavel Machek <pavel@denx.de>
      Fixes: 7411055d ("btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()")
      CC: stable@vger.kernel.org
      Reviewed-by: default avatarPavel Machek <pavel@denx.de>
      Signed-off-by: default avatarDominique Martinet <dominique.martinet@atmark-techno.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9af503d9
  3. 18 Apr, 2024 2 commits
    • Qu Wenruo's avatar
      btrfs: fix wrong block_start calculation for btrfs_drop_extent_map_range() · fe1c6c7a
      Qu Wenruo authored
      [BUG]
      During my extent_map cleanup/refactor, with extra sanity checks,
      extent-map-tests::test_case_7() would not pass the checks.
      
      The problem is, after btrfs_drop_extent_map_range(), the resulted
      extent_map has a @block_start way too large.
      Meanwhile my btrfs_file_extent_item based members are returning a
      correct @disk_bytenr/@offset combination.
      
      The extent map layout looks like this:
      
           0        16K    32K       48K
           | PINNED |      | Regular |
      
      The regular em at [32K, 48K) also has 32K @block_start.
      
      Then drop range [0, 36K), which should shrink the regular one to be
      [36K, 48K).
      However the @block_start is incorrect, we expect 32K + 4K, but got 52K.
      
      [CAUSE]
      Inside btrfs_drop_extent_map_range() function, if we hit an extent_map
      that covers the target range but is still beyond it, we need to split
      that extent map into half:
      
      	|<-- drop range -->|
      		 |<----- existing extent_map --->|
      
      And if the extent map is not compressed, we need to forward
      extent_map::block_start by the difference between the end of drop range
      and the extent map start.
      
      However in that particular case, the difference is calculated using
      (start + len - em->start).
      
      The problem is @start can be modified if the drop range covers any
      pinned extent.
      
      This leads to wrong calculation, and would be caught by my later
      extent_map sanity checks, which checks the em::block_start against
      btrfs_file_extent_item::disk_bytenr + btrfs_file_extent_item::offset.
      
      This is a regression caused by commit c962098c ("btrfs: fix
      incorrect splitting in btrfs_drop_extent_map_range"), which removed the
      @len update for pinned extents.
      
      [FIX]
      Fix it by avoiding using @start completely, and use @end - em->start
      instead, which @end is exclusive bytenr number.
      
      And update the test case to verify the @block_start to prevent such
      problem from happening.
      
      Thankfully this is not going to lead to any data corruption, as IO path
      does not utilize btrfs_drop_extent_map_range() with @skip_pinned set.
      
      So this fix is only here for the sake of consistency/correctness.
      
      CC: stable@vger.kernel.org # 6.5+
      Fixes: c962098c ("btrfs: fix incorrect splitting in btrfs_drop_extent_map_range")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fe1c6c7a
    • Johannes Thumshirn's avatar
      btrfs: fix information leak in btrfs_ioctl_logical_to_ino() · 2f7ef5bb
      Johannes Thumshirn authored
      Syzbot reported the following information leak for in
      btrfs_ioctl_logical_to_ino():
      
        BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
        BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40
         instrument_copy_to_user include/linux/instrumented.h:114 [inline]
         _copy_to_user+0xbc/0x110 lib/usercopy.c:40
         copy_to_user include/linux/uaccess.h:191 [inline]
         btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499
         btrfs_ioctl+0x714/0x1260
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:904 [inline]
         __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
         __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
         x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
        Uninit was created at:
         __kmalloc_large_node+0x231/0x370 mm/slub.c:3921
         __do_kmalloc_node mm/slub.c:3954 [inline]
         __kmalloc_node+0xb07/0x1060 mm/slub.c:3973
         kmalloc_node include/linux/slab.h:648 [inline]
         kvmalloc_node+0xc0/0x2d0 mm/util.c:634
         kvmalloc include/linux/slab.h:766 [inline]
         init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779
         btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480
         btrfs_ioctl+0x714/0x1260
         vfs_ioctl fs/ioctl.c:51 [inline]
         __do_sys_ioctl fs/ioctl.c:904 [inline]
         __se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
         __x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
         x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
         do_syscall_x64 arch/x86/entry/common.c:52 [inline]
         do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
         entry_SYSCALL_64_after_hwframe+0x77/0x7f
      
        Bytes 40-65535 of 65536 are uninitialized
        Memory access of size 65536 starts at ffff888045a40000
      
      This happens, because we're copying a 'struct btrfs_data_container' back
      to user-space. This btrfs_data_container is allocated in
      'init_data_container()' via kvmalloc(), which does not zero-fill the
      memory.
      
      Fix this by using kvzalloc() which zeroes out the memory on allocation.
      
      CC: stable@vger.kernel.org # 4.14+
      Reported-by: default avatar  <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <Johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f7ef5bb
  4. 17 Apr, 2024 3 commits
    • Sweet Tea Dorminy's avatar
      btrfs: fallback if compressed IO fails for ENOSPC · 131a821a
      Sweet Tea Dorminy authored
      In commit b4ccace8 ("btrfs: refactor submit_compressed_extents()"), if
      an async extent compressed but failed to find enough space, we changed
      from falling back to an uncompressed write to just failing the write
      altogether. The principle was that if there's not enough space to write
      the compressed version of the data, there can't possibly be enough space
      to write the larger, uncompressed version of the data.
      
      However, this isn't necessarily true: due to fragmentation, there could
      be enough discontiguous free blocks to write the uncompressed version,
      but not enough contiguous free blocks to write the smaller but
      unsplittable compressed version.
      
      This has occurred to an internal workload which relied on write()'s
      return value indicating there was space. While rare, it has happened a
      few times.
      
      Thus, in order to prevent early ENOSPC, re-add a fallback to
      uncompressed writing.
      
      Fixes: b4ccace8 ("btrfs: refactor submit_compressed_extents()")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Co-developed-by: default avatarNeal Gompa <neal@gompa.dev>
      Signed-off-by: default avatarNeal Gompa <neal@gompa.dev>
      Signed-off-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      131a821a
    • Naohiro Aota's avatar
      btrfs: scrub: run relocation repair when/only needed · 7192833c
      Naohiro Aota authored
      When btrfs scrub finds an error, it reads mirrors to find correct data. If
      all the errors are fixed, sctx->error_bitmap is cleared for the stripe
      range. However, in the zoned mode, it runs relocation to repair scrub
      errors when the bitmap is *not* empty, which is a flipped condition.
      
      Also, it runs the relocation even if the scrub is read-only. This was
      missed by a fix in commit 1f2030ff ("btrfs: scrub: respect the
      read-only flag during repair").
      
      The repair is only necessary when there is a repaired sector and should be
      done on read-write scrub. So, tweak the condition for both regular and
      zoned case.
      
      Fixes: 54765392 ("btrfs: scrub: introduce helper to queue a stripe for scrub")
      Fixes: 1f2030ff ("btrfs: scrub: respect the read-only flag during repair")
      CC: stable@vger.kernel.org # 6.6+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7192833c
    • David Sterba's avatar
      btrfs: remove colon from messages with state · e5a78fde
      David Sterba authored
      The message format in syslog is usually made of two parts:
      
        prefix ":" message
      
      Various tools parse the prefix up to the first ":". When there's
      an additional status of a btrfs filesystem like
      
        [5.199782] BTRFS info (device nvme1n1p1: state M): use zstd compression, level 9
      
      where 'M' is for remount, there's one more ":" that does not conform to
      the format. Remove it.
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e5a78fde
  5. 09 Apr, 2024 3 commits
  6. 02 Apr, 2024 6 commits
    • Boris Burkov's avatar
      btrfs: always clear PERTRANS metadata during commit · 6e68de0b
      Boris Burkov authored
      It is possible to clear a root's IN_TRANS tag from the radix tree, but
      not clear its PERTRANS, if there is some error in between. Eliminate
      that possibility by moving the free up to where we clear the tag.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6e68de0b
    • Boris Burkov's avatar
      btrfs: make btrfs_clear_delalloc_extent() free delalloc reserve · 3c6f0c5e
      Boris Burkov authored
      Currently, this call site in btrfs_clear_delalloc_extent() only converts
      the reservation. We are marking it not delalloc, so I don't think it
      makes sense to keep the rsv around.  This is a path where we are not
      sure to join a transaction, so it leads to incorrect free-ing during
      umount.
      
      Helps with the pass rate of generic/269 and generic/475.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3c6f0c5e
    • Boris Burkov's avatar
      btrfs: qgroup: convert PREALLOC to PERTRANS after record_root_in_trans · 211de933
      Boris Burkov authored
      The transaction is only able to free PERTRANS reservations for a root
      once that root has been recorded with the TRANS tag on the roots radix
      tree. Therefore, until we are sure that this root will get tagged, it
      isn't safe to convert. Generally, this is not an issue as *some*
      transaction will likely tag the root before long and this reservation
      will get freed in that transaction, but technically it could stick
      around until unmount and result in a warning about leaked metadata
      reservation space.
      
      This path is most exercised by running the generic/269 fstest with
      CONFIG_BTRFS_DEBUG.
      
      Fixes: a6496849 ("btrfs: fix start transaction qgroup rsv double free")
      CC: stable@vger.kernel.org # 6.6+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      211de933
    • Boris Burkov's avatar
      btrfs: record delayed inode root in transaction · 71537e35
      Boris Burkov authored
      When running delayed inode updates, we do not record the inode's root in
      the transaction, but we do allocate PREALLOC and thus converted PERTRANS
      space for it. To be sure we free that PERTRANS meta rsv, we must ensure
      that we record the root in the transaction.
      
      Fixes: 4f5427cc ("btrfs: delayed-inode: Use new qgroup meta rsv for delayed inode and item")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      71537e35
    • Boris Burkov's avatar
      btrfs: qgroup: fix qgroup prealloc rsv leak in subvolume operations · 74e97958
      Boris Burkov authored
      Create subvolume, create snapshot and delete subvolume all use
      btrfs_subvolume_reserve_metadata() to reserve metadata for the changes
      done to the parent subvolume's fs tree, which cannot be mediated in the
      normal way via start_transaction. When quota groups (squota or qgroups)
      are enabled, this reserves qgroup metadata of type PREALLOC. Once the
      operation is associated to a transaction, we convert PREALLOC to
      PERTRANS, which gets cleared in bulk at the end of the transaction.
      
      However, the error paths of these three operations were not implementing
      this lifecycle correctly. They unconditionally converted the PREALLOC to
      PERTRANS in a generic cleanup step regardless of errors or whether the
      operation was fully associated to a transaction or not. This resulted in
      error paths occasionally converting this rsv to PERTRANS without calling
      record_root_in_trans successfully, which meant that unless that root got
      recorded in the transaction by some other thread, the end of the
      transaction would not free that root's PERTRANS, leaking it. Ultimately,
      this resulted in hitting a WARN in CONFIG_BTRFS_DEBUG builds at unmount
      for the leaked reservation.
      
      The fix is to ensure that every qgroup PREALLOC reservation observes the
      following properties:
      
      1. any failure before record_root_in_trans is called successfully
         results in freeing the PREALLOC reservation.
      2. after record_root_in_trans, we convert to PERTRANS, and now the
         transaction owns freeing the reservation.
      
      This patch enforces those properties on the three operations. Without
      it, generic/269 with squotas enabled at mkfs time would fail in ~5-10
      runs on my system. With this patch, it ran successfully 1000 times in a
      row.
      
      Fixes: e85fde51 ("btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      74e97958
    • Boris Burkov's avatar
      btrfs: qgroup: correctly model root qgroup rsv in convert · 141fb8cd
      Boris Burkov authored
      We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
      pertrans reservations for subvolumes when quotas are enabled. The
      convert function does not properly increment pertrans after decrementing
      prealloc, so the count is not accurate.
      
      Note: we check that the fs is not read-only to mirror the logic in
      qgroup_convert_meta, which checks that before adding to the pertrans rsv.
      
      Fixes: 8287475a ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      141fb8cd
  7. 26 Mar, 2024 9 commits
    • Tavian Barnes's avatar
      btrfs: fix race in read_extent_buffer_pages() · ef1e6823
      Tavian Barnes authored
      There are reports from tree-checker that detects corrupted nodes,
      without any obvious pattern so possibly an overwrite in memory.
      After some debugging it turns out there's a race when reading an extent
      buffer the uptodate status can be missed.
      
      To prevent concurrent reads for the same extent buffer,
      read_extent_buffer_pages() performs these checks:
      
          /* (1) */
          if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
              return 0;
      
          /* (2) */
          if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
              goto done;
      
      At this point, it seems safe to start the actual read operation. Once
      that completes, end_bbio_meta_read() does
      
          /* (3) */
          set_extent_buffer_uptodate(eb);
      
          /* (4) */
          clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
      
      Normally, this is enough to ensure only one read happens, and all other
      callers wait for it to finish before returning.  Unfortunately, there is
      a racey interleaving:
      
          Thread A | Thread B | Thread C
          ---------+----------+---------
             (1)   |          |
                   |    (1)   |
             (2)   |          |
             (3)   |          |
             (4)   |          |
                   |    (2)   |
                   |          |    (1)
      
      When this happens, thread B kicks of an unnecessary read. Worse, thread
      C will see UPTODATE set and return immediately, while the read from
      thread B is still in progress.  This race could result in tree-checker
      errors like this as the extent buffer is concurrently modified:
      
          BTRFS critical (device dm-0): corrupted node, root=256
          block=8550954455682405139 owner mismatch, have 11858205567642294356
          expect [256, 18446744073709551360]
      
      Fix it by testing UPTODATE again after setting the READING bit, and if
      it's been set, skip the unnecessary read.
      
      Fixes: d7172f52 ("btrfs: use per-buffer locking for extent_buffer reading")
      Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
      Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
      Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
      CC: stable@vger.kernel.org # 6.5+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarTavian Barnes <tavianator@tavianator.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ minor update of changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ef1e6823
    • Anand Jain's avatar
      btrfs: return accurate error code on open failure in open_fs_devices() · 2f1aeab9
      Anand Jain authored
      When attempting to exclusive open a device which has no exclusive open
      permission, such as a physical device associated with the flakey dm
      device, the open operation will fail, resulting in a mount failure.
      
      In this particular scenario, we erroneously return -EINVAL instead of the
      correct error code provided by the bdev_open_by_path() function, which is
      -EBUSY.
      
      Fix this, by returning error code from the bdev_open_by_path() function.
      With this correction, the mount error message will align with that of
      ext4 and xfs.
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2f1aeab9
    • Johannes Thumshirn's avatar
      btrfs: zoned: don't skip block groups with 100% zone unusable · a8b70c7f
      Johannes Thumshirn authored
      Commit f4a9f219 ("btrfs: do not delete unused block group if it may be
      used soon") changed the behaviour of deleting unused block-groups on zoned
      filesystems. Starting with this commit, we're using
      btrfs_space_info_used() to calculate the number of used bytes in a
      space_info. But btrfs_space_info_used() also accounts
      btrfs_space_info::bytes_zone_unusable as used bytes.
      
      So if a block group is 100% zone_unusable it is skipped from the deletion
      step.
      
      In order not to skip fully zone_unusable block-groups, also check if the
      block-group has bytes left that can be used on a zoned filesystem.
      
      Fixes: f4a9f219 ("btrfs: do not delete unused block group if it may be used soon")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a8b70c7f
    • Filipe Manana's avatar
      btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping() · 21334600
      Filipe Manana authored
      At btrfs_add_extent_mapping(), if we failed to merge the extent map, which
      is unexpected and theoretically should never happen, we use WARN_ONCE() to
      log a message which is not great because we don't get information about
      which filesystem it relates to in case we have multiple btrfs filesystems
      mounted. So change this to use btrfs_warn() and surround the error check
      with WARN_ON() so we always get a useful stack trace and the condition is
      flagged as "unlikely" since it's not expected to ever happen.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      21334600
    • Filipe Manana's avatar
      btrfs: fix message not properly printing interval when adding extent map · 379c8723
      Filipe Manana authored
      At btrfs_add_extent_mapping(), if we are unable to merge the existing
      extent map, we print a warning message that suggests interval ranges in
      the form "[X, Y)", where the first element is the inclusive start offset
      of a range and the second element is the exclusive end offset. However
      we end up printing the length of the ranges instead of the exclusive end
      offsets. So fix this by printing the range end offsets.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      379c8723
    • Filipe Manana's avatar
      btrfs: fix warning messages not printing interval at unpin_extent_range() · 4dc1d69c
      Filipe Manana authored
      At unpin_extent_range() we print warning messages that are supposed to
      print an interval in the form "[X, Y)", with the first element being an
      inclusive start offset and the second element being the exclusive end
      offset of a range. However we end up printing the range's length instead
      of the range's exclusive end offset, so fix that to avoid having confusing
      and non-sense messages in case we hit one of these unexpected scenarios.
      
      Fixes: 00deaf04 ("btrfs: log messages at unpin_extent_range() during unexpected cases")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4dc1d69c
    • Filipe Manana's avatar
      btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache() · 8a565ec0
      Filipe Manana authored
      At unpin_extent_cache() if we happen to find an extent map with an
      unexpected start offset, we jump to the 'out' label and never release the
      reference we added to the extent map through the call to
      lookup_extent_mapping(), therefore resulting in a leak. So fix this by
      moving the free_extent_map() under the 'out' label.
      
      Fixes: c03c89f8 ("btrfs: handle errors returned from unpin_extent_cache()")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a565ec0
    • Anand Jain's avatar
      btrfs: validate device maj:min during open · 9f7eb840
      Anand Jain authored
      Boris managed to create a device capable of changing its maj:min without
      altering its device path.
      
      Only multi-devices can be scanned. A device that gets scanned and remains
      in the btrfs kernel cache might end up with an incorrect maj:min.
      
      Despite the temp-fsid feature patch did not introduce this bug, it could
      lead to issues if the above multi-device is converted to a single device
      with a stale maj:min. Subsequently, attempting to mount the same device
      with the correct maj:min might mistake it for another device with the same
      fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.
      
      To address this, this patch validates the device's maj:min at the time of
      device open and updates it if it has changed since the last scan.
      
      CC: stable@vger.kernel.org # 6.7+
      Fixes: a5b8a5f9 ("btrfs: support cloned-device mount capability")
      Reported-by: default avatarBoris Burkov <boris@bur.io>
      Co-developed-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: Boris Burkov <boris@bur.io>#
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9f7eb840
    • Johannes Thumshirn's avatar
      btrfs: zoned: fix use-after-free in do_zone_finish() · 1ec17ef5
      Johannes Thumshirn authored
      Shinichiro reported the following use-after-free triggered by the device
      replace operation in fstests btrfs/070.
      
       BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
       ==================================================================
       BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
       Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007
      
       CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G        W          6.8.0-rc5-kts #1
       Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
       Call Trace:
        <TASK>
        dump_stack_lvl+0x5b/0x90
        print_report+0xcf/0x670
        ? __virt_addr_valid+0x200/0x3e0
        kasan_report+0xd8/0x110
        ? do_zone_finish+0x91a/0xb90 [btrfs]
        ? do_zone_finish+0x91a/0xb90 [btrfs]
        do_zone_finish+0x91a/0xb90 [btrfs]
        btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
        ? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
        ? btrfs_put_root+0x2d/0x220 [btrfs]
        ? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
        cleaner_kthread+0x21e/0x380 [btrfs]
        ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
        kthread+0x2e3/0x3c0
        ? __pfx_kthread+0x10/0x10
        ret_from_fork+0x31/0x70
        ? __pfx_kthread+0x10/0x10
        ret_from_fork_asm+0x1b/0x30
        </TASK>
      
       Allocated by task 3493983:
        kasan_save_stack+0x33/0x60
        kasan_save_track+0x14/0x30
        __kasan_kmalloc+0xaa/0xb0
        btrfs_alloc_device+0xb3/0x4e0 [btrfs]
        device_list_add.constprop.0+0x993/0x1630 [btrfs]
        btrfs_scan_one_device+0x219/0x3d0 [btrfs]
        btrfs_control_ioctl+0x26e/0x310 [btrfs]
        __x64_sys_ioctl+0x134/0x1b0
        do_syscall_64+0x99/0x190
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
       Freed by task 3494056:
        kasan_save_stack+0x33/0x60
        kasan_save_track+0x14/0x30
        kasan_save_free_info+0x3f/0x60
        poison_slab_object+0x102/0x170
        __kasan_slab_free+0x32/0x70
        kfree+0x11b/0x320
        btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
        btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
        btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
        btrfs_ioctl+0xb27/0x57d0 [btrfs]
        __x64_sys_ioctl+0x134/0x1b0
        do_syscall_64+0x99/0x190
        entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
       The buggy address belongs to the object at ffff8881543c8000
        which belongs to the cache kmalloc-1k of size 1024
       The buggy address is located 96 bytes inside of
        freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)
      
       The buggy address belongs to the physical page:
       page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
       head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
       flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
       page_type: 0xffffffff()
       raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
       raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
       page dumped because: kasan: bad access detected
      
       Memory state around the buggy address:
        ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       >ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                              ^
        ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
        ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      This UAF happens because we're accessing stale zone information of a
      already removed btrfs_device in do_zone_finish().
      
      The sequence of events is as follows:
      
      btrfs_dev_replace_start
        btrfs_scrub_dev
         btrfs_dev_replace_finishing
          btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
          btrfs_rm_dev_replace_free_srcdev
           btrfs_free_device                              <-- device freed
      
      cleaner_kthread
       btrfs_delete_unused_bgs
        btrfs_zone_finish
         do_zone_finish              <-- refers the freed device
      
      The reason for this is that we're using a cached pointer to the chunk_map
      from the block group, but on device replace this cached pointer can
      contain stale device entries.
      
      The staleness comes from the fact, that btrfs_block_group::physical_map is
      not a pointer to a btrfs_chunk_map but a memory copy of it.
      
      Also take the fs_info::dev_replace::rwsem to prevent
      btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
      underneath us again.
      
      Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
      fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
      cannot take it as the call to blkdev_zone_mgmt() requires a memory
      allocation which may not sleep.
      But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
      the fs_info::dev_replace::rwsem held in write mode.
      
      Many thanks to Shinichiro for analyzing the bug.
      Reported-by: default avatarShinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      CC: stable@vger.kernel.org # 6.8
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1ec17ef5
  8. 15 Mar, 2024 1 commit
  9. 05 Mar, 2024 13 commits
    • Filipe Manana's avatar
      btrfs: reuse cloned extent buffer during fiemap to avoid re-allocations · 1cab1375
      Filipe Manana authored
      During fiemap we may have to visit multiple leaves of the subvolume's
      inode tree, and each time we are freeing and allocating an extent buffer
      to use as a clone of each visited leaf. Optimize this by reusing cloned
      extent buffers, to avoid the freeing and re-allocation both of the extent
      buffer structure itself and more importantly of the pages attached to the
      extent buffer.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1cab1375
    • Filipe Manana's avatar
      btrfs: fix race when detecting delalloc ranges during fiemap · 978b63f7
      Filipe Manana authored
      For fiemap we recently stopped locking the target extent range for the
      whole duration of the fiemap call, in order to avoid a deadlock in a
      scenario where the fiemap buffer happens to be a memory mapped range of
      the same file. This use case is very unlikely to be useful in practice but
      it may be triggered by fuzz testing (syzbot, etc).
      
      This however introduced a race that makes us miss delalloc ranges for
      file regions that are currently holes, so the caller of fiemap will not
      be aware that there's data for some file regions. This can be quite
      serious for some use cases - for example in coreutils versions before 9.0,
      the cp program used fiemap to detect holes and data in the source file,
      copying only regions with data (extents or delalloc) from the source file
      to the destination file in order to preserve holes (see the documentation
      for its --sparse command line option). This means that if cp was used
      with a source file that had delalloc in a hole, the destination file could
      end up without that data, which is effectively a data loss issue, if it
      happened to hit the race described below.
      
      The race happens like this:
      
      1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
         has delalloc in the file range [64M, 65M[, which is currently a hole;
      
      2) Fiemap locks the inode in shared mode, then starts iterating the
         inode's subvolume tree searching for file extent items, without having
         the whole fiemap target range locked in the inode's io tree - the
         change introduced recently by commit b0ad381f ("btrfs: fix
         deadlock with fiemap and extent locking"). It only locks ranges in
         the io tree when it finds a hole or prealloc extent since that
         commit;
      
      3) Note that fiemap clones each leaf before using it, and this is to
         avoid deadlocks when locking a file range in the inode's io tree and
         the fiemap buffer is memory mapped to some file, because writing
         to the page with btrfs_page_mkwrite() will wait on any ordered extent
         for the page's range and the ordered extent needs to lock the range
         and may need to modify the same leaf, therefore leading to a deadlock
         on the leaf;
      
      4) While iterating the file extent items in the cloned leaf before
         finding the hole in the range [64M, 65M[, the delalloc in that range
         is flushed and its ordered extent completes - meaning the corresponding
         file extent item is in the inode's subvolume tree, but not present in
         the cloned leaf that fiemap is iterating over;
      
      5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
         the cloned leaf (or a file extent item with disk_bytenr == 0 in case
         the NO_HOLES feature is not enabled), it will lock that file range in
         the inode's io tree and then search for delalloc by checking for the
         EXTENT_DELALLOC bit in the io tree for that range and ordered extents
         (with btrfs_find_delalloc_in_range()). But it finds nothing since the
         delalloc in that range was already flushed and the ordered extent
         completed and is gone - as a result fiemap will not report that there's
         delalloc or an extent for the range [64M, 65M[, so user space will be
         mislead into thinking that there's a hole in that range.
      
      This could actually be sporadically triggered with test case generic/094
      from fstests, which reports a missing extent/delalloc range like this:
      
        generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
            --- tests/generic/094.out	2020-06-10 19:29:03.830519425 +0100
            +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad	2024-02-28 11:00:00.381071525 +0000
            @@ -1,3 +1,9 @@
             QA output created by 094
             fiemap run with sync
             fiemap run without sync
            +ERROR: couldn't find extent at 7
            +map is 'HHDDHPPDPHPH'
            +logical: [       5..       6] phys:   301517..  301518 flags: 0x800 tot: 2
            +logical: [       8..       8] phys:   301520..  301520 flags: 0x800 tot: 1
            ...
            (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad'  to see the entire diff)
      
      So in order to fix this, while still avoiding deadlocks in the case where
      the fiemap buffer is memory mapped to the same file, change fiemap to work
      like the following:
      
      1) Always lock the whole range in the inode's io tree before starting to
         iterate the inode's subvolume tree searching for file extent items,
         just like we did before commit b0ad381f ("btrfs: fix deadlock with
         fiemap and extent locking");
      
      2) Now instead of writing to the fiemap buffer every time we have an extent
         to report, write instead to a temporary buffer (1 page), and when that
         buffer becomes full, stop iterating the file extent items, unlock the
         range in the io tree, release the search path, submit all the entries
         kept in that buffer to the fiemap buffer, and then resume the search
         for file extent items after locking again the remainder of the range in
         the io tree.
      
         The buffer having a size of a page, allows for 146 entries in a system
         with 4K pages. This is a large enough value to have a good performance
         by avoiding too many restarts of the search for file extent items.
         In other words this preserves the huge performance gains made in the
         last two years to fiemap, while avoiding the deadlocks in case the
         fiemap buffer is memory mapped to the same file (useless in practice,
         but possible and exercised by fuzz testing and syzbot).
      
      Fixes: b0ad381f ("btrfs: fix deadlock with fiemap and extent locking")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      978b63f7
    • Filipe Manana's avatar
      btrfs: fix off-by-one chunk length calculation at contains_pending_extent() · ae6bd7f9
      Filipe Manana authored
      At contains_pending_extent() the value of the end offset of a chunk we
      found in the device's allocation state io tree is inclusive, so when
      we calculate the length we pass to the in_range() macro, we must sum
      1 to the expression "physical_end - physical_offset".
      
      In practice the wrong calculation should be harmless as chunks sizes
      are never 1 byte and we should never have 1 byte ranges of unallocated
      space. Nevertheless fix the wrong calculation.
      Reported-by: default avatarAlex Lyakas <alex.lyakas@zadara.com>
      Link: https://lore.kernel.org/linux-btrfs/CAOcd+r30e-f4R-5x-S7sV22RJPe7+pgwherA6xqN2_qe7o4XTg@mail.gmail.com/
      Fixes: 1c11b63e ("btrfs: replace pending/pinned chunks lists with io tree")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae6bd7f9
    • Qu Wenruo's avatar
      btrfs: qgroup: allow quick inherit if snapshot is created and added to the same parent · b20fe56c
      Qu Wenruo authored
      Currently "btrfs subvolume snapshot -i <qgroupid>" would always mark the
      qgroup inconsistent.
      
      This can be annoying if the fs has a lot of snapshots, and needs qgroup
      to get the accounting for the amount of bytes it can free for each
      snapshot.
      
      Although we have the new simple quote as a solution, there is also a
      case where we can skip the full scan, if all the following conditions
      are met:
      
      - The source subvolume belongs to a higher level parent qgroup
      - The parent qgroup already owns all its bytes exclusively
      - The new snapshot is also added to the same parent qgroup
      
      In that case, we only need to add nodesize to the parent qgroup and
      avoid a full rescan.
      
      This patch would add the extra quick accounting update for such inherit.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b20fe56c
    • Qu Wenruo's avatar
      btrfs: qgroup: validate btrfs_qgroup_inherit parameter · 86211eea
      Qu Wenruo authored
      [BUG]
      Currently btrfs can create subvolume with an invalid qgroup inherit
      without triggering any error:
      
        # mkfs.btrfs -O quota -f $dev
        # mount $dev $mnt
        # btrfs subvolume create -i 2/0 $mnt/subv1
        # btrfs qgroup show -prce --sync $mnt
        Qgroupid    Referenced    Exclusive   Path
        --------    ----------    ---------   ----
        0/5           16.00KiB     16.00KiB   <toplevel>
        0/256         16.00KiB     16.00KiB   subv1
      
      [CAUSE]
      We only do a very basic size check for btrfs_qgroup_inherit structure,
      but never really verify if the values are correct.
      
      Thus in btrfs_qgroup_inherit() function, we have to skip non-existing
      qgroups, and never return any error.
      
      [FIX]
      Fix the behavior and introduce extra checks:
      
      - Introduce early check for btrfs_qgroup_inherit structure
        Not only the size, but also all the qgroup ids would be verified.
      
        And the timing is very early, so we can return error early.
        This early check is very important for snapshot creation, as snapshot
        is delayed to transaction commit.
      
      - Drop support for btrfs_qgroup_inherit::num_ref_copies and
        num_excl_copies
        Those two members are used to specify to copy refr/excl numbers from
        other qgroups.
        This would definitely mark qgroup inconsistent, and btrfs-progs has
        dropped the support for them for a long time.
        It's time to drop the support for kernel.
      
      - Verify the supported btrfs_qgroup_inherit::flags
        Just in case we want to add extra flags for btrfs_qgroup_inherit.
      
      Now above subvolume creation would fail with -ENOENT other than silently
      ignore the non-existing qgroup.
      
      CC: stable@vger.kernel.org # 6.7+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      86211eea
    • Anand Jain's avatar
      btrfs: include device major and minor numbers in the device scan notice · 0782303a
      Anand Jain authored
      To better debug issues surrounding device scans, include the device's
      major and minor numbers in the device scan notice for btrfs.
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarSweet Tea Dorminy <sweettea-kernel@dorminy.me>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0782303a
    • Lijuan Li's avatar
      btrfs: mark btrfs_put_caching_control() static · 7ec28f83
      Lijuan Li authored
      btrfs_put_caching_control() is only used in block-group.c, so mark it
      static.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarLijuan Li <lilijuan@iscas.ac.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ec28f83
    • Chengming Zhou's avatar
      btrfs: remove SLAB_MEM_SPREAD flag use · ef5a05c5
      Chengming Zhou authored
      The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
      removed as of v6.8-rc1, so it became a dead flag since the commit
      16a1d968 ("mm/slab: remove mm/slab.c and slab_def.h"). And the
      series[1] went on to mark it obsolete to avoid confusion for users.
      Here we can just remove all its users, which has no functional change.
      
      [1] https://lore.kernel.org/all/20240223-slab-cleanup-flags-v2-1-02f1753e8303@suse.cz/Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChengming Zhou <zhouchengming@bytedance.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ef5a05c5
    • Qu Wenruo's avatar
      btrfs: qgroup: always free reserved space for extent records · d139ded8
      Qu Wenruo authored
      [BUG]
      If qgroup is marked inconsistent (e.g. caused by operations needing full
      subtree rescan, like creating a snapshot and assign to a higher level
      qgroup), btrfs would immediately start leaking its data reserved space.
      
      The following script can easily reproduce it:
      
        mkfs.btrfs -O quota -f $dev
        mount $dev $mnt
        btrfs subvolume create $mnt/subv1
        btrfs qgroup create 1/0 $mnt
      
        # This snapshot creation would mark qgroup inconsistent,
        # as the ownership involves different higher level qgroup, thus
        # we have to rescan both source and snapshot, which can be very
        # time consuming, thus here btrfs just choose to mark qgroup
        # inconsistent, and let users to determine when to do the rescan.
        btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1
      
        # Now this write would lead to qgroup rsv leak.
        xfs_io -f -c "pwrite 0 64k" $mnt/file1
      
        # And at unmount time, btrfs would report 64K DATA rsv space leaked.
        umount $mnt
      
      And we would have the following dmesg output for the unmount:
      
        BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
        BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536
      
      [CAUSE]
      Since commit e15e9f43 ("btrfs: introduce
      BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"),
      we introduce a mode for btrfs qgroup to skip the timing consuming
      backref walk, if the qgroup is already inconsistent.
      
      But this skip also covered the data reserved freeing, thus the qgroup
      reserved space for each newly created data extent would not be freed,
      thus cause the leakage.
      
      [FIX]
      Make the data extent reserved space freeing mandatory.
      
      The qgroup reserved space handling is way cheaper compared to the
      backref walking part, and we always have the super sensitive leak
      detector, thus it's definitely worth to always free the qgroup
      reserved data space.
      Reported-by: default avatarFabian Vogt <fvogt@suse.com>
      Fixes: e15e9f43 ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
      CC: stable@vger.kernel.org # 6.1+
      Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d139ded8
    • Qu Wenruo's avatar
      btrfs: tree-checker: dump the page status if hit something wrong · dd6a5719
      Qu Wenruo authored
      [BUG]
      There is a bug report about very suspicious tree-checker got triggered:
      
        BTRFS critical (device dm-0): corrupted node, root=256
      block=8550954455682405139 owner mismatch, have 11858205567642294356
      expect [256, 18446744073709551360]
        BTRFS critical (device dm-0): corrupted node, root=256
      block=8550954455682405139 owner mismatch, have 11858205567642294356
      expect [256, 18446744073709551360]
        BTRFS critical (device dm-0): corrupted node, root=256
      block=8550954455682405139 owner mismatch, have 11858205567642294356
      expect [256, 18446744073709551360]
        SELinux: inode_doinit_use_xattr:  getxattr returned 117 for dev=dm-0
      ino=5737268
      
      [ANALYZE]
      The root cause is still unclear, but there are some clues already:
      
      - Unaligned eb bytenr
        The block bytenr is 8550954455682405139, which is not even aligned to
        2.
        This bytenr is fetched from extent buffer header, not from eb->start.
      
        This means, at the initial time of read, eb header bytenr is still
        correct (the very basis check to continue read), but later something
        wrong happened, got at least the first page corrupted.
        Thus we got such obviously incorrect value.
      
      - Invalid extent buffer header owner
        The read itself is triggered for subvolume 256, but the eb header
        owner is 11858205567642294356, which is not really possible.
        The problem here is, subvolume id is limited to (1 << 48 - 1),
        and this one definitely goes beyond that limit.
      
        So this value is another garbage.
      
      We already got two garbage from an extent buffer, which passed the
      initial bytenr and csum checks, but later the contents become garbage at
      some point.
      
      This looks like a page lifespan problem (e.g. we didn't properly hold the
      page).
      
      [ENHANCEMENT]
      The current tree-checker only outputs things from the extent buffer,
      nothing with the page status.
      
      So this patch would enhance the tree-checker output by also dumping the
      first page, which would look like this:
      
        page:00000000aa9f3ce8 refcount:4 mapcount:0 mapping:00000000169aa6b6 index:0x1d0c pfn:0x1022e5
        memcg:ffff888103456000
        aops:btree_aops [btrfs] ino:1
        flags: 0x2ffff0000008000(private|node=0|zone=2|lastcpupid=0xffff)
        page_type: 0xffffffff()
        raw: 02ffff0000008000 0000000000000000 dead000000000122 ffff88811e06e220
        raw: 0000000000001d0c ffff888102fdb1d8 00000004ffffffff ffff888103456000
        page dumped because: eb page dump
        BTRFS critical (device dm-3): corrupt leaf: root=5 block=30457856 slot=6 ino=257 file_offset=0, invalid disk_bytenr for file extent, have 10617606235235216665, should be aligned to 4096
        BTRFS error (device dm-3): read time tree block corruption detected on logical 30457856 mirror 1
      
      From the dump we can see some extra info, something can help us to do
      extra cross-checks:
      
      - Page refcount
        if it's too low, it definitely means something bad.
      
      - Page aops
        Any mapped eb page should have btree_aops with inode number 1.
      
      - Page index
        Since a mapped eb page should has its bytenr matching the page
        position, (index << PAGE_SHIFT) should match the bytenr of the
        bytenr from the critical line.
      
      - Page Private flags
        A mapped eb page should have Private flag set to indicate it's managed
        by btrfs.
      
      Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      dd6a5719
    • Qu Wenruo's avatar
      btrfs: compression: remove dead comments in btrfs_compress_heuristic() · 25da852d
      Qu Wenruo authored
      Since commit a440d48c ("Btrfs: heuristic: implement sampling
      logic"), btrfs_compress_heuristic() is no longer a simple "return true",
      but more complex to determine if we should compress.
      
      Thus the comment is dead and can be confusing, just remove it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      25da852d
    • Qu Wenruo's avatar
      btrfs: subpage: make writer lock utilize bitmap · b086c5bd
      Qu Wenruo authored
      For the writer counter, it's pretty much the same as the reader counter,
      and they are exclusive.  So move them to the new locked bitmap.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b086c5bd
    • Qu Wenruo's avatar
      btrfs: subpage: make reader lock utilize bitmap · 8e7e9c67
      Qu Wenruo authored
      Currently btrfs_subpage utilizes its atomic member @reader to manage the
      reader counter.  However it is only utilized to prevent the page to be
      released/unlocked when we still have reads underway.
      
      In that use case, we don't really allow multiple readers on the same
      subpage sector.  So here we can introduce a new locked bitmap to
      represent exactly which subpage range is locked for read.
      
      In theory we can remove btrfs_subpage::reader as it's just the set bits
      of the new locked bitmap.  But unfortunately bitmap doesn't provide such
      handy API yet, so we still keep the reader counter.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8e7e9c67