1. 12 Jun, 2023 1 commit
    • Christoph Hellwig's avatar
      btrfs: fix iomap_begin length for nocow writes · 7833b865
      Christoph Hellwig authored
      can_nocow_extent can reduce the len passed in, which needs to be
      propagated to btrfs_dio_iomap_begin so that iomap does not submit
      more data then is mapped.
      
      This problems exists since the btrfs_get_blocks_direct helper was added
      in commit c5794e51 ("btrfs: Factor out write portion of
      btrfs_get_blocks_direct"), but the ordered_extent splitting added in
      commit b73a6fd1 ("btrfs: split partial dio bios before submit")
      added a WARN_ON that made a syzkaller test fail.
      
      Reported-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
      Fixes: c5794e51 ("btrfs: Factor out write portion of btrfs_get_blocks_direct")
      CC: stable@vger.kernel.org # 6.1+
      Tested-by: syzbot+ee90502d5c8fd1d0dd93@syzkaller.appspotmail.com
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7833b865
  2. 08 Jun, 2023 2 commits
    • Qu Wenruo's avatar
      btrfs: scrub: also report errors hit during the initial read · 79b8ee70
      Qu Wenruo authored
      [BUG]
      After the recent scrub rework introduced in commit e02ee89b ("btrfs:
      scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"),
      btrfs scrub no longer reports repaired errors any more:
      
        # mkfs.btrfs -f $dev -d DUP
        # mount $dev $mnt
        # xfs_io -f -d -c "pwrite -b 64K -S 0xaa 0 64" $mnt/file
        # umount $dev
        # xfs_io -f -c "pwrite -S 0xff $phy1 64K" $dev # Corrupt the first mirror
        # mount $dev $mnt
        # btrfs scrub start -BR $mnt
        scrub done for 725e7cb7-8a4a-4c77-9f2a-86943619e218
        Scrub started:    Tue Jun  6 14:56:50 2023
        Status:           finished
        Duration:         0:00:00
        	data_extents_scrubbed: 2
        	tree_extents_scrubbed: 18
        	data_bytes_scrubbed: 131072
        	tree_bytes_scrubbed: 294912
        	read_errors: 0
        	csum_errors: 0 <<< No errors here
        	verify_errors: 0
               [...]
        	uncorrectable_errors: 0
        	unverified_errors: 0
        	corrected_errors: 16		<<< Only corrected errors
        	last_physical: 2723151872
      
      This can confuse btrfs-progs, as it relies on the csum_errors to
      determine if there is anything wrong.
      
      While on v6.3.x kernels, the report is different:
      
       	csum_errors: 16			<<<
       	verify_errors: 0
      	[...]
       	uncorrectable_errors: 0
       	unverified_errors: 0
       	corrected_errors: 16 <<<
      
      [CAUSE]
      In the reworked scrub, we update the scrub progress inside
      scrub_stripe_report_errors(), using various bitmaps to update the
      result.
      
      For example for csum_errors, we use bitmap_weight() of
      stripe->csum_error_bitmap.
      
      Unfortunately at that stage, all error bitmaps (except
      init_error_bitmap) are the result of the latest repair attempt, thus if
      the stripe is fully repaired, those error bitmaps will all be empty,
      resulting the above output mismatch.
      
      To fix this, record the number of errors into stripe->init_nr_*_errors.
      Since we don't really care about where those errors are, we only need to
      record the number of errors.
      
      Then in scrub_stripe_report_errors(), use those initial numbers to
      update the progress other than using the latest error bitmaps.
      
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      79b8ee70
    • Qu Wenruo's avatar
      btrfs: scrub: respect the read-only flag during repair · 1f2030ff
      Qu Wenruo authored
      [BUG]
      With recent scrub rework, the scrub operation no longer respects the
      read-only flag passed by "-r" option of "btrfs scrub start" command.
      
        # mkfs.btrfs -f -d raid1 $dev1 $dev2
        # mount $dev1 $mnt
        # xfs_io -f -d -c "pwrite -b 128K -S 0xaa 0 128k" $mnt/file
        # sync
        # xfs_io -c "pwrite -S 0xff $phy1 64k" $dev1
        # xfs_io -c "pwrite -S 0xff $((phy2 + 65536)) 64k" $dev2
        # mount $dev1 $mnt -o ro
        # btrfs scrub start -BrRd $mnt
        Scrub device $dev1 (id 1) done
        Scrub started:    Tue Jun  6 09:59:14 2023
        Status:           finished
        Duration:         0:00:00
               [...]
        	corrected_errors: 16 <<< Still has corrupted sectors
        	last_physical: 1372585984
      
        Scrub device $dev2 (id 2) done
        Scrub started:    Tue Jun  6 09:59:14 2023
        Status:           finished
        Duration:         0:00:00
               [...]
        	corrected_errors: 16 <<< Still has corrupted sectors
        	last_physical: 1351614464
      
        # btrfs scrub start -BrRd $mnt
        Scrub device $dev1 (id 1) done
        Scrub started:    Tue Jun  6 10:00:17 2023
        Status:           finished
        Duration:         0:00:00
               [...]
        	corrected_errors: 0 <<< No more errors
        	last_physical: 1372585984
      
        Scrub device $dev2 (id 2) done
               [...]
        	corrected_errors: 0 <<< No more errors
        	last_physical: 1372585984
      
      [CAUSE]
      In the newly reworked scrub code, repair is always submitted no matter
      if we're doing a read-only scrub.
      
      [FIX]
      Fix it by skipping the write submission if the scrub is a read-only one.
      
      Unfortunately for the report part, even for a read-only scrub we will
      still report it as corrected errors, as we know it's repairable, even we
      won't really submit the write.
      
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1f2030ff
  3. 06 Jun, 2023 1 commit
    • Chris Mason's avatar
      btrfs: properly enable async discard when switching from RO->RW · 981a37ba
      Chris Mason authored
      The async discard uses the BTRFS_FS_DISCARD_RUNNING bit in the fs_info
      to force discards off when the filesystem has aborted or we're generally
      not able to run discards.  This gets flipped on when we're mounted rw,
      and also when we go from ro->rw.
      
      Commit 63a7cb13 ("btrfs: auto enable discard=async when possible")
      enabled async discard by default, and this meant
      "mount -o ro /dev/xxx /yyy" had async discards turned on.
      
      Unfortunately, this meant our check in btrfs_remount_cleanup() would see
      that discards are already on:
      
          /* If we toggled discard async */
          if (!btrfs_raw_test_opt(old_opts, DISCARD_ASYNC) &&
      	btrfs_test_opt(fs_info, DISCARD_ASYNC))
      	    btrfs_discard_resume(fs_info);
      
      So, we'd never call btrfs_discard_resume() when remounting the root
      filesystem from ro->rw.
      
      drgn shows this really nicely:
      
      import os
      import sys
      
      from drgn.helpers.linux.fs import path_lookup
      from drgn import NULL, Object, Type, cast
      
      def btrfs_sb(sb):
          return cast("struct btrfs_fs_info *", sb.s_fs_info)
      
      if len(sys.argv) == 1:
          path = "/"
      else:
          path = sys.argv[1]
      
      fs_info = cast("struct btrfs_fs_info *", path_lookup(prog, path).mnt.mnt_sb.s_fs_info)
      
      BTRFS_FS_DISCARD_RUNNING = 1 << prog['BTRFS_FS_DISCARD_RUNNING']
      if fs_info.flags & BTRFS_FS_DISCARD_RUNNING:
          print("discard running flag is on")
      else:
          print("discard running flag is off")
      
      [root]# mount | grep nvme
      /dev/nvme0n1p3 on / type btrfs
      (rw,relatime,compress-force=zstd:3,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)
      
      [root]# ./discard_running.drgn
      discard running flag is off
      
      [root]# mount -o remount,discard=sync /
      [root]# mount -o remount,discard=async /
      [root]# ./discard_running.drgn
      discard running flag is on
      
      The fix is to call btrfs_discard_resume() when we're going from ro->rw.
      It already checks to make sure the async discard flag is on, so it'll do
      the right thing.
      
      Fixes: 63a7cb13 ("btrfs: auto enable discard=async when possible")
      CC: stable@vger.kernel.org # 6.3+
      Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarChris Mason <clm@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      981a37ba
  4. 05 Jun, 2023 1 commit
    • Qu Wenruo's avatar
      btrfs: subpage: fix a crash in metadata repair path · 917ac778
      Qu Wenruo authored
      [BUG]
      Test case btrfs/027 would crash with subpage (64K page size, 4K
      sectorsize) with the following dying messages:
      
        debug: map_length=16384 length=65536 type=metadata|raid6(0x104)
        assertion failed: map_length >= length, in fs/btrfs/volumes.c:8093
        ------------[ cut here ]------------
        kernel BUG at fs/btrfs/messages.c:259!
        Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
        Call trace:
         btrfs_assertfail+0x28/0x2c [btrfs]
         btrfs_map_repair_block+0x150/0x2b8 [btrfs]
         btrfs_repair_io_failure+0xd4/0x31c [btrfs]
         btrfs_read_extent_buffer+0x150/0x16c [btrfs]
         read_tree_block+0x38/0xbc [btrfs]
         read_tree_root_path+0xfc/0x1bc [btrfs]
         btrfs_get_root_ref.part.0+0xd4/0x3a8 [btrfs]
         open_ctree+0xa30/0x172c [btrfs]
         btrfs_mount_root+0x3c4/0x4a4 [btrfs]
         legacy_get_tree+0x30/0x60
         vfs_get_tree+0x28/0xec
         vfs_kern_mount.part.0+0x90/0xd4
         vfs_kern_mount+0x14/0x28
         btrfs_mount+0x114/0x418 [btrfs]
         legacy_get_tree+0x30/0x60
         vfs_get_tree+0x28/0xec
         path_mount+0x3e0/0xb64
         __arm64_sys_mount+0x200/0x2d8
         invoke_syscall+0x48/0x114
         el0_svc_common.constprop.0+0x60/0x11c
         do_el0_svc+0x38/0x98
         el0_svc+0x40/0xa8
         el0t_64_sync_handler+0xf4/0x120
         el0t_64_sync+0x190/0x194
        Code: aa0403e2 b0fff060 91010000 959c2024 (d4210000)
      
      [CAUSE]
      In btrfs/027 we test RAID6 with missing devices, in this particular
      case, we're repairing a metadata at the end of a data stripe.
      
      But at btrfs_repair_io_failure(), we always pass a full PAGE for repair,
      and for subpage case this can cross stripe boundary and lead to the
      above BUG_ON().
      
      This metadata repair code is always there, since the introduction of
      subpage support, but this can trigger BUG_ON() after the bio split
      ability at btrfs_map_bio().
      
      [FIX]
      Instead of passing the old PAGE_SIZE, we calculate the correct length
      based on the eb size and page size for both regular and subpage cases.
      
      CC: stable@vger.kernel.org # 6.3+
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      917ac778
  5. 01 Jun, 2023 1 commit
    • Qu Wenruo's avatar
      btrfs: zoned: fix dev-replace after the scrub rework · b675df02
      Qu Wenruo authored
      [BUG]
      After commit e02ee89b ("btrfs: scrub: switch scrub_simple_mirror()
      to scrub_stripe infrastructure"), scrub no longer works for zoned device
      at all.
      
      Even an empty zoned btrfs cannot be replaced:
      
        # mkfs.btrfs -f /dev/nvme0n1
        # mount /dev/nvme0n1 /mnt/btrfs
        # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
        Resetting device zones /dev/nvme1n1 (160 zones) ...
        ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
      
      And we can hit kernel crash related to that:
      
        BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
        BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
        nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
        I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
        BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
        BUG: kernel NULL pointer dereference, address: 00000000000000a8
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
        RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
        Call Trace:
         <IRQ>
         btrfs_lookup_ordered_extent+0x31/0x190
         btrfs_record_physical_zoned+0x18/0x40
         btrfs_simple_end_io+0xaf/0xc0
         blk_update_request+0x153/0x4c0
         blk_mq_end_request+0x15/0xd0
         nvme_poll_cq+0x1d3/0x360
         nvme_irq+0x39/0x80
         __handle_irq_event_percpu+0x3b/0x190
         handle_irq_event+0x2f/0x70
         handle_edge_irq+0x7c/0x210
         __common_interrupt+0x34/0xa0
         common_interrupt+0x7d/0xa0
         </IRQ>
         <TASK>
         asm_common_interrupt+0x22/0x40
      
      [CAUSE]
      Dev-replace reuses scrub code to iterate all extents and write the
      existing content back to the new device.
      
      And for zoned devices, we call fill_writer_pointer_gap() to make sure
      all the writes into the zoned device is sequential, even if there may be
      some gaps between the writes.
      
      However we have several different bugs all related to zoned dev-replace:
      
      - We are using ZONE_APPEND operation for metadata style write back
        For zoned devices, btrfs has two ways to write data:
      
        * ZONE_APPEND for data
          This allows higher queue depth, but will not be able to know where
          the write would land.
          Thus needs to grab the real on-disk physical location in it's endio.
      
        * WRITE for metadata
          This requires single queue depth (new writes can only be submitted
          after previous one finished), and all writes must be sequential.
      
        For scrub, we go single queue depth, but still goes with ZONE_APPEND,
        which requires btrfs_bio::inode being populated.
        This is the cause of that crash.
      
      - No correct tracing of write_pointer
        After a write finished, we should forward sctx->write_pointer, or
        fill_writer_pointer_gap() would not work properly and cause more
        than necessary zero out, and fill the whole zone prematurely.
      
      - Incorrect physical bytenr passed to fill_writer_pointer_gap()
        In scrub_write_sectors(), one call site passes logical address, which
        is completely wrong.
      
        The other call site passes physical address of current sector, but
        we should pass the physical address of the btrfs_bio we're submitting.
      
        This is the cause of the -EIO errors.
      
      [FIX]
      - Do not use ZONE_APPEND for btrfs_submit_repair_write().
      
      - Manually forward sctx->write_pointer after successful writeback
      
      - Use the physical address of the to-be-submitted btrfs_bio for
        fill_writer_pointer_gap()
      
      Now zoned device replace would work as expected.
      Reported-by: default avatarChristoph Hellwig <hch@lst.de>
      Fixes: e02ee89b ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b675df02
  6. 26 May, 2023 3 commits
    • pengfuyuan's avatar
      btrfs: fix csum_tree_block page iteration to avoid tripping on -Werror=array-bounds · 5ad9b471
      pengfuyuan authored
      When compiling on a MIPS 64-bit machine we get these warnings:
      
          In file included from ./arch/mips/include/asm/cacheflush.h:13,
      	             from ./include/linux/cacheflush.h:5,
      	             from ./include/linux/highmem.h:8,
      		     from ./include/linux/bvec.h:10,
      		     from ./include/linux/blk_types.h:10,
                           from ./include/linux/blkdev.h:9,
      	             from fs/btrfs/disk-io.c:7:
          fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
          fs/btrfs/disk-io.c:100:34: error: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Werror=array-bounds]
            100 |   kaddr = page_address(buf->pages[i]);
                |                        ~~~~~~~~~~^~~
          ./include/linux/mm.h:2135:48: note: in definition of macro ‘page_address’
           2135 | #define page_address(page) lowmem_page_address(page)
                |                                                ^~~~
          cc1: all warnings being treated as errors
      
      We can check if i overflows to solve the problem. However, this doesn't make
      much sense, since i == 1 and num_pages == 1 doesn't execute the body of the loop.
      In addition, i < num_pages can also ensure that buf->pages[i] will not cross
      the boundary. Unfortunately, this doesn't help with the problem observed here:
      gcc still complains.
      
      To fix this add a compile-time condition for the extent buffer page
      array size limit, which would eventually lead to eliminating the whole
      for loop.
      
      CC: stable@vger.kernel.org # 5.10+
      Signed-off-by: default avatarpengfuyuan <pengfuyuan@kylinos.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5ad9b471
    • Shida Zhang's avatar
      btrfs: fix an uninitialized variable warning in btrfs_log_inode · 8fd9f423
      Shida Zhang authored
      This fixes the following warning reported by gcc 10.2.1 under x86_64:
      
      ../fs/btrfs/tree-log.c: In function ‘btrfs_log_inode’:
      ../fs/btrfs/tree-log.c:6211:9: error: ‘last_range_start’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
       6211 |   ret = insert_dir_log_key(trans, log, path, key.objectid,
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       6212 |       first_dir_index, last_dir_index);
            |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      ../fs/btrfs/tree-log.c:6161:6: note: ‘last_range_start’ was declared here
       6161 |  u64 last_range_start;
            |      ^~~~~~~~~~~~~~~~
      
      This might be a false positive fixed in later compiler versions but we
      want to have it fixed.
      Reported-by: default avatark2ci <kernel-bot@kylinos.cn>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarShida Zhang <zhangshida@kylinos.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8fd9f423
    • Christoph Hellwig's avatar
      btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work · 45c2f368
      Christoph Hellwig authored
      When I implemented the storage layer bio splitting, I was under the
      assumption that we'll never split metadata bios.  But Qu reminded me that
      this can actually happen with very old file systems with unaligned
      metadata chunks and RAID0.
      
      I still haven't seen such a case in practice, but we better handled this
      case, especially as it is fairly easily to do not calling the ->end_іo
      method directly in btrfs_end_io_work, and using the proper
      btrfs_orig_bbio_end_io helper instead.
      
      In addition to the old file system with unaligned metadata chunks case
      documented in the commit log, the combination of the new scrub code
      with Johannes pending raid-stripe-tree also triggers this case.  We
      spent some time debugging it and found that this patch solves
      the problem.
      
      Fixes: 103c1972 ("btrfs: split the bio submission path into a separate file")
      CC: stable@vger.kernel.org # 6.3+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Tested-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      45c2f368
  7. 17 May, 2023 3 commits
    • Josef Bacik's avatar
      btrfs: use nofs when cleaning up aborted transactions · 597441b3
      Josef Bacik authored
      Our CI system caught a lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        6.3.0-rc7+ #1167 Not tainted
        ------------------------------------------------------
        kswapd0/46 is trying to acquire lock:
        ffff8c6543abd650 (sb_internal#2){++++}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x5f/0x120
      
        but task is already holding lock:
        ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0xa5/0xe0
      	 kmem_cache_alloc+0x31/0x2c0
      	 alloc_extent_state+0x1d/0xd0
      	 __clear_extent_bit+0x2e0/0x4f0
      	 try_release_extent_mapping+0x216/0x280
      	 btrfs_release_folio+0x2e/0x90
      	 invalidate_inode_pages2_range+0x397/0x470
      	 btrfs_cleanup_dirty_bgs+0x9e/0x210
      	 btrfs_cleanup_one_transaction+0x22/0x760
      	 btrfs_commit_transaction+0x3b7/0x13a0
      	 create_subvol+0x59b/0x970
      	 btrfs_mksubvol+0x435/0x4f0
      	 __btrfs_ioctl_snap_create+0x11e/0x1b0
      	 btrfs_ioctl_snap_create_v2+0xbf/0x140
      	 btrfs_ioctl+0xa45/0x28f0
      	 __x64_sys_ioctl+0x88/0xc0
      	 do_syscall_64+0x38/0x90
      	 entry_SYSCALL_64_after_hwframe+0x72/0xdc
      
        -> #0 (sb_internal#2){++++}-{0:0}:
      	 __lock_acquire+0x1435/0x21a0
      	 lock_acquire+0xc2/0x2b0
      	 start_transaction+0x401/0x730
      	 btrfs_commit_inode_delayed_inode+0x5f/0x120
      	 btrfs_evict_inode+0x292/0x3d0
      	 evict+0xcc/0x1d0
      	 inode_lru_isolate+0x14d/0x1e0
      	 __list_lru_walk_one+0xbe/0x1c0
      	 list_lru_walk_one+0x58/0x80
      	 prune_icache_sb+0x39/0x60
      	 super_cache_scan+0x161/0x1f0
      	 do_shrink_slab+0x163/0x340
      	 shrink_slab+0x1d3/0x290
      	 shrink_node+0x300/0x720
      	 balance_pgdat+0x35c/0x7a0
      	 kswapd+0x205/0x410
      	 kthread+0xf0/0x120
      	 ret_from_fork+0x29/0x50
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(sb_internal#2);
      				 lock(fs_reclaim);
          lock(sb_internal#2);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/46:
         #0: ffffffffabe61b40 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x4aa/0x7a0
         #1: ffffffffabe50270 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x113/0x290
         #2: ffff8c6543abd0e0 (&type->s_umount_key#44){++++}-{3:3}, at: super_cache_scan+0x38/0x1f0
      
        stack backtrace:
        CPU: 0 PID: 46 Comm: kswapd0 Not tainted 6.3.0-rc7+ #1167
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x58/0x90
         check_noncircular+0xd6/0x100
         ? save_trace+0x3f/0x310
         ? add_lock_to_list+0x97/0x120
         __lock_acquire+0x1435/0x21a0
         lock_acquire+0xc2/0x2b0
         ? btrfs_commit_inode_delayed_inode+0x5f/0x120
         start_transaction+0x401/0x730
         ? btrfs_commit_inode_delayed_inode+0x5f/0x120
         btrfs_commit_inode_delayed_inode+0x5f/0x120
         btrfs_evict_inode+0x292/0x3d0
         ? lock_release+0x134/0x270
         ? __pfx_wake_bit_function+0x10/0x10
         evict+0xcc/0x1d0
         inode_lru_isolate+0x14d/0x1e0
         __list_lru_walk_one+0xbe/0x1c0
         ? __pfx_inode_lru_isolate+0x10/0x10
         ? __pfx_inode_lru_isolate+0x10/0x10
         list_lru_walk_one+0x58/0x80
         prune_icache_sb+0x39/0x60
         super_cache_scan+0x161/0x1f0
         do_shrink_slab+0x163/0x340
         shrink_slab+0x1d3/0x290
         shrink_node+0x300/0x720
         balance_pgdat+0x35c/0x7a0
         kswapd+0x205/0x410
         ? __pfx_autoremove_wake_function+0x10/0x10
         ? __pfx_kswapd+0x10/0x10
         kthread+0xf0/0x120
         ? __pfx_kthread+0x10/0x10
         ret_from_fork+0x29/0x50
         </TASK>
      
      This happens because when we abort the transaction in the transaction
      commit path we call invalidate_inode_pages2_range on our block group
      cache inodes (if we have space cache v1) and any delalloc inodes we may
      have.  The plain invalidate_inode_pages2_range() call passes through
      GFP_KERNEL, which makes sense in most cases, but not here.  Wrap these
      two invalidate callees with memalloc_nofs_save/memalloc_nofs_restore to
      make sure we don't end up with the fs reclaim dependency under the
      transaction dependency.
      
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      597441b3
    • Johannes Thumshirn's avatar
      btrfs: handle memory allocation failure in btrfs_csum_one_bio · 806570c0
      Johannes Thumshirn authored
      Since f8a53bb5 ("btrfs: handle checksum generation in the storage
      layer") the failures of btrfs_csum_one_bio() are handled via
      bio_end_io().
      
      This means, we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in
      case the allocation of the ordered sums fails.
      
      This also fixes a syzkaller report, where injecting a failure into the
      kvzalloc() call results in a BUG_ON().
      
      Reported-by: syzbot+d8941552e21eac774778@syzkaller.appspotmail.com
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      806570c0
    • Qu Wenruo's avatar
      btrfs: scrub: try harder to mark RAID56 block groups read-only · 7561551e
      Qu Wenruo authored
      Currently we allow a block group not to be marked read-only for scrub.
      
      But for RAID56 block groups if we require the block group to be
      read-only, then we're allowed to use cached content from scrub stripe to
      reduce unnecessary RAID56 reads.
      
      So this patch would:
      
      - Make btrfs_inc_block_group_ro() try harder
        During my tests, for cases like btrfs/061 and btrfs/064, we can hit
        ENOSPC from btrfs_inc_block_group_ro() calls during scrub.
      
        The reason is if we only have one single data chunk, and trying to
        scrub it, we won't have any space left for any newer data writes.
      
        But this check should be done by the caller, especially for scrub
        cases we only temporarily mark the chunk read-only.
        And newer data writes would always try to allocate a new data chunk
        when needed.
      
      - Return error for scrub if we failed to mark a RAID56 chunk read-only
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7561551e
  8. 10 May, 2023 4 commits
    • Qu Wenruo's avatar
      btrfs: make clear_cache mount option to rebuild FST without disabling it · 1d6a4fc8
      Qu Wenruo authored
      Previously clear_cache mount option would simply disable free-space-tree
      feature temporarily then re-enable it to rebuild the whole free space
      tree.
      
      But this is problematic for block-group-tree feature, as we have an
      artificial dependency on free-space-tree feature.
      
      If we go the existing method, after clearing the free-space-tree
      feature, we would flip the filesystem to read-only mode, as we detect a
      super block write with block-group-tree but no free-space-tree feature.
      
      This patch would change the behavior by properly rebuilding the free
      space tree without disabling this feature, thus allowing clear_cache
      mount option to work with block group tree.
      
      Now we can mount a filesystem with block-group-tree feature and
      clear_mount option:
      
        $ mkfs.btrfs  -O block-group-tree /dev/test/scratch1  -f
        $ sudo mount /dev/test/scratch1 /mnt/btrfs -o clear_cache
        $ sudo dmesg -t | head -n 5
        BTRFS info (device dm-1): force clearing of disk cache
        BTRFS info (device dm-1): using free space tree
        BTRFS info (device dm-1): auto enabling async discard
        BTRFS info (device dm-1): rebuilding free space tree
        BTRFS info (device dm-1): checking UUID tree
      
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1d6a4fc8
    • Christoph Hellwig's avatar
      btrfs: zero the buffer before marking it dirty in btrfs_redirty_list_add · c83b56d1
      Christoph Hellwig authored
      btrfs_redirty_list_add zeroes the buffer data and sets the
      EXTENT_BUFFER_NO_CHECK to make sure writeback is fine with a bogus
      header.  But it does that after already marking the buffer dirty, which
      means that writeback could already be looking at the buffer.
      
      Switch the order of operations around so that the buffer is only marked
      dirty when we're ready to write it.
      
      Fixes: d3575156 ("btrfs: zoned: redirty released extent buffers")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c83b56d1
    • Naohiro Aota's avatar
      btrfs: zoned: fix full zone super block reading on ZNS · 02ca9e6f
      Naohiro Aota authored
      When both of the superblock zones are full, we need to check which
      superblock is newer. The calculation of last superblock position is wrong
      as it does not consider zone_capacity and uses the length.
      
      Fixes: 9658b72e ("btrfs: zoned: locate superblock position using zone capacity")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      02ca9e6f
    • Naohiro Aota's avatar
      btrfs: zoned: zone finish data relocation BG with last IO · f84353c7
      Naohiro Aota authored
      For data block groups, we zone finish a zone (or, just deactivate it) when
      seeing the last IO in btrfs_finish_ordered_io(). That is only called for
      IOs using ZONE_APPEND, but we use a regular WRITE command for data
      relocation IOs. Detect it and call btrfs_zone_finish_endio() properly.
      
      Fixes: be1a1d7a ("btrfs: zoned: finish fully written block group")
      CC: stable@vger.kernel.org # 6.1+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f84353c7
  9. 09 May, 2023 3 commits
    • Filipe Manana's avatar
      btrfs: fix backref walking not returning all inode refs · 0cad8f14
      Filipe Manana authored
      When using the logical to ino ioctl v2, if the flag to ignore offsets of
      file extent items (BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET) is given, the
      backref walking code ends up not returning references for all file offsets
      of an inode that point to the given logical bytenr. This happens since
      kernel 6.2, commit 6ce6ba53 ("btrfs: use a single argument for extent
      offset in backref walking functions") because:
      
      1) It mistakenly skipped the search for file extent items in a leaf that
         point to the target extent if that flag is given. Instead it should
         only skip the filtering done by check_extent_in_eb() - that is, it
         should not avoid the calls to that function (or find_extent_in_eb(),
         which uses it).
      
      2) It was also not building a list of inode extent elements (struct
         extent_inode_elem) if we have multiple inode references for an extent
         when the ignore offset flag is given to the logical to ino ioctl - it
         would leave a single element, only the last one that was found.
      
      These stem from the confusing old interface for backref walking functions
      where we had an extent item offset argument that was a pointer to a u64
      and another boolean argument that indicated if the offset should be
      ignored, but the pointer could be NULL. That NULL case is used by
      relocation, qgroup extent accounting and fiemap, simply to avoid building
      the inode extent list for each reference, as it's not necessary for those
      use cases and therefore avoids memory allocations and some computations.
      
      Fix this by adding a boolean argument to the backref walk context
      structure to indicate that the inode extent list should not be built,
      make relocation set that argument to true and fix the backref walking
      logic to skip the calls to check_extent_in_eb() and find_extent_in_eb()
      only if this new argument is true, instead of 'ignore_extent_item_pos'
      being true.
      
      A test case for fstests will be added soon, to provide cover not only
      for these cases but to the logical to ino ioctl in general as well, as
      currently we do not have a test case for it.
      Reported-by: default avatarVladimir Panteleev <git@vladimir.panteleev.md>
      Link: https://lore.kernel.org/linux-btrfs/CAHhfkvwo=nmzrJSqZ2qMfF-rZB-ab6ahHnCD_sq9h4o8v+M7QQ@mail.gmail.com/
      Fixes: 6ce6ba53 ("btrfs: use a single argument for extent offset in backref walking functions")
      CC: stable@vger.kernel.org # 6.2+
      Tested-by: default avatarVladimir Panteleev <git@vladimir.panteleev.md>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0cad8f14
    • Filipe Manana's avatar
      btrfs: fix space cache inconsistency after error loading it from disk · 0004ff15
      Filipe Manana authored
      When loading a free space cache from disk, at __load_free_space_cache(),
      if we fail to insert a bitmap entry, we still increment the number of
      total bitmaps in the btrfs_free_space_ctl structure, which is incorrect
      since we failed to add the bitmap entry. On error we then empty the
      cache by calling __btrfs_remove_free_space_cache(), which will result
      in getting the total bitmaps counter set to 1.
      
      A failure to load a free space cache is not critical, so if a failure
      happens we just rebuild the cache by scanning the extent tree, which
      happens at block-group.c:caching_thread(). Yet the failure will result
      in having the total bitmaps of the btrfs_free_space_ctl always bigger
      by 1 then the number of bitmap entries we have. So fix this by having
      the total bitmaps counter be incremented only if we successfully added
      the bitmap entry.
      
      Fixes: a67509c3 ("Btrfs: add a io_ctl struct and helpers for dealing with the space cache")
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      CC: stable@vger.kernel.org # 4.4+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0004ff15
    • Anastasia Belova's avatar
      btrfs: print-tree: parent bytenr must be aligned to sector size · c87f318e
      Anastasia Belova authored
      Check nodesize to sectorsize in alignment check in print_extent_item.
      The comment states that and this is correct, similar check is done
      elsewhere in the functions.
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      
      Fixes: ea57788e ("btrfs: require only sector size alignment for parent eb bytenr")
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarAnastasia Belova <abelova@astralinux.ru>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c87f318e
  10. 03 May, 2023 1 commit
    • Josef Bacik's avatar
      btrfs: don't free qgroup space unless specified · d246331b
      Josef Bacik authored
      Boris noticed in his simple quotas testing that he was getting a leak
      with Sweet Tea's change to subvol create that stopped doing a
      transaction commit.  This was just a side effect of that change.
      
      In the delayed inode code we have an optimization that will free extra
      reservations if we think we can pack a dir item into an already modified
      leaf.  Previously this wouldn't be triggered in the subvolume create
      case because we'd commit the transaction, it was still possible but
      much harder to trigger.  It could actually be triggered if we did a
      mkdir && subvol create with qgroups enabled.
      
      This occurs because in btrfs_insert_delayed_dir_index(), which gets
      called when we're adding the dir item, we do the following:
      
        btrfs_block_rsv_release(fs_info, trans->block_rsv, bytes, NULL);
      
      if we're able to skip reserving space.
      
      The problem here is that trans->block_rsv points at the temporary block
      rsv for the subvolume create, which has qgroup reservations in the block
      rsv.
      
      This is a problem because btrfs_block_rsv_release() will do the
      following:
      
        if (block_rsv->qgroup_rsv_reserved >= block_rsv->qgroup_rsv_size) {
      	  qgroup_to_release = block_rsv->qgroup_rsv_reserved -
      		  block_rsv->qgroup_rsv_size;
      	  block_rsv->qgroup_rsv_reserved = block_rsv->qgroup_rsv_size;
        }
      
      The temporary block rsv just has ->qgroup_rsv_reserved set,
      ->qgroup_rsv_size == 0.  The optimization in
      btrfs_insert_delayed_dir_index() sets ->qgroup_rsv_reserved = 0.  Then
      later on when we call btrfs_subvolume_release_metadata() which has
      
        btrfs_block_rsv_release(fs_info, rsv, (u64)-1, &qgroup_to_release);
        btrfs_qgroup_convert_reserved_meta(root, qgroup_to_release);
      
      qgroup_to_release is set to 0, and we do not convert the reserved
      metadata space.
      
      The problem here is that the block rsv code has been unconditionally
      messing with ->qgroup_rsv_reserved, because the main place this is used
      is delalloc, and any time we call btrfs_block_rsv_release() we do it
      with qgroup_to_release set, and thus do the proper accounting.
      
      The subvolume code is the only other code that uses the qgroup
      reservation stuff, but it's intermingled with the above optimization,
      and thus was getting its reservation freed out from underneath it and
      thus leaking the reserved space.
      
      The solution is to simply not mess with the qgroup reservations if we
      don't have qgroup_to_release set.  This works with the existing code as
      anything that messes with the delalloc reservations always have
      qgroup_to_release set.  This fixes the leak that Boris was observing.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      CC: stable@vger.kernel.org # 5.4+
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d246331b
  11. 02 May, 2023 1 commit
    • Boris Burkov's avatar
      btrfs: fix encoded write i_size corruption with no-holes · e7db9e5c
      Boris Burkov authored
      We have observed a btrfs filesystem corruption on workloads using
      no-holes and encoded writes via send stream v2. The symptom is that a
      file appears to be truncated to the end of its last aligned extent, even
      though the final unaligned extent and even the file extent and otherwise
      correctly updated inode item have been written.
      
      So if we were writing out a 1MiB+X file via 8 128K extents and one
      extent of length X, i_size would be set to 1MiB, but the ninth extent,
      nbyte, etc. would all appear correct otherwise.
      
      The source of the race is a narrow (one line of code) window in which a
      no-holes fs has read in an updated i_size, but has not yet set a shared
      disk_i_size variable to write. Therefore, if two ordered extents run in
      parallel (par for the course for receive workloads), the following
      sequence can play out: (following "threads" a bit loosely, since there
      are callbacks involved for endio but extra threads aren't needed to
      cause the issue)
      
        ENC-WR1 (second to last)                                         ENC-WR2 (last)
        -------                                                          -------
        btrfs_do_encoded_write
          set i_size = 1M
          submit bio B1 ending at 1M
        endio B1
        btrfs_inode_safe_disk_i_size_write
          local i_size = 1M
          falls off a cliff for some reason
      							      btrfs_do_encoded_write
      								set i_size = 1M+X
      								submit bio B2 ending at 1M+X
      							      endio B2
      							      btrfs_inode_safe_disk_i_size_write
      								local i_size = 1M+X
      								disk_i_size = 1M+X
          disk_i_size = 1M
      							      btrfs_delayed_update_inode
          btrfs_delayed_update_inode
      
      And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and
      writes respect i_size and present a corrupted file missing its last
      extents.
      
      Fix this by holding the inode lock in the no-holes case so that a thread
      can't sneak in a write to disk_i_size that gets overwritten with an out
      of date i_size.
      
      Fixes: 41a2ee75 ("btrfs: introduce per-inode file extent tree")
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      e7db9e5c
  12. 28 Apr, 2023 7 commits
    • Naohiro Aota's avatar
      btrfs: zoned: fix wrong use of bitops API in btrfs_ensure_empty_zones · 631003e2
      Naohiro Aota authored
      find_next_bit and find_next_zero_bit take @size as the second parameter and
      @offset as the third parameter. They are specified opposite in
      btrfs_ensure_empty_zones(). Thanks to the later loop, it never failed to
      detect the empty zones. Fix them and (maybe) return the result a bit
      faster.
      
      Note: the naming is a bit confusing, size has two meanings here, bitmap
      and our range size.
      
      Fixes: 1cd6121f ("btrfs: zoned: implement zoned chunk allocator")
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      631003e2
    • Qu Wenruo's avatar
      btrfs: properly reject clear_cache and v1 cache for block-group-tree · 64b5d5b2
      Qu Wenruo authored
      [BUG]
      With block-group-tree feature enabled, mounting it with clear_cache
      would cause the following transaction abort at mount or remount:
      
        BTRFS info (device dm-4): force clearing of disk cache
        BTRFS info (device dm-4): using free space tree
        BTRFS info (device dm-4): auto enabling async discard
        BTRFS info (device dm-4): clearing free space tree
        BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
        BTRFS info (device dm-4): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
        BTRFS error (device dm-4): block-group-tree feature requires fres-space-tree and no-holes
        BTRFS error (device dm-4): super block corruption detected before writing it to disk
        BTRFS: error (device dm-4) in write_all_supers:4288: errno=-117 Filesystem corrupted (unexpected superblock corruption detected)
        BTRFS warning (device dm-4: state E): Skipping commit of aborted transaction.
      
      [CAUSE]
      For block-group-tree feature, we have an artificial dependency on
      free-space-tree.
      
      This means if we detect block-group-tree without v2 cache, we consider
      it a corruption and cause the problem.
      
      For clear_cache mount option, it would temporary disable v2 cache, then
      re-enable it.
      
      But unfortunately for that temporary v2 cache disabled status, we refuse
      to write a superblock with bg tree only flag, thus leads to the above
      transaction abortion.
      
      [FIX]
      For now, just reject clear_cache and v1 cache mount option for block
      group tree.  So now we got a graceful rejection other than a transaction
      abort:
      
        BTRFS info (device dm-4): force clearing of disk cache
        BTRFS error (device dm-4): cannot disable free space tree with block-group-tree feature
        BTRFS error (device dm-4): open_ctree failed
      
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      64b5d5b2
    • Filipe Manana's avatar
      btrfs: print extent buffers when sibling keys check fails · a2cea677
      Filipe Manana authored
      When trying to move keys from one node/leaf to another sibling node/leaf,
      if the sibling keys check fails we just print an error message with the
      last key of the left sibling and the first key of the right sibling.
      However it's also useful to print all the keys of each sibling, as it
      may provide some clues to what went wrong, which code path may be
      inserting keys in an incorrect order. So just do that, print the siblings
      with btrfs_print_tree(), as it works for both leaves and nodes.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a2cea677
    • Filipe Manana's avatar
      btrfs: abort transaction when sibling keys check fails for leaves · 9ae5afd0
      Filipe Manana authored
      If the sibling keys check fails before we move keys from one sibling
      leaf to another, we are not aborting the transaction - we leave that to
      some higher level caller of btrfs_search_slot() (or anything else that
      uses it to insert items into a b+tree).
      
      This means that the transaction abort will provide a stack trace that
      omits the b+tree modification call chain. So change this to immediately
      abort the transaction and therefore get a more useful stack trace that
      shows us the call chain in the bt+tree modification code.
      
      It's also important to immediately abort the transaction just in case
      some higher level caller is not doing it, as this indicates a very
      serious corruption and we should stop the possibility of doing further
      damage.
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9ae5afd0
    • Filipe Manana's avatar
      btrfs: fix leak of source device allocation state after device replace · 611ccc58
      Filipe Manana authored
      When a device replace finishes, the source device is freed by calling
      btrfs_free_device() at btrfs_rm_dev_replace_free_srcdev(), but the
      allocation state, tracked in the device's alloc_state io tree, is never
      freed.
      
      This is a regression recently introduced by commit f0bb5474 ("btrfs:
      remove redundant release of btrfs_device::alloc_state"), which removed a
      call to extent_io_tree_release() from btrfs_free_device(), with the
      rationale that btrfs_close_one_device() already releases the allocation
      state from a device and btrfs_close_one_device() is always called before
      a device is freed with btrfs_free_device(). However that is not true for
      the device replace case, as btrfs_free_device() is called without any
      previous call to btrfs_close_one_device().
      
      The issue is trivial to reproduce, for example, by running test btrfs/027
      from fstests:
      
        $ ./check btrfs/027
        $ rmmod btrfs
        $ dmesg
        (...)
        [84519.395485] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg started
        [84519.466224] BTRFS info (device sdc): dev_replace from <missing disk> (devid 2) to /dev/sdg finished
        [84519.552251] BTRFS info (device sdc): scrub: started on devid 1
        [84519.552277] BTRFS info (device sdc): scrub: started on devid 2
        [84519.552332] BTRFS info (device sdc): scrub: started on devid 3
        [84519.552705] BTRFS info (device sdc): scrub: started on devid 4
        [84519.604261] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0
        [84519.609374] BTRFS info (device sdc): scrub: finished on devid 3 with status: 0
        [84519.610818] BTRFS info (device sdc): scrub: finished on devid 1 with status: 0
        [84519.610927] BTRFS info (device sdc): scrub: finished on devid 2 with status: 0
        [84559.503795] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
        [84559.506764] BTRFS: state leak: start 1048576 end 1347420159 state 1 in tree 1 refs 1
        [84559.510294] BTRFS: state leak: start 1048576 end 1351614463 state 1 in tree 1 refs 1
      
      So fix this by adding back the call to extent_io_tree_release() at
      btrfs_free_device().
      
      Fixes: f0bb5474 ("btrfs: remove redundant release of btrfs_device::alloc_state")
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      611ccc58
    • xiaoshoukui's avatar
      btrfs: fix assertion of exclop condition when starting balance · ac868bc9
      xiaoshoukui authored
      Balance as exclusive state is compatible with paused balance and device
      add, which makes some things more complicated. The assertion of valid
      states when starting from paused balance needs to take into account two
      more states, the combinations can be hit when there are several threads
      racing to start balance and device add. This won't typically happen when
      the commands are started from command line.
      
      Scenario 1: With exclusive_operation state == BTRFS_EXCLOP_NONE.
      
      Concurrently adding multiple devices to the same mount point and
      btrfs_exclop_finish executed finishes before assertion in
      btrfs_exclop_balance, exclusive_operation will changed to
      BTRFS_EXCLOP_NONE state which lead to assertion failed:
      
        fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
        fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD,
        in fs/btrfs/ioctl.c:456
        Call Trace:
         <TASK>
         btrfs_exclop_balance+0x13c/0x310
         ? memdup_user+0xab/0xc0
         ? PTR_ERR+0x17/0x20
         btrfs_ioctl_add_dev+0x2ee/0x320
         btrfs_ioctl+0x9d5/0x10d0
         ? btrfs_ioctl_encoded_write+0xb80/0xb80
         __x64_sys_ioctl+0x197/0x210
         do_syscall_64+0x3c/0xb0
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Scenario 2: With exclusive_operation state == BTRFS_EXCLOP_BALANCE_PAUSED.
      
      Concurrently adding multiple devices to the same mount point and
      btrfs_exclop_balance executed finish before the latter thread execute
      assertion in btrfs_exclop_balance, exclusive_operation will changed to
      BTRFS_EXCLOP_BALANCE_PAUSED state which lead to assertion failed:
      
        fs_info->exclusive_operation == BTRFS_EXCLOP_BALANCE ||
        fs_info->exclusive_operation == BTRFS_EXCLOP_DEV_ADD ||
        fs_info->exclusive_operation == BTRFS_EXCLOP_NONE,
        fs/btrfs/ioctl.c:458
        Call Trace:
         <TASK>
         btrfs_exclop_balance+0x240/0x410
         ? memdup_user+0xab/0xc0
         ? PTR_ERR+0x17/0x20
         btrfs_ioctl_add_dev+0x2ee/0x320
         btrfs_ioctl+0x9d5/0x10d0
         ? btrfs_ioctl_encoded_write+0xb80/0xb80
         __x64_sys_ioctl+0x197/0x210
         do_syscall_64+0x3c/0xb0
         entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      An example of the failed assertion is below, which shows that the
      paused balance is also needed to be checked.
      
        root@syzkaller:/home/xsk# ./repro
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        [  416.611428][ T7970] BTRFS info (device loop0): fs_info exclusive_operation: 0
        Failed to add device /dev/vda, errno 14
        [  416.613973][ T7971] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.615456][ T7972] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.617528][ T7973] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.618359][ T7974] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.622589][ T7975] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.624034][ T7976] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.626420][ T7977] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.627643][ T7978] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.629006][ T7979] BTRFS info (device loop0): fs_info exclusive_operation: 3
        [  416.630298][ T7980] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        Failed to add device /dev/vda, errno 14
        [  416.632787][ T7981] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.634282][ T7982] BTRFS info (device loop0): fs_info exclusive_operation: 3
        Failed to add device /dev/vda, errno 14
        [  416.636202][ T7983] BTRFS info (device loop0): fs_info exclusive_operation: 3
        [  416.637012][ T7984] BTRFS info (device loop0): fs_info exclusive_operation: 1
        Failed to add device /dev/vda, errno 14
        [  416.637759][ T7984] assertion failed: fs_info->exclusive_operation ==
        BTRFS_EXCLOP_BALANCE || fs_info->exclusive_operation ==
        BTRFS_EXCLOP_DEV_ADD || fs_info->exclusive_operation ==
        BTRFS_EXCLOP_NONE, in fs/btrfs/ioctl.c:458
        [  416.639845][ T7984] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
        [  416.640485][ T7984] CPU: 0 PID: 7984 Comm: repro Not tainted 6.2.0 #7
        [  416.641172][ T7984] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
        [  416.642090][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
        [  416.644423][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
        [  416.645018][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
        [  416.645763][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
        [  416.646554][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
        [  416.647299][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
        [  416.648041][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
        [  416.648785][ T7984] FS:  00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
        [  416.649616][ T7984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  416.650238][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
        [  416.650980][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  416.651725][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  416.652502][ T7984] PKRU: 55555554
        [  416.652888][ T7984] Call Trace:
        [  416.653241][ T7984]  <TASK>
        [  416.653527][ T7984]  btrfs_exclop_balance+0x240/0x410
        [  416.654036][ T7984]  ? memdup_user+0xab/0xc0
        [  416.654465][ T7984]  ? PTR_ERR+0x17/0x20
        [  416.654874][ T7984]  btrfs_ioctl_add_dev+0x2ee/0x320
        [  416.655380][ T7984]  btrfs_ioctl+0x9d5/0x10d0
        [  416.655822][ T7984]  ? btrfs_ioctl_encoded_write+0xb80/0xb80
        [  416.656400][ T7984]  __x64_sys_ioctl+0x197/0x210
        [  416.656874][ T7984]  do_syscall_64+0x3c/0xb0
        [  416.657346][ T7984]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
        [  416.657922][ T7984] RIP: 0033:0x4546af
        [  416.660170][ T7984] RSP: 002b:00007fa2985d4150 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
        [  416.660972][ T7984] RAX: ffffffffffffffda RBX: 00007fa2985d4640 RCX: 00000000004546af
        [  416.661714][ T7984] RDX: 0000000000000000 RSI: 000000005000940a RDI: 0000000000000003
        [  416.662449][ T7984] RBP: 00007fa2985d41d0 R08: 0000000000000000 R09: 00007ffee37a4c4f
        [  416.663195][ T7984] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fa2985d4640
        [  416.663951][ T7984] R13: 0000000000000009 R14: 000000000041b320 R15: 00007fa297dd4000
        [  416.664703][ T7984]  </TASK>
        [  416.665040][ T7984] Modules linked in:
        [  416.665590][ T7984] ---[ end trace 0000000000000000 ]---
        [  416.666176][ T7984] RIP: 0010:btrfs_assertfail+0x2c/0x2e
        [  416.668775][ T7984] RSP: 0018:ffffc90003ea7e28 EFLAGS: 00010282
        [  416.669425][ T7984] RAX: 00000000000000cc RBX: 0000000000000000 RCX: 0000000000000000
        [  416.670235][ T7984] RDX: ffff88801d030000 RSI: ffffffff81637e7c RDI: fffff520007d4fb7
        [  416.671050][ T7984] RBP: ffffffff8a533de0 R08: 00000000000000cc R09: 0000000000000000
        [  416.671867][ T7984] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffff8a533da0
        [  416.672685][ T7984] R13: 00000000000001ca R14: 000000005000940a R15: 0000000000000000
        [  416.673501][ T7984] FS:  00007fa2985d4640(0000) GS:ffff88802cc00000(0000) knlGS:0000000000000000
        [  416.674425][ T7984] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  416.675114][ T7984] CR2: 0000000000000000 CR3: 0000000018e5e000 CR4: 0000000000750ef0
        [  416.675933][ T7984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  416.676760][ T7984] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Link: https://lore.kernel.org/linux-btrfs/20230324031611.98986-1-xiaoshoukui@gmail.com/
      CC: stable@vger.kernel.org # 6.1+
      Signed-off-by: default avatarxiaoshoukui <xiaoshoukui@ruijie.com.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ac868bc9
    • Filipe Manana's avatar
      btrfs: fix btrfs_prev_leaf() to not return the same key twice · 6f932d4e
      Filipe Manana authored
      A call to btrfs_prev_leaf() may end up returning a path that points to the
      same item (key) again. This happens if while btrfs_prev_leaf(), after we
      release the path, a concurrent insertion happens, which moves items off
      from a sibling into the front of the previous leaf, and an item with the
      computed previous key does not exists.
      
      For example, suppose we have the two following leaves:
      
        Leaf A
      
        -------------------------------------------------------------
        | ...   key (300 96 10)   key (300 96 15)   key (300 96 16) |
        -------------------------------------------------------------
                    slot 20             slot 21             slot 22
      
        Leaf B
      
        -------------------------------------------------------------
        | key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
        -------------------------------------------------------------
            slot 0             slot 1             slot 2
      
      If we call btrfs_prev_leaf(), from btrfs_previous_item() for example, with
      a path pointing to leaf B and slot 0 and the following happens:
      
      1) At btrfs_prev_leaf() we compute the previous key to search as:
         (300 96 19), which is a key that does not exists in the tree;
      
      2) Then we call btrfs_release_path() at btrfs_prev_leaf();
      
      3) Some other task inserts a key at leaf A, that sorts before the key at
         slot 20, for example it has an objectid of 299. In order to make room
         for the new key, the key at slot 22 is moved to the front of leaf B.
         This happens at push_leaf_right(), called from split_leaf().
      
         After this leaf B now looks like:
      
        --------------------------------------------------------------------------------
        | key (300 96 16)    key (300 96 20)   key (300 96 21)   key (300 96 22)   ... |
        --------------------------------------------------------------------------------
             slot 0              slot 1             slot 2             slot 3
      
      4) At btrfs_prev_leaf() we call btrfs_search_slot() for the computed
         previous key: (300 96 19). Since the key does not exists,
         btrfs_search_slot() returns 1 and with a path pointing to leaf B
         and slot 1, the item with key (300 96 20);
      
      5) This makes btrfs_prev_leaf() return a path that points to slot 1 of
         leaf B, the same key as before it was called, since the key at slot 0
         of leaf B (300 96 16) is less than the computed previous key, which is
         (300 96 19);
      
      6) As a consequence btrfs_previous_item() returns a path that points again
         to the item with key (300 96 20).
      
      For some users of btrfs_prev_leaf() or btrfs_previous_item() this may not
      be functional a problem, despite not making sense to return a new path
      pointing again to the same item/key. However for a caller such as
      tree-log.c:log_dir_items(), this has a bad consequence, as it can result
      in not logging some dir index deletions in case the directory is being
      logged without holding the inode's VFS lock (logging triggered while
      logging a child inode for example) - for the example scenario above, in
      case the dir index keys 17, 18 and 19 were deleted in the current
      transaction.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6f932d4e
  13. 17 Apr, 2023 12 commits
    • Josh Poimboeuf's avatar
      btrfs: mark btrfs_assertfail() __noreturn · f3724631
      Josh Poimboeuf authored
      Fixes a bunch of warnings including:
      
        vmlinux.o: warning: objtool: select_reloc_root+0x314: unreachable instruction
        vmlinux.o: warning: objtool: finish_inode_if_needed+0x15b1: unreachable instruction
        vmlinux.o: warning: objtool: get_bio_sector_nr+0x259: unreachable instruction
        vmlinux.o: warning: objtool: raid_wait_read_end_io+0xc26: unreachable instruction
        vmlinux.o: warning: objtool: raid56_parity_alloc_scrub_rbio+0x37b: unreachable instruction
        ...
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Link: https://lore.kernel.org/oe-kbuild-all/202302210709.IlXfgMpX-lkp@intel.com/Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@kernel.org>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f3724631
    • Genjian Zhang's avatar
      btrfs: fix uninitialized variable warnings · 8ba7d5f5
      Genjian Zhang authored
      There are some warnings on older compilers (gcc 10, 7) or non-x86_64
      architectures (aarch64).  As btrfs wants to enable -Wmaybe-uninitialized
      by default, fix the warnings even though it's not necessary on recent
      compilers (gcc 12+).
      
      ../fs/btrfs/volumes.c: In function ‘btrfs_init_new_device’:
      ../fs/btrfs/volumes.c:2703:3: error: ‘seed_devices’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
       2703 |   btrfs_setup_sprout(fs_info, seed_devices);
            |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      ../fs/btrfs/send.c: In function ‘get_cur_inode_state’:
      ../include/linux/compiler.h:70:32: error: ‘right_gen’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
         70 |   (__if_trace.miss_hit[1]++,1) :  \
            |                                ^
      ../fs/btrfs/send.c:1878:6: note: ‘right_gen’ was declared here
       1878 |  u64 right_gen;
            |      ^~~~~~~~~
      Reported-by: default avatark2ci <kernel-bot@kylinos.cn>
      Signed-off-by: default avatarGenjian Zhang <zhanggenjian@kylinos.cn>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      [ update changelog ]
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8ba7d5f5
    • Filipe Manana's avatar
      btrfs: use log root when iterating over index keys when logging directory · 5d3e4f1d
      Filipe Manana authored
      When logging dir dentries of a directory, we iterate over the subvolume
      tree to find dir index keys on leaves modified in the current transaction.
      This however is heavy on locking, since btrfs_search_forward() may often
      keep locks on extent buffers for quite a while when walking the tree to
      find a suitable leaf modified in the current transaction and with a key
      not smaller than then the provided minimum key. That means it will block
      other tasks trying to access the subvolume tree, which may be common fs
      operations like creating, renaming, linking, unlinking, reflinking files,
      etc.
      
      A better solution is to iterate the log tree, since it's much smaller than
      a subvolume tree and just use plain btrfs_search_slot() (or the wrapper
      btrfs_for_each_slot()) and only contains dir index keys added in the
      current transaction.
      
      The following bonnie++ test on a non-debug kernel (with Debian's default
      kernel config) on a 20G null block device, was used to measure the impact:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/nullb0
         MNT=/mnt/nullb0
      
         NR_DIRECTORIES=20
         NR_FILES=20480  # must be a multiple of 1024
         DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes
         DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES ))
         NR_FILES=$(( NR_FILES / 1024 ))
      
         umount $DEV &> /dev/null
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         bonnie++ -u root -d $MNT \
             -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
             -r 0 -s $DATASET_SIZE -b
      
         umount $MNT
      
      Before patchset:
      
         Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                             -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
         Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
         debian0          8G  376k  99  1.1g  98  939m  92 1527k  99  3.2g  99  9060 256
         Latency             24920us     207us     680ms    5594us     171us    2891us
         Version 2.00a       ------Sequential Create------ --------Random Create--------
         debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                       files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                       20/20 20480  96 +++++ +++ 20480  95 20480  99 +++++ +++ 20480  97
         Latency              8708us     137us    5128us    6743us      60us   19712us
      
      After patchset:
      
         Version 2.00a       ------Sequential Output------ --Sequential Input- --Random-
                             -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
         Name:Size etc        /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
         debian0          8G  384k  99  1.2g  99  971m  91 1533k  99  3.3g  99  9180 309
         Latency             24930us     125us     661ms    5587us      46us    2020us
         Version 2.00a       ------Sequential Create------ --------Random Create--------
         debian0             -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
                       files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                       20/20 20480  90 +++++ +++ 20480  99 20480  99 +++++ +++ 20480  97
         Latency              7030us      61us    1246us    4942us      56us   16855us
      
      The patchset consists of this patch plus a previous one that has the
      following subject:
      
         "btrfs: avoid iterating over all indexes when logging directory"
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5d3e4f1d
    • Filipe Manana's avatar
      btrfs: avoid iterating over all indexes when logging directory · fa4b8cb1
      Filipe Manana authored
      When logging a directory, after copying all directory index items from the
      subvolume tree to the log tree, we iterate over the subvolume tree to find
      all dir index items that are located in leaves COWed (or created) in the
      current transaction. If we keep logging a directory several times during
      the same transaction, we end up iterating over the same dir index items
      everytime we log the directory, wasting time and adding extra lock
      contention on the subvolume tree.
      
      So just keep track of the last logged dir index offset in order to start
      the search for that index (+1) the next time the directory is logged, as
      dir index values (key offsets) come from a monotonically increasing
      counter.
      
      The following test measures the difference before and after this change:
      
        $ cat test.sh
        #!/bin/bash
      
        DEV=/dev/nullb0
        MNT=/mnt/nullb0
      
        umount $DEV &> /dev/null
        mkfs.btrfs -f $DEV
        mount -o ssd $DEV $MNT
      
        # Time values in milliseconds.
        declare -a fsync_times
        # Total number of files added to the test directory.
        num_files=1000000
        # Fsync directory after every N files are added.
        fsync_period=100
      
        mkdir $MNT/testdir
      
        fsync_total_time=0
        for ((i = 1; i <= $num_files; i++)); do
              echo -n > $MNT/testdir/file_$i
      
              if [ $((i % fsync_period)) -eq 0 ]; then
                      start=$(date +%s%N)
                      xfs_io -c "fsync" $MNT/testdir
                      end=$(date +%s%N)
                      fsync_total_time=$((fsync_total_time + (end - start)))
                      fsync_times[i]=$(( (end - start) / 1000000 ))
                      echo -n -e "Progress $i / $num_files\r"
              fi
        done
      
        echo -e "\nHistogram of directory fsync duration in ms:\n"
      
        printf '%s\n' "${fsync_times[@]}" | \
           perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'
      
        fsync_total_time=$((fsync_total_time / 1000000))
        echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
        echo
      
        umount $MNT
      
      The test was run on a non-debug kernel (Debian's default kernel config)
      against a 15G null block device.
      
      Result before this change:
      
         Histogram of directory fsync duration in ms:
      
         Count: 10000
         Range:  3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
         Percentiles:  90th: 71.000; 95th: 77.000; 99th: 81.000
            3.000 -    5.278:  1423 #################################
            5.278 -    8.854:  1173 ###########################
            8.854 -   14.467:   591 ##############
           14.467 -   23.277:  1025 #######################
           23.277 -   37.105:  1422 #################################
           37.105 -   58.809:  2036 ###############################################
           58.809 -   92.876:  2316 #####################################################
           92.876 -  146.346:     6 |
          146.346 -  230.271:     6 |
          230.271 -  362.000:     2 |
      
         Total time spent in fsync: 350527 ms
      
      Result after this change:
      
         Histogram of directory fsync duration in ms:
      
         Count: 10000
         Range:  3.000 - 1088.000; Mean:  8.704; Median:  8.000; Stddev: 12.576
         Percentiles:  90th: 12.000; 95th: 14.000; 99th: 17.000
            3.000 -    6.007:  3222 #################################
            6.007 -   11.276:  5197 #####################################################
           11.276 -   20.506:  1551 ################
           20.506 -   36.674:    24 |
           36.674 -  201.552:     1 |
          201.552 -  353.841:     4 |
          353.841 - 1088.000:     1 |
      
         Total time spent in fsync: 92114 ms
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa4b8cb1
    • Qu Wenruo's avatar
      btrfs: dev-replace: error out if we have unrepaired metadata error during · 8eb3dd17
      Qu Wenruo authored
      [BUG]
      Even before the scrub rework, if we have some corrupted metadata failed
      to be repaired during replace, we still continue replacing and let it
      finish just as there is nothing wrong:
      
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
       BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
       BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
       BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
      
      This can lead to unexpected problems for the resulting filesystem.
      
      [CAUSE]
      Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
      But unlike scrub, dev-replace doesn't really bother to check the scrub
      progress, which records all the errors found during replace.
      
      And even if we check the progress, we cannot really determine which
      errors are minor, which are critical just by the plain numbers.
      (remember we don't treat metadata/data checksum error differently).
      
      This behavior is there from the very beginning.
      
      [FIX]
      Instead of continuing the replace, just error out if we hit an
      unrepaired metadata sector.
      
      Now the dev-replace would be rejected with -EIO, to let the user know.
      Although it also means, the filesystem has some metadata error which
      cannot be repaired, the user would be upset anyway.
      
      The new dmesg would look like this:
      
       BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
       BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
       BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
       BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
       BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
       BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8eb3dd17
    • Filipe Manana's avatar
      btrfs: remove pointless loop at btrfs_get_next_valid_item() · 524f14bb
      Filipe Manana authored
      It's pointless to have a while loop at btrfs_get_next_valid_item(), as if
      the slot on the current leaf is beyond the last item, we call
      btrfs_next_leaf(), which leaves us at a valid slot of the next leaf (or
      a valid slot in the current leaf if after releasing the path an item gets
      pushed from the next leaf to the current leaf).
      
      So just call btrfs_next_leaf() if the current slot on the current leaf is
      beyond the last item.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      524f14bb
    • Qu Wenruo's avatar
      btrfs: scrub: reject unsupported scrub flags · 604e6681
      Qu Wenruo authored
      Since the introduction of scrub interface, the only flag that we support
      is BTRFS_SCRUB_READONLY.  Thus there is no sanity checks, if there are
      some undefined flags passed in, we just ignore them.
      
      This is problematic if we want to introduce new scrub flags, as we have
      no way to determine if such flags are supported.
      
      Address the problem by introducing a check for the flags, and if
      unsupported flags are set, return -EOPNOTSUPP to inform the user space.
      
      This check should be backported for all supported kernels before any new
      scrub flags are introduced.
      
      CC: stable@vger.kernel.org # 4.14+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      604e6681
    • Boris Burkov's avatar
      btrfs: reinterpret async discard iops_limit=0 as no delay · f263a7c3
      Boris Burkov authored
      Currently, a limit of 0 results in a hard coded metering over 6 hours.
      Since the default is a set limit, I suspect no one truly depends on this
      rather arbitrary setting. Repurpose it for an arguably more useful
      "unlimited" mode, where the delay is 0.
      
      Note that if block groups are too new, or go fully empty, there is still
      a delay associated with those conditions. Those delays implement
      heuristics for not trimming a region we are relatively likely to fully
      overwrite soon.
      
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: default avatarNeal Gompa <neal@gompa.dev>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f263a7c3
    • Boris Burkov's avatar
      btrfs: set default discard iops_limit to 1000 · cfe3445a
      Boris Burkov authored
      Previously, the default was a relatively conservative 10. This results
      in a 100ms delay, so with ~300 discards in a commit, it takes the full
      30s till the next commit to finish the discards. On a workstation, this
      results in the disk never going idle, wasting power/battery, etc.
      
      Set the default to 1000, which results in using the smallest possible
      delay, currently, which is 1ms. This has shown to not pathologically
      keep the disk busy by the original reporter.
      
      Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/
      Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228
      CC: stable@vger.kernel.org # 6.2+
      Reviewed-by: Neal Gompa <neal@gompa.dev
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cfe3445a
    • Qu Wenruo's avatar
      btrfs: remove unused raid56 functions which were dedicated for scrub · aca43fe8
      Qu Wenruo authored
      Since the scrub rework, the following RAID56 functions are no longer
      called:
      
      - raid56_add_scrub_pages()
      - raid56_alloc_missing_rbio()
      - raid56_submit_missing_rbio()
      
      Those functions are all utilized by scrub to handle missing device cases
      for RAID56.
      
      However the new scrub code handle them in a completely different way:
      
      - If it's data stripe, go recovery path through btrfs_submit_bio()
      - If it's P/Q stripe, it would be handled through
        raid56_parity_submit_scrub_rbio()
        And that function would handle dev-replace and repair properly.
      
      Thus we can safely remove those functions.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      aca43fe8
    • Qu Wenruo's avatar
      btrfs: scrub: remove scrub_bio structure · 13a62fd9
      Qu Wenruo authored
      Since scrub path has been fully moved to scrub_stripe based facilities,
      no more scrub_bio would be submitted.
      Thus we can remove it completely, this involves:
      
      - SCRUB_SECTORS_PER_BIO macro
      - SCRUB_BIOS_PER_SCTX macro
      - SCRUB_MAX_PAGES macro
      - BTRFS_MAX_MIRRORS macro
      - scrub_bio structure
      - scrub_ctx::bios member
      - scrub_ctx::curr member
      - scrub_ctx::bios_in_flight member
      - scrub_ctx::workers_pending member
      - scrub_ctx::list_lock member
      - scrub_ctx::list_wait member
      
      - function scrub_bio_end_io_worker()
      - function scrub_pending_bio_inc()
      - function scrub_pending_bio_dec()
      - function scrub_throttle()
      - function scrub_submit()
      
      - function scrub_find_csum()
      - function drop_csum_range()
      
      - Some unnecessary flush and scrub pauses
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      13a62fd9
    • Qu Wenruo's avatar
      btrfs: scrub: remove scrub_block and scrub_sector structures · 001e3fc2
      Qu Wenruo authored
      Those two structures are used to represent a bunch of sectors for scrub,
      but now they are fully replaced by scrub_stripe in one go, so we can
      remove them. This involves:
      
      - structure scrub_block
      - structure scrub_sector
      
      - structure scrub_page_private
      - function attach_scrub_page_private()
      - function detach_scrub_page_private()
        Now we no longer need to use page::private to handle subpage.
      
      - function alloc_scrub_block()
      - function alloc_scrub_sector()
      - function scrub_sector_get_page()
      - function scrub_sector_get_page_offset()
      - function scrub_sector_get_kaddr()
      - function bio_add_scrub_sector()
      
      - function scrub_checksum_data()
      - function scrub_checksum_tree_block()
      - function scrub_checksum_super()
      - function scrub_check_fsid()
      - function scrub_block_get()
      - function scrub_block_put()
      - function scrub_sector_get()
      - function scrub_sector_put()
      - function scrub_bio_end_io()
      - function scrub_block_complete()
      - function scrub_add_sector_to_rd_bio()
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      001e3fc2