1. 23 Nov, 2022 3 commits
    • Zhen Lei's avatar
      btrfs: sysfs: normalize the error handling branch in btrfs_init_sysfs() · ffdbb44f
      Zhen Lei authored
      Although kset_unregister() can eventually remove all attribute files,
      explicitly rolling back with the matching function makes the code logic
      look clearer.
      
      CC: stable@vger.kernel.org # 5.4+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarZhen Lei <thunder.leizhen@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ffdbb44f
    • Filipe Manana's avatar
      btrfs: do not modify log tree while holding a leaf from fs tree locked · 796787c9
      Filipe Manana authored
      When logging an inode in full mode, or when logging xattrs or when logging
      the dir index items of a directory, we are modifying the log tree while
      holding a read lock on a leaf from the fs/subvolume tree. This can lead to
      a deadlock in rare circumstances, but it is a real possibility, and it was
      recently reported by syzbot with the following trace from lockdep:
      
         WARNING: possible circular locking dependency detected
         6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
         ------------------------------------------------------
         syz-executor.1/16154 is trying to acquire lock:
         ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
      
         but task is already holding lock:
         ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
      
         which lock already depends on the new lock.
      
         the existing dependency chain (in reverse order) is:
      
         -> #2 (btrfs-log-00){++++}-{3:3}:
                down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
                __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
                btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
                btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
                btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
                btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
                btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
                btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
                log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
                copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
                copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
                btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
                btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
                btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
                btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
                vfs_fsync_range+0x13e/0x230 fs/sync.c:188
                generic_write_sync include/linux/fs.h:2856 [inline]
                iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
                btrfs_direct_write fs/btrfs/file.c:1536 [inline]
                btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
                call_write_iter include/linux/fs.h:2160 [inline]
                do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
                do_iter_write+0x182/0x700 fs/read_write.c:861
                vfs_iter_write+0x74/0xa0 fs/read_write.c:902
                iter_file_splice_write+0x745/0xc90 fs/splice.c:686
                do_splice_from fs/splice.c:764 [inline]
                direct_splice_actor+0x114/0x180 fs/splice.c:931
                splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
                do_splice_direct+0x1ab/0x280 fs/splice.c:974
                do_sendfile+0xb19/0x1270 fs/read_write.c:1255
                __do_sys_sendfile64 fs/read_write.c:1323 [inline]
                __se_sys_sendfile64 fs/read_write.c:1309 [inline]
                __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
                do_syscall_x64 arch/x86/entry/common.c:50 [inline]
                do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
                entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
         -> #1 (btrfs-tree-00){++++}-{3:3}:
                __lock_release kernel/locking/lockdep.c:5382 [inline]
                lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
                up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
                btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
                btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
                search_leaf fs/btrfs/ctree.c:1832 [inline]
                btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
                btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
                btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
                btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
                __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
                __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
                flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
                btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
                process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
                worker_thread+0x669/0x1090 kernel/workqueue.c:2436
                kthread+0x2e8/0x3a0 kernel/kthread.c:376
                ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
      
         -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
                check_prev_add kernel/locking/lockdep.c:3097 [inline]
                check_prevs_add kernel/locking/lockdep.c:3216 [inline]
                validate_chain kernel/locking/lockdep.c:3831 [inline]
                __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
                lock_acquire kernel/locking/lockdep.c:5668 [inline]
                lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
                __mutex_lock_common kernel/locking/mutex.c:603 [inline]
                __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
                __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
                __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
                btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
                btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
                btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
                evict+0x2ed/0x6b0 fs/inode.c:664
                dispose_list+0x117/0x1e0 fs/inode.c:697
                prune_icache_sb+0xeb/0x150 fs/inode.c:896
                super_cache_scan+0x391/0x590 fs/super.c:106
                do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
                shrink_slab_memcg mm/vmscan.c:912 [inline]
                shrink_slab+0x388/0x660 mm/vmscan.c:991
                shrink_node_memcgs mm/vmscan.c:6088 [inline]
                shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
                shrink_zones mm/vmscan.c:6355 [inline]
                do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
                try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
                reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
                mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
                try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
                try_charge mm/memcontrol.c:2827 [inline]
                charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
                __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
                mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
                __filemap_add_folio+0x615/0xf80 mm/filemap.c:852
                filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
                __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
                pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
                find_or_create_page include/linux/pagemap.h:612 [inline]
                alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
                btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
                btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
                __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
                btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
                btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
                btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
                update_log_root fs/btrfs/tree-log.c:2841 [inline]
                btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
                btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
                vfs_fsync_range+0x13e/0x230 fs/sync.c:188
                generic_write_sync include/linux/fs.h:2856 [inline]
                iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
                btrfs_direct_write fs/btrfs/file.c:1536 [inline]
                btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
                call_write_iter include/linux/fs.h:2160 [inline]
                do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
                do_iter_write+0x182/0x700 fs/read_write.c:861
                vfs_iter_write+0x74/0xa0 fs/read_write.c:902
                iter_file_splice_write+0x745/0xc90 fs/splice.c:686
                do_splice_from fs/splice.c:764 [inline]
                direct_splice_actor+0x114/0x180 fs/splice.c:931
                splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
                do_splice_direct+0x1ab/0x280 fs/splice.c:974
                do_sendfile+0xb19/0x1270 fs/read_write.c:1255
                __do_sys_sendfile64 fs/read_write.c:1323 [inline]
                __se_sys_sendfile64 fs/read_write.c:1309 [inline]
                __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
                do_syscall_x64 arch/x86/entry/common.c:50 [inline]
                do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
                entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
         other info that might help us debug this:
      
         Chain exists of:
           &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
      
         Possible unsafe locking scenario:
      
                CPU0                    CPU1
                ----                    ----
           lock(btrfs-log-00);
                                        lock(btrfs-tree-00);
                                        lock(btrfs-log-00);
           lock(&delayed_node->mutex);
      
      Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
      lock dependency when we are COWing extent buffers for the log tree and we
      have two tasks modifying the log tree, with each one in one of the
      following 2 scenarios:
      
      1) Modifying the log tree triggers an extent buffer allocation while
         holding a write lock on a parent extent buffer from the log tree.
         Allocating the pages for an extent buffer, or the extent buffer
         struct, can trigger inode eviction and finally the inode eviction
         will trigger a release/remove of a delayed node, which requires
         taking the delayed node's mutex;
      
      2) Allocating a metadata extent for a log tree can trigger the async
         reclaim thread and make us wait for it to release enough space and
         unblock our reservation ticket. The reclaim thread can start flushing
         delayed items, and that in turn results in the need to lock delayed
         node mutexes and in the need to write lock extent buffers of a
         subvolume tree - all this while holding a write lock on the parent
         extent buffer in the log tree.
      
      So one task in scenario 1) running in parallel with another task in
      scenario 2) could lead to a deadlock, one wanting to lock a delayed node
      mutex while having a read lock on a leaf from the subvolume, while the
      other is holding the delayed node's mutex and wants to write lock the same
      subvolume leaf for flushing delayed items.
      
      Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
      fs/subvolume leaf and use the clone leaf instead.
      
      Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
      CC: stable@vger.kernel.org # 6.0+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      796787c9
    • Christoph Hellwig's avatar
      btrfs: use kvcalloc in btrfs_get_dev_zone_info · 8fe97d47
      Christoph Hellwig authored
      Otherwise the kernel memory allocator seems to be unhappy about failing
      order 6 allocations for the zones array, that cause 100% reproducible
      mount failures in my qemu setup:
      
        [26.078981] mount: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null)
        [26.079741] CPU: 0 PID: 2965 Comm: mount Not tainted 6.1.0-rc5+ #185
        [26.080181] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [26.080950] Call Trace:
        [26.081132]  <TASK>
        [26.081291]  dump_stack_lvl+0x56/0x6f
        [26.081554]  warn_alloc+0x117/0x140
        [26.081808]  ? __alloc_pages_direct_compact+0x1b5/0x300
        [26.082174]  __alloc_pages_slowpath.constprop.0+0xd0e/0xde0
        [26.082569]  __alloc_pages+0x32a/0x340
        [26.082836]  __kmalloc_large_node+0x4d/0xa0
        [26.083133]  ? trace_kmalloc+0x29/0xd0
        [26.083399]  kmalloc_large+0x14/0x60
        [26.083654]  btrfs_get_dev_zone_info+0x1b9/0xc00
        [26.083980]  ? _raw_spin_unlock_irqrestore+0x28/0x50
        [26.084328]  btrfs_get_dev_zone_info_all_devices+0x54/0x80
        [26.084708]  open_ctree+0xed4/0x1654
        [26.084974]  btrfs_mount_root.cold+0x12/0xde
        [26.085288]  ? lock_is_held_type+0xe2/0x140
        [26.085603]  legacy_get_tree+0x28/0x50
        [26.085876]  vfs_get_tree+0x1d/0xb0
        [26.086139]  vfs_kern_mount.part.0+0x6c/0xb0
        [26.086456]  btrfs_mount+0x118/0x3a0
        [26.086728]  ? lock_is_held_type+0xe2/0x140
        [26.087043]  legacy_get_tree+0x28/0x50
        [26.087323]  vfs_get_tree+0x1d/0xb0
        [26.087587]  path_mount+0x2ba/0xbe0
        [26.087850]  ? _raw_spin_unlock_irqrestore+0x38/0x50
        [26.088217]  __x64_sys_mount+0xfe/0x140
        [26.088506]  do_syscall_64+0x35/0x80
        [26.088776]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 5b316468 ("btrfs: get zone information of zoned block devices")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8fe97d47
  2. 21 Nov, 2022 3 commits
    • ChenXiaoSong's avatar
      btrfs: qgroup: fix sleep from invalid context bug in btrfs_qgroup_inherit() · f7e942b5
      ChenXiaoSong authored
      Syzkaller reported BUG as follows:
      
        BUG: sleeping function called from invalid context at
             include/linux/sched/mm.h:274
        Call Trace:
         <TASK>
         dump_stack_lvl+0xcd/0x134
         __might_resched.cold+0x222/0x26b
         kmem_cache_alloc+0x2e7/0x3c0
         update_qgroup_limit_item+0xe1/0x390
         btrfs_qgroup_inherit+0x147b/0x1ee0
         create_subvol+0x4eb/0x1710
         btrfs_mksubvol+0xfe5/0x13f0
         __btrfs_ioctl_snap_create+0x2b0/0x430
         btrfs_ioctl_snap_create_v2+0x25a/0x520
         btrfs_ioctl+0x2a1c/0x5ce0
         __x64_sys_ioctl+0x193/0x200
         do_syscall_64+0x35/0x80
      
      Fix this by calling qgroup_dirty() on @dstqgroup, and update limit item in
      btrfs_run_qgroups() later outside of the spinlock context.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChenXiaoSong <chenxiaosong2@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f7e942b5
    • Filipe Manana's avatar
      btrfs: send: avoid unaligned encoded writes when attempting to clone range · a11452a3
      Filipe Manana authored
      When trying to see if we can clone a file range, there are cases where we
      end up sending two write operations in case the inode from the source root
      has an i_size that is not sector size aligned and the length from the
      current offset to its i_size is less than the remaining length we are
      trying to clone.
      
      Issuing two write operations when we could instead issue a single write
      operation is not incorrect. However it is not optimal, specially if the
      extents are compressed and the flag BTRFS_SEND_FLAG_COMPRESSED was passed
      to the send ioctl. In that case we can end up sending an encoded write
      with an offset that is not sector size aligned, which makes the receiver
      fallback to decompressing the data and writing it using regular buffered
      IO (so re-compressing the data in case the fs is mounted with compression
      enabled), because encoded writes fail with -EINVAL when an offset is not
      sector size aligned.
      
      The following example, which triggered a bug in the receiver code for the
      fallback logic of decompressing + regular buffer IO and is fixed by the
      patchset referred in a Link at the bottom of this changelog, is an example
      where we have the non-optimal behaviour due to an unaligned encoded write:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV > /dev/null
         mount -o compress $DEV $MNT
      
         # File foo has a size of 33K, not aligned to the sector size.
         xfs_io -f -c "pwrite -S 0xab 0 33K" $MNT/foo
      
         xfs_io -f -c "pwrite -S 0xcd 0 64K" $MNT/bar
      
         # Now clone the first 32K of file bar into foo at offset 0.
         xfs_io -c "reflink $MNT/bar 0 0 32K" $MNT/foo
      
         # Snapshot the default subvolume and create a full send stream (v2).
         btrfs subvolume snapshot -r $MNT $MNT/snap
      
         btrfs send --compressed-data -f /tmp/test.send $MNT/snap
      
         echo -e "\nFile bar in the original filesystem:"
         od -A d -t x1 $MNT/snap/bar
      
         umount $MNT
         mkfs.btrfs -f $DEV > /dev/null
         mount $DEV $MNT
      
         echo -e "\nReceiving stream in a new filesystem..."
         btrfs receive -f /tmp/test.send $MNT
      
         echo -e "\nFile bar in the new filesystem:"
         od -A d -t x1 $MNT/snap/bar
      
         umount $MNT
      
      Before this patch, the send stream included one regular write and one
      encoded write for file 'bar', with the later being not sector size aligned
      and causing the receiver to fallback to decompression + buffered writes.
      The output of the btrfs receive command in verbose mode (-vvv):
      
         (...)
         mkfile o258-7-0
         rename o258-7-0 -> bar
         utimes
         clone bar - source=foo source offset=0 offset=0 length=32768
         write bar - offset=32768 length=1024
         encoded_write bar - offset=33792, len=4096, unencoded_offset=33792, unencoded_file_len=31744, unencoded_len=65536, compression=1, encryption=0
         encoded_write bar - falling back to decompress and write due to errno 22 ("Invalid argument")
         (...)
      
      This patch avoids the regular write followed by an unaligned encoded write
      so that we end up sending a single encoded write that is aligned. So after
      this patch the stream content is (output of btrfs receive -vvv):
      
         (...)
         mkfile o258-7-0
         rename o258-7-0 -> bar
         utimes
         clone bar - source=foo source offset=0 offset=0 length=32768
         encoded_write bar - offset=32768, len=4096, unencoded_offset=32768, unencoded_file_len=32768, unencoded_len=65536, compression=1, encryption=0
         (...)
      
      So we get more optimal behaviour and avoid the silent data loss bug in
      versions of btrfs-progs affected by the bug referred by the Link tag
      below (btrfs-progs v5.19, v5.19.1, v6.0 and v6.0.1).
      
      Link: https://lore.kernel.org/linux-btrfs/cover.1668529099.git.fdmanana@suse.com/Reviewed-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a11452a3
    • Christoph Hellwig's avatar
      btrfs: zoned: fix missing endianness conversion in sb_write_pointer · c51f0e6a
      Christoph Hellwig authored
      generation is an on-disk __le64 value, so use btrfs_super_generation to
      convert it to host endian before comparing it.
      
      Fixes: 12659251 ("btrfs: implement log-structured superblock for ZONED mode")
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c51f0e6a
  3. 15 Nov, 2022 5 commits
    • Anand Jain's avatar
      btrfs: free btrfs_path before copying subvol info to userspace · 013c1c55
      Anand Jain authored
      btrfs_ioctl_get_subvol_info() frees the search path after the userspace
      copy from the temp buffer @subvol_info. This can lead to a lock splat
      warning.
      
      Fix this by freeing the path before we copy it to userspace.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      013c1c55
    • Anand Jain's avatar
      btrfs: free btrfs_path before copying fspath to userspace · 8cf96b40
      Anand Jain authored
      btrfs_ioctl_ino_to_path() frees the search path after the userspace copy
      from the temp buffer @ipath->fspath. Which potentially can lead to a lock
      splat warning.
      
      Fix this by freeing the path before we copy it to userspace.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8cf96b40
    • Anand Jain's avatar
      btrfs: free btrfs_path before copying inodes to userspace · 418ffb9e
      Anand Jain authored
      btrfs_ioctl_logical_to_ino() frees the search path after the userspace
      copy from the temp buffer @inodes. Which potentially can lead to a lock
      splat.
      
      Fix this by freeing the path before we copy @inodes to userspace.
      
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarAnand Jain <anand.jain@oracle.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      418ffb9e
    • Josef Bacik's avatar
      btrfs: free btrfs_path before copying root refs to userspace · b740d806
      Josef Bacik authored
      Syzbot reported the following lockdep splat
      
      ======================================================
      WARNING: possible circular locking dependency detected
      6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Not tainted
      ------------------------------------------------------
      syz-executor307/3029 is trying to acquire lock:
      ffff0000c02525d8 (&mm->mmap_lock){++++}-{3:3}, at: __might_fault+0x54/0xb4 mm/memory.c:5576
      
      but task is already holding lock:
      ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline]
      ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline]
      ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #3 (btrfs-root-00){++++}-{3:3}:
             down_read_nested+0x64/0x84 kernel/locking/rwsem.c:1624
             __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline]
             btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline]
             btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279
             btrfs_search_slot_get_root+0x74/0x338 fs/btrfs/ctree.c:1637
             btrfs_search_slot+0x1b0/0xfd8 fs/btrfs/ctree.c:1944
             btrfs_update_root+0x6c/0x5a0 fs/btrfs/root-tree.c:132
             commit_fs_roots+0x1f0/0x33c fs/btrfs/transaction.c:1459
             btrfs_commit_transaction+0x89c/0x12d8 fs/btrfs/transaction.c:2343
             flush_space+0x66c/0x738 fs/btrfs/space-info.c:786
             btrfs_async_reclaim_metadata_space+0x43c/0x4e0 fs/btrfs/space-info.c:1059
             process_one_work+0x2d8/0x504 kernel/workqueue.c:2289
             worker_thread+0x340/0x610 kernel/workqueue.c:2436
             kthread+0x12c/0x158 kernel/kthread.c:376
             ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860
      
      -> #2 (&fs_info->reloc_mutex){+.+.}-{3:3}:
             __mutex_lock_common+0xd4/0xca8 kernel/locking/mutex.c:603
             __mutex_lock kernel/locking/mutex.c:747 [inline]
             mutex_lock_nested+0x38/0x44 kernel/locking/mutex.c:799
             btrfs_record_root_in_trans fs/btrfs/transaction.c:516 [inline]
             start_transaction+0x248/0x944 fs/btrfs/transaction.c:752
             btrfs_start_transaction+0x34/0x44 fs/btrfs/transaction.c:781
             btrfs_create_common+0xf0/0x1b4 fs/btrfs/inode.c:6651
             btrfs_create+0x8c/0xb0 fs/btrfs/inode.c:6697
             lookup_open fs/namei.c:3413 [inline]
             open_last_lookups fs/namei.c:3481 [inline]
             path_openat+0x804/0x11c4 fs/namei.c:3688
             do_filp_open+0xdc/0x1b8 fs/namei.c:3718
             do_sys_openat2+0xb8/0x22c fs/open.c:1313
             do_sys_open fs/open.c:1329 [inline]
             __do_sys_openat fs/open.c:1345 [inline]
             __se_sys_openat fs/open.c:1340 [inline]
             __arm64_sys_openat+0xb0/0xe0 fs/open.c:1340
             __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
             invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
             el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
             do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
             el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
             el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
             el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
      
      -> #1 (sb_internal#2){.+.+}-{0:0}:
             percpu_down_read include/linux/percpu-rwsem.h:51 [inline]
             __sb_start_write include/linux/fs.h:1826 [inline]
             sb_start_intwrite include/linux/fs.h:1948 [inline]
             start_transaction+0x360/0x944 fs/btrfs/transaction.c:683
             btrfs_join_transaction+0x30/0x40 fs/btrfs/transaction.c:795
             btrfs_dirty_inode+0x50/0x140 fs/btrfs/inode.c:6103
             btrfs_update_time+0x1c0/0x1e8 fs/btrfs/inode.c:6145
             inode_update_time fs/inode.c:1872 [inline]
             touch_atime+0x1f0/0x4a8 fs/inode.c:1945
             file_accessed include/linux/fs.h:2516 [inline]
             btrfs_file_mmap+0x50/0x88 fs/btrfs/file.c:2407
             call_mmap include/linux/fs.h:2192 [inline]
             mmap_region+0x7fc/0xc14 mm/mmap.c:1752
             do_mmap+0x644/0x97c mm/mmap.c:1540
             vm_mmap_pgoff+0xe8/0x1d0 mm/util.c:552
             ksys_mmap_pgoff+0x1cc/0x278 mm/mmap.c:1586
             __do_sys_mmap arch/arm64/kernel/sys.c:28 [inline]
             __se_sys_mmap arch/arm64/kernel/sys.c:21 [inline]
             __arm64_sys_mmap+0x58/0x6c arch/arm64/kernel/sys.c:21
             __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
             invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
             el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
             do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
             el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
             el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
             el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
      
      -> #0 (&mm->mmap_lock){++++}-{3:3}:
             check_prev_add kernel/locking/lockdep.c:3095 [inline]
             check_prevs_add kernel/locking/lockdep.c:3214 [inline]
             validate_chain kernel/locking/lockdep.c:3829 [inline]
             __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053
             lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666
             __might_fault+0x7c/0xb4 mm/memory.c:5577
             _copy_to_user include/linux/uaccess.h:134 [inline]
             copy_to_user include/linux/uaccess.h:160 [inline]
             btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203
             btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556
             vfs_ioctl fs/ioctl.c:51 [inline]
             __do_sys_ioctl fs/ioctl.c:870 [inline]
             __se_sys_ioctl fs/ioctl.c:856 [inline]
             __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856
             __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
             invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
             el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
             do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
             el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
             el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
             el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
      
      other info that might help us debug this:
      
      Chain exists of:
        &mm->mmap_lock --> &fs_info->reloc_mutex --> btrfs-root-00
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(btrfs-root-00);
                                     lock(&fs_info->reloc_mutex);
                                     lock(btrfs-root-00);
        lock(&mm->mmap_lock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor307/3029:
       #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock fs/btrfs/locking.c:134 [inline]
       #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_tree_read_lock fs/btrfs/locking.c:140 [inline]
       #0: ffff0000c958a608 (btrfs-root-00){++++}-{3:3}, at: btrfs_read_lock_root_node+0x13c/0x1c0 fs/btrfs/locking.c:279
      
      stack backtrace:
      CPU: 0 PID: 3029 Comm: syz-executor307 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022
      Call trace:
       dump_backtrace+0x1c4/0x1f0 arch/arm64/kernel/stacktrace.c:156
       show_stack+0x2c/0x54 arch/arm64/kernel/stacktrace.c:163
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x104/0x16c lib/dump_stack.c:106
       dump_stack+0x1c/0x58 lib/dump_stack.c:113
       print_circular_bug+0x2c4/0x2c8 kernel/locking/lockdep.c:2053
       check_noncircular+0x14c/0x154 kernel/locking/lockdep.c:2175
       check_prev_add kernel/locking/lockdep.c:3095 [inline]
       check_prevs_add kernel/locking/lockdep.c:3214 [inline]
       validate_chain kernel/locking/lockdep.c:3829 [inline]
       __lock_acquire+0x1530/0x30a4 kernel/locking/lockdep.c:5053
       lock_acquire+0x100/0x1f8 kernel/locking/lockdep.c:5666
       __might_fault+0x7c/0xb4 mm/memory.c:5577
       _copy_to_user include/linux/uaccess.h:134 [inline]
       copy_to_user include/linux/uaccess.h:160 [inline]
       btrfs_ioctl_get_subvol_rootref+0x3a8/0x4bc fs/btrfs/ioctl.c:3203
       btrfs_ioctl+0xa08/0xa64 fs/btrfs/ioctl.c:5556
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:870 [inline]
       __se_sys_ioctl fs/ioctl.c:856 [inline]
       __arm64_sys_ioctl+0xd0/0x140 fs/ioctl.c:856
       __invoke_syscall arch/arm64/kernel/syscall.c:38 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:52 [inline]
       el0_svc_common+0x138/0x220 arch/arm64/kernel/syscall.c:142
       do_el0_svc+0x48/0x164 arch/arm64/kernel/syscall.c:206
       el0_svc+0x58/0x150 arch/arm64/kernel/entry-common.c:636
       el0t_64_sync_handler+0x84/0xf0 arch/arm64/kernel/entry-common.c:654
       el0t_64_sync+0x18c/0x190 arch/arm64/kernel/entry.S:581
      
      We do generally the right thing here, copying the references into a
      temporary buffer, however we are still holding the path when we do
      copy_to_user from the temporary buffer.  Fix this by freeing the path
      before we copy to user space.
      
      Reported-by: syzbot+4ef9e52e464c6ff47d9d@syzkaller.appspotmail.com
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b740d806
    • Filipe Manana's avatar
      btrfs: fix assertion failure and blocking during nowait buffered write · bdcdd86c
      Filipe Manana authored
      When doing a nowait buffered write we can trigger the following assertion:
      
      [11138.437027] assertion failed: !path->nowait, in fs/btrfs/ctree.c:4658
      [11138.438251] ------------[ cut here ]------------
      [11138.438254] kernel BUG at fs/btrfs/messages.c:259!
      [11138.438762] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
      [11138.439450] CPU: 4 PID: 1091021 Comm: fsstress Not tainted 6.1.0-rc4-btrfs-next-128 #1
      [11138.440611] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [11138.442553] RIP: 0010:btrfs_assertfail+0x19/0x1b [btrfs]
      [11138.443583] Code: 5b 41 5a 41 (...)
      [11138.446437] RSP: 0018:ffffbaf0cf05b840 EFLAGS: 00010246
      [11138.447235] RAX: 0000000000000039 RBX: ffffbaf0cf05b938 RCX: 0000000000000000
      [11138.448303] RDX: 0000000000000000 RSI: ffffffffb2ef59f6 RDI: 00000000ffffffff
      [11138.449370] RBP: ffff9165f581eb68 R08: 00000000ffffffff R09: 0000000000000001
      [11138.450493] R10: ffff9167a88421f8 R11: 0000000000000000 R12: ffff9164981b1000
      [11138.451661] R13: 000000008c8f1000 R14: ffff9164991d4000 R15: ffff9164981b1000
      [11138.452225] FS:  00007f1438a66440(0000) GS:ffff9167ad600000(0000) knlGS:0000000000000000
      [11138.452949] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [11138.453394] CR2: 00007f1438a64000 CR3: 0000000100c36002 CR4: 0000000000370ee0
      [11138.454057] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [11138.454879] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [11138.455779] Call Trace:
      [11138.456211]  <TASK>
      [11138.456598]  btrfs_next_old_leaf.cold+0x18/0x1d [btrfs]
      [11138.457827]  ? kmem_cache_alloc+0x18d/0x2a0
      [11138.458516]  btrfs_lookup_csums_range+0x149/0x4d0 [btrfs]
      [11138.459407]  csum_exist_in_range+0x56/0x110 [btrfs]
      [11138.460271]  can_nocow_file_extent+0x27c/0x310 [btrfs]
      [11138.461155]  can_nocow_extent+0x1ec/0x2e0 [btrfs]
      [11138.461672]  btrfs_check_nocow_lock+0x114/0x1c0 [btrfs]
      [11138.462951]  btrfs_buffered_write+0x44c/0x8e0 [btrfs]
      [11138.463482]  btrfs_do_write_iter+0x42b/0x5f0 [btrfs]
      [11138.463982]  ? lock_release+0x153/0x4a0
      [11138.464347]  io_write+0x11b/0x570
      [11138.464660]  ? lock_release+0x153/0x4a0
      [11138.465213]  ? lock_is_held_type+0xe8/0x140
      [11138.466003]  io_issue_sqe+0x63/0x4a0
      [11138.466339]  io_submit_sqes+0x238/0x770
      [11138.466741]  __do_sys_io_uring_enter+0x37b/0xb10
      [11138.467206]  ? lock_is_held_type+0xe8/0x140
      [11138.467879]  ? syscall_enter_from_user_mode+0x1d/0x50
      [11138.468688]  do_syscall_64+0x38/0x90
      [11138.469265]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [11138.470017] RIP: 0033:0x7f1438c539e6
      
      This is because to check if we can NOCOW, we check that if we can NOCOW
      into an extent (it's prealloc extent or the inode has NOCOW attribute),
      and then check if there are csums for the extent's range in the csum tree.
      The search may leave us beyond the last slot of a leaf, and then when
      we call btrfs_next_leaf() we end up at btrfs_next_old_leaf() with a
      time_seq of 0.
      
      This triggers a failure of the first assertion at btrfs_next_old_leaf(),
      since we have a nowait path. With assertions disabled, we simply don't
      respect the NOWAIT semantics, allowing the write to block on locks or
      blocking on IO for reading an extent buffer from disk.
      
      Fix this by:
      
      1) Triggering the assertion only if time_seq is not 0, which means that
         search is being done by a tree mod log user, and in the buffered and
         direct IO write paths we don't use the tree mod log;
      
      2) Implementing NOWAIT semantics at btrfs_next_old_leaf(). Any failure to
         lock an extent buffer should return immediately and not retry the
         search, as well as if we need to do IO to read an extent buffer from
         disk.
      
      Fixes: c922b016 ("btrfs: assert nowait mode is not used for some btree search functions")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      bdcdd86c
  4. 07 Nov, 2022 7 commits
    • Johannes Thumshirn's avatar
      btrfs: zoned: fix locking imbalance on scrub · c62f6bec
      Johannes Thumshirn authored
      If we're doing device replace on a zoned filesystem and discover in
      scrub_enumerate_chunks() that we don't have to copy the block group it is
      unlocked before it gets skipped.
      
      But as the block group hasn't yet been locked before it leads to a locking
      imbalance. To fix this simply remove the unlock.
      
      This was uncovered by fstests' testcase btrfs/163.
      
      Fixes: 9283b9e0 ("btrfs: remove lock protection for BLOCK_GROUP_FLAG_TO_COPY")
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      c62f6bec
    • Johannes Thumshirn's avatar
      btrfs: zoned: initialize device's zone info for seeding · a8d1b164
      Johannes Thumshirn authored
      When performing seeding on a zoned filesystem it is necessary to
      initialize each zoned device's btrfs_zoned_device_info structure,
      otherwise mounting the filesystem will cause a NULL pointer dereference.
      
      This was uncovered by fstests' testcase btrfs/163.
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a8d1b164
    • Johannes Thumshirn's avatar
      btrfs: zoned: clone zoned device info when cloning a device · 21e61ec6
      Johannes Thumshirn authored
      When cloning a btrfs_device, we're not cloning the associated
      btrfs_zoned_device_info structure of the device in case of a zoned
      filesystem.
      
      Later on this leads to a NULL pointer dereference when accessing the
      device's zone_info for instance when setting a zone as active.
      
      This was uncovered by fstests' testcase btrfs/161.
      
      CC: stable@vger.kernel.org # 5.15+
      Signed-off-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      21e61ec6
    • Qu Wenruo's avatar
      Revert "btrfs: scrub: use larger block size for data extent scrub" · b75b51f8
      Qu Wenruo authored
      This reverts commit 786672e9.
      
      [BUG]
      Since commit 786672e9 ("btrfs: scrub: use larger block size for data
      extent scrub"), btrfs scrub no longer reports errors if the corruption
      is not in the first sector of a STRIPE_LEN.
      
      The following script can expose the problem:
      
        mkfs.btrfs -f $dev
        mount $dev $mnt
        xfs_io -f -c "pwrite -S 0xff 0 8k" $mnt/foobar
        umount $mnt
      
        # 13631488 is the logical bytenr of above 8K extent
        btrfs-map-logical -l 13631488 -b 4096 $dev
        mirror 1 logical 13631488 physical 13631488 device /dev/test/scratch1
      
        # Corrupt the 2nd sector of that extent
        xfs_io -f -c "pwrite -S 0x00 13635584 4k" $dev
      
        mount $dev $mnt
        btrfs scrub start -B $mnt
        scrub done for 54e63f9f-0c30-4c84-a33b-5c56014629b7
        Scrub started:    Mon Nov  7 07:18:27 2022
        Status:           finished
        Duration:         0:00:00
        Total to scrub:   536.00MiB
        Rate:             0.00B/s
        Error summary:    no errors found <<<
      
      [CAUSE]
      That offending commit enlarges the data extent scrub size from sector
      size to BTRFS_STRIPE_LEN, to avoid extra scrub_block to be allocated.
      
      But unfortunately the data extent scrub is still heavily relying on the
      fact that there is only one scrub_sector per scrub_block.
      
      Thus it will only check the first sector, and ignoring the remaining
      sectors.
      
      Furthermore the error reporting is not able to handle multiple sectors
      either.
      
      [FIX]
      For now just revert the offending commit.
      
      The consequence is just extra memory usage during scrub.
      We will need a proper change to make the remaining data scrub path to
      handle multiple sectors before we enlarging the data scrub size.
      Reported-by: default avatarLi Zhang <zhanglikernel@gmail.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b75b51f8
    • David Sterba's avatar
      btrfs: don't print stack trace when transaction is aborted due to ENOMEM · 8bb808c6
      David Sterba authored
      Add ENOMEM among the error codes that don't print stack trace on
      transaction abort. We've got several reports from syzbot that detects
      stacks as errors but caused by limiting memory. As this is an artificial
      condition we don't need to know where exactly the error happens, the
      abort and error cleanup will continue like e.g. for EIO.
      
      As the transaction aborts code needs to be inline in a lot of code, the
      implementation cases about minimal bloat. The error codes are in a
      separate function and the WARN uses the condition directly. This
      increases the code size by 571 bytes on release build.
      
      Alternatives considered: add -ENOMEM among the errors, this increases
      size by 2340 bytes, various attempts to combine the WARN and helper
      calls, increase by 700 or more bytes.
      
      Example syzbot reports (error -12):
      
      - https://syzkaller.appspot.com/bug?extid=5244d35be7f589cf093e
      - https://syzkaller.appspot.com/bug?extid=9c37714c07194d816417Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8bb808c6
    • Zhang Xiaoxu's avatar
      btrfs: selftests: fix wrong error check in btrfs_free_dummy_root() · 9b2f2034
      Zhang Xiaoxu authored
      The btrfs_alloc_dummy_root() uses ERR_PTR as the error return value
      rather than NULL, if error happened, there will be a NULL pointer
      dereference:
      
        BUG: KASAN: null-ptr-deref in btrfs_free_dummy_root+0x21/0x50 [btrfs]
        Read of size 8 at addr 000000000000002c by task insmod/258926
      
        CPU: 2 PID: 258926 Comm: insmod Tainted: G        W          6.1.0-rc2+ #5
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33 04/01/2014
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         kasan_report+0xb7/0x140
         kasan_check_range+0x145/0x1a0
         btrfs_free_dummy_root+0x21/0x50 [btrfs]
         btrfs_test_free_space_cache+0x1a8c/0x1add [btrfs]
         btrfs_run_sanity_tests+0x65/0x80 [btrfs]
         init_btrfs_fs+0xec/0x154 [btrfs]
         do_one_initcall+0x87/0x2a0
         do_init_module+0xdf/0x320
         load_module+0x3006/0x3390
         __do_sys_finit_module+0x113/0x1b0
         do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: aaedb55b ("Btrfs: add tests for btrfs_get_extent")
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarAnand Jain <anand.jain@oracle.com>
      Signed-off-by: default avatarZhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b2f2034
    • Liu Shixin's avatar
      btrfs: fix match incorrectly in dev_args_match_device · 0fca385d
      Liu Shixin authored
      syzkaller found a failed assertion:
      
        assertion failed: (args->devid != (u64)-1) || args->missing, in fs/btrfs/volumes.c:6921
      
      This can be triggered when we set devid to (u64)-1 by ioctl. In this
      case, the match of devid will be skipped and the match of device may
      succeed incorrectly.
      
      Patch 562d7b15 introduced this function which is used to match device.
      This function contains two matching scenarios, we can distinguish them by
      checking the value of args->missing rather than check whether args->devid
      and args->uuid is default value.
      
      Reported-by: syzbot+031687116258450f9853@syzkaller.appspotmail.com
      Fixes: 562d7b15 ("btrfs: handle device lookup with btrfs_dev_lookup_args")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarLiu Shixin <liushixin2@huawei.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0fca385d
  5. 02 Nov, 2022 6 commits
    • Filipe Manana's avatar
      btrfs: fix inode reserve space leak due to nowait buffered write · eb81b682
      Filipe Manana authored
      During a nowait buffered write, if we fail to balance dirty pages we exit
      btrfs_buffered_write() without releasing the delalloc space reserved for
      an extent, resulting in leaking space from the inode's block reserve.
      
      So fix that by releasing the delalloc space for the extent when balancing
      dirty pages fails.
      Reported-by: default avatarkernel test robot <yujie.liu@intel.com>
      Link: https://lore.kernel.org/all/202210111304.d369bc32-yujie.liu@intel.com
      Fixes: 965f47ae ("btrfs: make btrfs_buffered_write nowait compatible")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      eb81b682
    • Filipe Manana's avatar
      btrfs: fix nowait buffered write returning -ENOSPC · a348c8d4
      Filipe Manana authored
      If we are doing a buffered write in NOWAIT context and we can't reserve
      metadata space due to -ENOSPC, then we should return -EAGAIN so that we
      retry the write in a context allowed to block and do metadata reservation
      with flushing, which might succeed this time due to the allowed flushing.
      
      Returning -ENOSPC while in NOWAIT context simply makes some writes fail
      with -ENOSPC when they would likely succeed after switching from NOWAIT
      context to blocking context. That is unexpected behaviour and even fio
      complains about it with a warning like this:
      
        fio: io_u error on file /mnt/sdi/task_0.0.0: No space left on device: write offset=1535705088, buflen=65536
        fio: pid=592630, err=28/file:io_u.c:1846, func=io_u error, error=No space left on device
      
      The fio's job config is this:
      
         [global]
         bs=64K
         ioengine=io_uring
         iodepth=1
         size=2236962133
         nr_files=1
         filesize=2236962133
         direct=0
         runtime=10
         fallocate=posix
         io_size=2236962133
         group_reporting
         time_based
      
         [task_0]
         rw=randwrite
         directory=/mnt/sdi
         numjobs=4
      
      So fix this by returning -EAGAIN if we are in NOWAIT context and the
      metadata reservation failed with -ENOSPC.
      
      Fixes: 304e45ac ("btrfs: plumb NOWAIT through the write path")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      a348c8d4
    • Filipe Manana's avatar
      btrfs: remove pointless and double ulist frees in error paths of qgroup tests · d0ea17ae
      Filipe Manana authored
      Several places in the qgroup self tests follow the pattern of freeing the
      ulist pointer they passed to btrfs_find_all_roots() if the call to that
      function returned an error. That is pointless because that function always
      frees the ulist in case it returns an error.
      
      Also In some places like at test_multiple_refs(), after a call to
      btrfs_qgroup_account_extent() we also leave "old_roots" and "new_roots"
      pointing to ulists that were freed, because btrfs_qgroup_account_extent()
      has freed those ulists, and if after that the next call to
      btrfs_find_all_roots() fails, we call ulist_free() on the "old_roots"
      ulist again, resulting in a double free.
      
      So remove those calls to reduce the code size and avoid double ulist
      free in case of an error.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d0ea17ae
    • Filipe Manana's avatar
      btrfs: fix ulist leaks in error paths of qgroup self tests · d37de92b
      Filipe Manana authored
      In the test_no_shared_qgroup() and test_multiple_refs() qgroup self tests,
      if we fail to add the tree ref, remove the extent item or remove the
      extent ref, we are returning from the test function without freeing the
      "old_roots" ulist that was allocated by the previous calls to
      btrfs_find_all_roots(). Fix that by calling ulist_free() before returning.
      
      Fixes: 442244c9 ("btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism.")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d37de92b
    • Filipe Manana's avatar
      btrfs: fix inode list leak during backref walking at find_parent_nodes() · 92876eec
      Filipe Manana authored
      During backref walking, at find_parent_nodes(), if we are dealing with a
      data extent and we get an error while resolving the indirect backrefs, at
      resolve_indirect_refs(), or in the while loop that iterates over the refs
      in the direct refs rbtree, we end up leaking the inode lists attached to
      the direct refs we have in the direct refs rbtree that were not yet added
      to the refs ulist passed as argument to find_parent_nodes(). Since they
      were not yet added to the refs ulist and prelim_release() does not free
      the lists, on error the caller can only free the lists attached to the
      refs that were added to the refs ulist, all the remaining refs get their
      inode lists never freed, therefore leaking their memory.
      
      Fix this by having prelim_release() always free any attached inode list
      to each ref found in the rbtree, and have find_parent_nodes() set the
      ref's inode list to NULL once it transfers ownership of the inode list
      to a ref added to the refs ulist passed to find_parent_nodes().
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      92876eec
    • Filipe Manana's avatar
      btrfs: fix inode list leak during backref walking at resolve_indirect_refs() · 5614dc3a
      Filipe Manana authored
      During backref walking, at resolve_indirect_refs(), if we get an error
      we jump to the 'out' label and call ulist_free() on the 'parents' ulist,
      which frees all the elements in the ulist - however that does not free
      any inode lists that may be attached to elements, through the 'aux' field
      of a ulist node, so we end up leaking lists if we have any attached to
      the unodes.
      
      Fix this by calling free_leaf_list() instead of ulist_free() when we exit
      from resolve_indirect_refs(). The static function free_leaf_list() is
      moved up for this to be possible and it's slightly simplified by removing
      unnecessary code.
      
      Fixes: 3301958b ("Btrfs: add inodes before dropping the extent lock in find_all_leafs")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5614dc3a
  6. 31 Oct, 2022 2 commits
  7. 25 Oct, 2022 1 commit
    • Qu Wenruo's avatar
      btrfs: don't use btrfs_chunk::sub_stripes from disk · 76a66ba1
      Qu Wenruo authored
      [BUG]
      There are two reports (the earliest one from LKP, a more recent one from
      kernel bugzilla) that we can have some chunks with 0 as sub_stripes.
      
      This will cause divide-by-zero errors at btrfs_rmap_block, which is
      introduced by a recent kernel patch ac067734 ("btrfs: merge
      calculations for simple striped profiles in btrfs_rmap_block"):
      
      		if (map->type & (BTRFS_BLOCK_GROUP_RAID0 |
      				 BTRFS_BLOCK_GROUP_RAID10)) {
      			stripe_nr = stripe_nr * map->num_stripes + i;
      			stripe_nr = div_u64(stripe_nr, map->sub_stripes); <<<
      		}
      
      [CAUSE]
      From the more recent report, it has been proven that we have some chunks
      with 0 as sub_stripes, mostly caused by older mkfs.
      
      It turns out that the mkfs.btrfs fix is only introduced in 6718ab4d33aa
      ("btrfs-progs: Initialize sub_stripes to 1 in btrfs_alloc_data_chunk")
      which is included in v5.4 btrfs-progs release.
      
      So there would be quite some old filesystems with such 0 sub_stripes.
      
      [FIX]
      Just don't trust the sub_stripes values from disk.
      
      We have a trusted btrfs_raid_array[] to fetch the correct sub_stripes
      numbers for each profile and that are fixed.
      
      By this, we can keep the compatibility with older filesystems while
      still avoid divide-by-zero bugs.
      Reported-by: default avatarkernel test robot <oliver.sang@intel.com>
      Reported-by: default avatarViktor Kuzmin <kvaster@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216559
      Fixes: ac067734 ("btrfs: merge calculations for simple striped profiles in btrfs_rmap_block")
      CC: stable@vger.kernel.org # 6.0
      Reviewed-by: default avatarSu Yue <glass@fydeos.io>
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      76a66ba1
  8. 24 Oct, 2022 7 commits
    • David Sterba's avatar
      btrfs: fix type of parameter generation in btrfs_get_dentry · 2398091f
      David Sterba authored
      The type of parameter generation has been u32 since the beginning,
      however all callers pass a u64 generation, so unify the types to prevent
      potential loss.
      
      CC: stable@vger.kernel.org # 4.9+
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2398091f
    • BingJing Chang's avatar
      btrfs: send: fix send failure of a subcase of orphan inodes · 9b8be45f
      BingJing Chang authored
      Commit 9ed0a72e ("btrfs: send: fix failures when processing inodes with
      no links") tries to fix all incremental send cases of orphan inodes the
      send operation will meet. However, there's still a bug causing the corner
      subcase fails with a ENOENT error.
      
      Here's shortened steps of that subcase:
      
        $ btrfs subvolume create vol
        $ touch vol/foo
      
        $ btrfs subvolume snapshot -r vol snap1
        $ btrfs subvolume snapshot -r vol snap2
      
        # Turn the second snapshot to RW mode and delete the file while
        # holding an open file descriptor on it
        $ btrfs property set snap2 ro false
        $ exec 73<snap2/foo
        $ rm snap2/foo
      
        # Set the second snapshot back to RO mode and do an incremental send
        # with an unusal reverse order
        $ btrfs property set snap2 ro true
        $ btrfs send -p snap2 snap1 > /dev/null
        At subvol snap1
        ERROR: send ioctl failed with -2: No such file or directory
      
      It's subcase 3 of BTRFS_COMPARE_TREE_CHANGED in the commit 9ed0a72e
      ("btrfs: send: fix failures when processing inodes with no links"). And
      it's not a common case. We still have not met it in the real world.
      Theoretically, this case can happen in a batch cascading snapshot backup.
      In cascading backups, the receive operation in the middle may cause orphan
      inodes to appear because of the open file descriptors on the snapshot files
      during receiving. And if we don't do the batch snapshot backups in their
      creation order, then we can have an inode, which is an orphan in the parent
      snapshot but refers to a file in the send snapshot. Since an orphan inode
      has no paths, the send operation will fail with a ENOENT error if it
      tries to generate a path for it.
      
      In that patch, this subcase will be treated as an inode with a new
      generation. However, when the routine tries to delete the old paths in
      the parent snapshot, the function process_all_refs() doesn't check whether
      there are paths recorded or not before it calls the function
      process_recorded_refs(). And the function process_recorded_refs() try
      to get the first path in the parent snapshot in the beginning. Since it has
      no paths in the parent snapshot, the send operation fails.
      
      To fix this, we can easily put a link count check to avoid entering the
      deletion routine like what we do a link count check to avoid creating a
      new one. Moreover, we can assume that the function process_all_refs()
      can always collect references to process because we know it has a
      positive link count.
      
      Fixes: 9ed0a72e ("btrfs: send: fix failures when processing inodes with no links")
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarBingJing Chang <bingjingc@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9b8be45f
    • Qu Wenruo's avatar
      btrfs: make thaw time super block check to also verify checksum · 3d17adea
      Qu Wenruo authored
      Previous commit a05d3c91 ("btrfs: check superblock to ensure the fs
      was not modified at thaw time") only checks the content of the super
      block, but it doesn't really check if the on-disk super block has a
      matching checksum.
      
      This patch will add the checksum verification to thaw time superblock
      verification.
      
      This involves the following extra changes:
      
      - Export btrfs_check_super_csum()
        As we need to call it in super.c.
      
      - Change the argument list of btrfs_check_super_csum()
        Instead of passing a char *, directly pass struct btrfs_super_block *
        pointer.
      
      - Verify that our checksum type didn't change before checking the
        checksum value, like it's done at mount time
      
      Fixes: a05d3c91 ("btrfs: check superblock to ensure the fs was not modified at thaw time")
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d17adea
    • Josef Bacik's avatar
      btrfs: fix tree mod log mishandling of reallocated nodes · 968b7158
      Josef Bacik authored
      We have been seeing the following panic in production
      
        kernel BUG at fs/btrfs/tree-mod-log.c:677!
        invalid opcode: 0000 [#1] SMP
        RIP: 0010:tree_mod_log_rewind+0x1b4/0x200
        RSP: 0000:ffffc9002c02f890 EFLAGS: 00010293
        RAX: 0000000000000003 RBX: ffff8882b448c700 RCX: 0000000000000000
        RDX: 0000000000008000 RSI: 00000000000000a7 RDI: ffff88877d831c00
        RBP: 0000000000000002 R08: 000000000000009f R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000100c40 R12: 0000000000000001
        R13: ffff8886c26d6a00 R14: ffff88829f5424f8 R15: ffff88877d831a00
        FS:  00007fee1d80c780(0000) GS:ffff8890400c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007fee1963a020 CR3: 0000000434f33002 CR4: 00000000007706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        PKRU: 55555554
        Call Trace:
         btrfs_get_old_root+0x12b/0x420
         btrfs_search_old_slot+0x64/0x2f0
         ? tree_mod_log_oldest_root+0x3d/0xf0
         resolve_indirect_ref+0xfd/0x660
         ? ulist_alloc+0x31/0x60
         ? kmem_cache_alloc_trace+0x114/0x2c0
         find_parent_nodes+0x97a/0x17e0
         ? ulist_alloc+0x30/0x60
         btrfs_find_all_roots_safe+0x97/0x150
         iterate_extent_inodes+0x154/0x370
         ? btrfs_search_path_in_tree+0x240/0x240
         iterate_inodes_from_logical+0x98/0xd0
         ? btrfs_search_path_in_tree+0x240/0x240
         btrfs_ioctl_logical_to_ino+0xd9/0x180
         btrfs_ioctl+0xe2/0x2ec0
         ? __mod_memcg_lruvec_state+0x3d/0x280
         ? do_sys_openat2+0x6d/0x140
         ? kretprobe_dispatcher+0x47/0x70
         ? kretprobe_rethook_handler+0x38/0x50
         ? rethook_trampoline_handler+0x82/0x140
         ? arch_rethook_trampoline_callback+0x3b/0x50
         ? kmem_cache_free+0xfb/0x270
         ? do_sys_openat2+0xd5/0x140
         __x64_sys_ioctl+0x71/0xb0
         do_syscall_64+0x2d/0x40
      
      Which is this code in tree_mod_log_rewind()
      
      	switch (tm->op) {
              case BTRFS_MOD_LOG_KEY_REMOVE_WHILE_FREEING:
      		BUG_ON(tm->slot < n);
      
      This occurs because we replay the nodes in order that they happened, and
      when we do a REPLACE we will log a REMOVE_WHILE_FREEING for every slot,
      starting at 0.  'n' here is the number of items in this block, which in
      this case was 1, but we had 2 REMOVE_WHILE_FREEING operations.
      
      The actual root cause of this was that we were replaying operations for
      a block that shouldn't have been replayed.  Consider the following
      sequence of events
      
      1. We have an already modified root, and we do a btrfs_get_tree_mod_seq().
      2. We begin removing items from this root, triggering KEY_REPLACE for
         it's child slots.
      3. We remove one of the 2 children this root node points to, thus triggering
         the root node promotion of the remaining child, and freeing this node.
      4. We modify a new root, and re-allocate the above node to the root node of
         this other root.
      
      The tree mod log looks something like this
      
      	logical 0	op KEY_REPLACE (slot 1)			seq 2
      	logical 0	op KEY_REMOVE (slot 1)			seq 3
      	logical 0	op KEY_REMOVE_WHILE_FREEING (slot 0)	seq 4
      	logical 4096	op LOG_ROOT_REPLACE (old logical 0)	seq 5
      	logical 8192	op KEY_REMOVE_WHILE_FREEING (slot 1)	seq 6
      	logical 8192	op KEY_REMOVE_WHILE_FREEING (slot 0)	seq 7
      	logical 0	op LOG_ROOT_REPLACE (old logical 8192)	seq 8
      
      >From here the bug is triggered by the following steps
      
      1.  Call btrfs_get_old_root() on the new_root.
      2.  We call tree_mod_log_oldest_root(btrfs_root_node(new_root)), which is
          currently logical 0.
      3.  tree_mod_log_oldest_root() calls tree_mod_log_search_oldest(), which
          gives us the KEY_REPLACE seq 2, and since that's not a
          LOG_ROOT_REPLACE we incorrectly believe that we don't have an old
          root, because we expect that the most recent change should be a
          LOG_ROOT_REPLACE.
      4.  Back in tree_mod_log_oldest_root() we don't have a LOG_ROOT_REPLACE,
          so we don't set old_root, we simply use our existing extent buffer.
      5.  Since we're using our existing extent buffer (logical 0) we call
          tree_mod_log_search(0) in order to get the newest change to start the
          rewind from, which ends up being the LOG_ROOT_REPLACE at seq 8.
      6.  Again since we didn't find an old_root we simply clone logical 0 at
          it's current state.
      7.  We call tree_mod_log_rewind() with the cloned extent buffer.
      8.  Set n = btrfs_header_nritems(logical 0), which would be whatever the
          original nritems was when we COWed the original root, say for this
          example it's 2.
      9.  We start from the newest operation and work our way forward, so we
          see LOG_ROOT_REPLACE which we ignore.
      10. Next we see KEY_REMOVE_WHILE_FREEING for slot 0, which triggers the
          BUG_ON(tm->slot < n), because it expects if we've done this we have a
          completely empty extent buffer to replay completely.
      
      The correct thing would be to find the first LOG_ROOT_REPLACE, and then
      get the old_root set to logical 8192.  In fact making that change fixes
      this particular problem.
      
      However consider the much more complicated case.  We have a child node
      in this tree and the above situation.  In the above case we freed one
      of the child blocks at the seq 3 operation.  If this block was also
      re-allocated and got new tree mod log operations we would have a
      different problem.  btrfs_search_old_slot(orig root) would get down to
      the logical 0 root that still pointed at that node.  However in
      btrfs_search_old_slot() we call tree_mod_log_rewind(buf) directly.  This
      is not context aware enough to know which operations we should be
      replaying.  If the block was re-allocated multiple times we may only
      want to replay a range of operations, and determining what that range is
      isn't possible to determine.
      
      We could maybe solve this by keeping track of which root the node
      belonged to at every tree mod log operation, and then passing this
      around to make sure we're only replaying operations that relate to the
      root we're trying to rewind.
      
      However there's a simpler way to solve this problem, simply disallow
      reallocations if we have currently running tree mod log users.  We
      already do this for leaf's, so we're simply expanding this to nodes as
      well.  This is a relatively uncommon occurrence, and the problem is
      complicated enough I'm worried that we will still have corner cases in
      the reallocation case.  So fix this in the most straightforward way
      possible.
      
      Fixes: bd989ba3 ("Btrfs: add tree modification log functions")
      CC: stable@vger.kernel.org # 3.3+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      968b7158
    • David Sterba's avatar
      btrfs: reorder btrfs_bio for better packing · ae0e5df4
      David Sterba authored
      After changes in commit 917f32a2 ("btrfs: give struct btrfs_bio a
      real end_io handler") the layout of btrfs_bio can be improved.  There
      are two holes and the structure size is 264 bytes on release build. By
      reordering the iterator we can get rid of the holes and the size is 256
      bytes which fits to slabs much better.
      
      Final layout:
      
      struct btrfs_bio {
      	unsigned int               mirror_num;           /*     0     4 */
      	struct bvec_iter           iter;                 /*     4    20 */
      	u64                        file_offset;          /*    24     8 */
      	struct btrfs_device *      device;               /*    32     8 */
      	u8 *                       csum;                 /*    40     8 */
      	u8                         csum_inline[64];      /*    48    64 */
      	/* --- cacheline 1 boundary (64 bytes) was 48 bytes ago --- */
      	btrfs_bio_end_io_t         end_io;               /*   112     8 */
      	void *                     private;              /*   120     8 */
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	struct work_struct         end_io_work;          /*   128    32 */
      	struct bio                 bio;                  /*   160    96 */
      
      	/* size: 256, cachelines: 4, members: 10 */
      };
      
      Fixes: 917f32a2 ("btrfs: give struct btrfs_bio a real end_io handler")
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae0e5df4
    • Qu Wenruo's avatar
      btrfs: raid56: avoid double freeing for rbio if full_stripe_write() failed · ab4c54c6
      Qu Wenruo authored
      Currently if full_stripe_write() failed to allocate the pages for
      parity, it will call __free_raid_bio() first, then return -ENOMEM.
      
      But some caller of full_stripe_write() will also call __free_raid_bio()
      again, this would cause double freeing.
      
      And it's not a logically sound either, normally we should either free
      the memory at the same level where we allocated it, or let endio to
      handle everything.
      
      So this patch will solve the double freeing by make
      raid56_parity_write() to handle the error and free the rbio.
      
      Just like what we do in raid56_parity_recover().
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ab4c54c6
    • Qu Wenruo's avatar
      btrfs: raid56: properly handle the error when unable to find the missing stripe · f15fb2cd
      Qu Wenruo authored
      In raid56_alloc_missing_rbio(), if we can not determine where the
      missing device is inside the full stripe, we just BUG_ON().
      
      This is not necessary especially the only caller inside scrub.c is
      already properly checking the return value, and will treat it as a
      memory allocation failure.
      
      Fix the error handling by:
      
      - Add an extra warning for the reason
        Although personally speaking it may be better to be an ASSERT().
      
      - Properly free the allocated rbio
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f15fb2cd
  9. 14 Oct, 2022 1 commit
  10. 11 Oct, 2022 5 commits
    • Filipe Manana's avatar
      btrfs: ignore fiemap path cache if we have multiple leaves for a data extent · 63c84b46
      Filipe Manana authored
      The path cache used during fiemap used to determine the sharedness of
      extent buffers in a path from a leaf containing a file extent item
      pointing to our data extent up to the root node of the tree, is meant to
      be used for a single path. Having a single path is by far the most common
      case, and therefore worth to optimize for, but it's possible to actually
      have multiple paths because we have 2 or more leaves.
      
      If we have multiple leaves, the 'level' variable keeps getting incremented
      in each iteration of the while loop at btrfs_is_data_extent_shared(),
      which means we will treat the second leaf in the 'tmp' ulist as a level 1
      node, and so forth. In the worst case this can lead to getting a level
      greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
      WARN_ON_ONCE() in the functions to lookup from or store in the path cache
      (lookup_backref_shared_cache() and store_backref_shared_cache()). If the
      current level never goes beyond 8, due to shared nodes in the paths and
      a fs tree height smaller than 8, it can still result in incorrectly
      marking one leaf as shared because some other leaf is shared and is stored
      one level below that other leaf, as when storing a true sharedness value
      in the cache results in updating the sharedness to true of all entries in
      the cache below the current level.
      
      Having multiple leaves happens in a case like the following:
      
        - We have a file extent item point to data extent at bytenr X, for
          a file range [0, 1M[ for example;
      
        - At this moment we have an extent data ref for the extent, with
          an offset of 0 and a count of 1;
      
        - A write into the middle of the extent happens, file range [64K, 128K)
          so the file extent item is split into two (at btrfs_drop_extents()):
      
          1) One for file range [0, 64K), with a length (num_bytes field) of
             64K and an extent offset of 0;
      
          2) Another one for file range [128K, 1M), with a length of 896K
             (1M - 128K) and an extent offset of 128K.
      
        - At this moment the two file extent items are located in the same
          leaf;
      
        - A new file extent item for the range [64K, 128K), pointing to a new
          data extent, is inserted in the leaf. This results in a leaf split
          and now those two file extent items pointing to data extent X end
          up located in different leaves;
      
        - Once delayed refs are run, we still have a single extent data ref
          item for our data extent at bytenr X, for offset 0, but now with a
          count of 2 instead of 1;
      
        - So during fiemap, at btrfs_is_data_extent_shared(), after we call
          find_parent_nodes() for the data extent, we get two leaves, since
          we have two file extent items point to data extent at bytenr X that
          are located in two different leaves.
      
      So skip the use of the path cache when we get more than one leaf.
      
      Fixes: 12a824dc ("btrfs: speedup checking for extent sharedness during fiemap")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      63c84b46
    • Filipe Manana's avatar
      btrfs: fix processing of delayed tree block refs during backref walking · 943553ef
      Filipe Manana authored
      During backref walking, when processing a delayed reference with a type of
      BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
      
      1) We are accessing the delayed references extent_op, and its key, without
         the protection of the delayed ref head's lock;
      
      2) If there's no extent op for the delayed ref head, we end up with an
         uninitialized key in the stack, variable 'tmp_op_key', and then pass
         it to add_indirect_ref(), which adds the reference to the indirect
         refs rb tree.
      
         This is wrong, because indirect references should have a NULL key
         when we don't have access to the key, and in that case they should be
         added to the indirect_missing_keys rb tree and not to the indirect rb
         tree.
      
         This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
         from freeing an extent buffer, therefore with a count of -1, it will
         not cancel out the corresponding reference we have in the extent tree
         (with a count of 1), since both references end up in different rb
         trees.
      
         When using fiemap, where we often need to check if extents are shared
         through shared subtrees resulting from snapshots, it means we can
         incorrectly report an extent as shared when it's no longer shared.
         However this is temporary because after the transaction is committed
         the extent is no longer reported as shared, as running the delayed
         reference results in deleting the tree block reference from the extent
         tree.
      
         Outside the fiemap context, the result is unpredictable, as the key was
         not initialized but it's used when navigating the rb trees to insert
         and search for references (prelim_ref_compare()), and we expect all
         references in the indirect rb tree to have valid keys.
      
      The following reproducer triggers the second bug:
      
         $ cat test.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount -o compress $DEV $MNT
      
         # With a compressed 128M file we get a tree height of 2 (level 1 root).
         xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
      
         btrfs subvolume snapshot $MNT $MNT/snap
      
         # Fiemap should output 0x2008 in the flags column.
         # 0x2000 means shared extent
         # 0x8 means encoded extent (because it's compressed)
         echo
         echo "fiemap after snapshot, range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         # Overwrite one extent and fsync to flush delalloc and COW a new path
         # in the snapshot's tree.
         #
         # After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
         # BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
         # buffer in the path.
         #
         # In the extent tree we have inline references of type
         # BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
         # buffers, so they should cancel each other, and the extent buffers in
         # the fs tree should no longer be considered as shared.
         #
         echo "Overwriting file range [120M, 120M + 128K)..."
         xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
         xfs_io -c "fsync" $MNT/snap/foo
      
         # Fiemap should output 0x8 in the flags column. The extent in the range
         # [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
         # tree.
         echo
         echo "fiemap after overwrite range [120M, 120M + 128K):"
         xfs_io -c "fiemap -v 120M 128K" $MNT/foo
         echo
      
         umount $MNT
      
      Running it before this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
      The extent in the range [120M, 120M + 128K) is still reported as shared
      (0x2000 bit set) after overwriting that range and flushing delalloc, which
      is not correct - an entire path was COWed in the snapshot's tree and the
      extent is now only referenced by the original fs tree.
      
      Running it after this patch:
      
         $ ./test.sh
         (...)
         wrote 134217728/134217728 bytes at offset 0
         128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
         Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
      
         fiemap after snapshot, range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256 0x2008
      
         Overwriting file range [120M, 120M + 128K)...
         wrote 131072/131072 bytes at offset 125829120
         128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
      
         fiemap after overwrite range [120M, 120M + 128K):
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [245760..246015]: 34304..34559       256   0x8
      
      Now the extent is not reported as shared anymore.
      
      So fix this by passing a NULL key pointer to add_indirect_ref() when
      processing a delayed reference for a tree block if there's no extent op
      for our delayed ref head with a defined key. Also access the extent op
      only after locking the delayed ref head's lock.
      
      The reproducer will be converted later to a test case for fstests.
      
      Fixes: 86d5f994 ("btrfs: convert prelimary reference tracking to use rbtrees")
      Fixes: a6dbceaf ("btrfs: Remove unused op_key var from add_delayed_refs")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      943553ef
    • Filipe Manana's avatar
      btrfs: fix processing of delayed data refs during backref walking · 4fc7b572
      Filipe Manana authored
      When processing delayed data references during backref walking and we are
      using a share context (we are being called through fiemap), whenever we
      find a delayed data reference for an inode different from the one we are
      interested in, then we immediately exit and consider the data extent as
      shared. This is wrong, because:
      
      1) This might be a DROP reference that will cancel out a reference in the
         extent tree;
      
      2) Even if it's an ADD reference, it may be followed by a DROP reference
         that cancels it out.
      
      In either case we should not exit immediately.
      
      Fix this by never exiting when we find a delayed data reference for
      another inode - instead add the reference and if it does not cancel out
      other delayed reference, we will exit early when we call
      extent_is_shared() after processing all delayed references. If we find
      a drop reference, then signal the code that processes references from
      the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
      immediately if it finds there a reference for another inode, since we
      have delayed drop references that may cancel it out. In this later case
      we exit once we don't have references in the rb trees that cancel out
      each other and have two references for different inodes.
      
      Example reproducer for case 1):
      
         $ cat test-1.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      Example reproducer for case 2):
      
         $ cat test-2.sh
         #!/bin/bash
      
         DEV=/dev/sdj
         MNT=/mnt/sdj
      
         mkfs.btrfs -f $DEV
         mount $DEV $MNT
      
         xfs_io -f -c "pwrite 0 64K" $MNT/foo
         cp --reflink=always $MNT/foo $MNT/bar
      
         # Flush delayed references to the extent tree and commit current
         # transaction.
         sync
      
         echo
         echo "fiemap after cloning:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         rm -f $MNT/bar
         echo
         echo "fiemap after removing file bar:"
         xfs_io -c "fiemap -v" $MNT/foo
      
         umount $MNT
      
      Running it before this patch, the extent is still listed as shared, it has
      the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
      After this patch, after deleting bar in both tests, the extent is not
      reported with the 0x2000 flag anymore, it gets only the flag 0x1
      (which is FIEMAP_EXTENT_LAST):
      
         $ ./test-1.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
         $ ./test-2.sh
         fiemap after cloning:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128 0x2001
      
         fiemap after removing file bar:
         /mnt/sdj/foo:
          EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
            0: [0..127]:        26624..26751       128   0x1
      
      These tests will later be converted to a test case for fstests.
      
      Fixes: dc046b10 ("Btrfs: make fiemap not blow when you have lots of snapshots")
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4fc7b572
    • David Sterba's avatar
      btrfs: delete stale comments after merge conflict resolution · 295a53cc
      David Sterba authored
      There are two comments in btrfs_cache_block_group that I left when
      resolving conflict between commits ced8ecf0 "btrfs: fix space cache
      corruption and potential double allocations" and 527c490f "btrfs:
      delete btrfs_wait_space_cache_v1_finished".
      
      The former reworked the caching logic to wait until the caching ends in
      btrfs_cache_block_group while the latter only open coded the waiting.
      Both removed btrfs_wait_space_cache_v1_finished, the correct code is
      with the waiting and returning error. Thus the conflict resolution was
      OK.
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      295a53cc
    • Josef Bacik's avatar
      btrfs: unlock locked extent area if we have contention · 9e769bd7
      Josef Bacik authored
      In production we hit the following deadlock
      
      task 1			task 2			task 3
      ------			------			------
      fiemap(file)		falloc(file)		fsync(file)
      						  write(0, 1MiB)
      						  btrfs_commit_transaction()
      						    wait_on(!pending_ordered)
      			  lock(512MiB, 1GiB)
      			  start_transaction
      			    wait_on_transaction
      
        lock(0, 1GiB)
          wait_extent_bit(512MiB)
      
      task 4
      ------
      finish_ordered_extent(0, 1MiB)
        lock(0, 1MiB)
        **DEADLOCK**
      
      This occurs because when task 1 does it's lock, it locks everything from
      0-512MiB, and then waits for the 512MiB chunk to unlock.  task 2 will
      never unlock because it's waiting on the transaction commit to happen,
      the transaction commit is waiting for the outstanding ordered extents,
      and then the ordered extent thread is blocked waiting on the 0-1MiB
      range to unlock.
      
      To fix this we have to clear anything we've locked so far, wait for the
      extent_state that we contended on, and then try to re-lock the entire
      range again.
      
      CC: stable@vger.kernel.org # 5.15+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e769bd7