• Filipe Manana's avatar
    btrfs: do not modify log tree while holding a leaf from fs tree locked · 796787c9
    Filipe Manana authored
    When logging an inode in full mode, or when logging xattrs or when logging
    the dir index items of a directory, we are modifying the log tree while
    holding a read lock on a leaf from the fs/subvolume tree. This can lead to
    a deadlock in rare circumstances, but it is a real possibility, and it was
    recently reported by syzbot with the following trace from lockdep:
    
       WARNING: possible circular locking dependency detected
       6.1.0-rc5-next-20221116-syzkaller #0 Not tainted
       ------------------------------------------------------
       syz-executor.1/16154 is trying to acquire lock:
       ffff88807e3084a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
    
       but task is already holding lock:
       ffff88807df33078 (btrfs-log-00){++++}-{3:3}, at: __btrfs_tree_lock+0x32/0x3d0 fs/btrfs/locking.c:197
    
       which lock already depends on the new lock.
    
       the existing dependency chain (in reverse order) is:
    
       -> #2 (btrfs-log-00){++++}-{3:3}:
              down_read_nested+0x9e/0x450 kernel/locking/rwsem.c:1634
              __btrfs_tree_read_lock+0x32/0x350 fs/btrfs/locking.c:135
              btrfs_tree_read_lock fs/btrfs/locking.c:141 [inline]
              btrfs_read_lock_root_node+0x82/0x3a0 fs/btrfs/locking.c:280
              btrfs_search_slot_get_root fs/btrfs/ctree.c:1678 [inline]
              btrfs_search_slot+0x3ca/0x2c70 fs/btrfs/ctree.c:1998
              btrfs_lookup_csum+0x116/0x3f0 fs/btrfs/file-item.c:209
              btrfs_csum_file_blocks+0x40e/0x1370 fs/btrfs/file-item.c:1021
              log_csums.isra.0+0x244/0x2d0 fs/btrfs/tree-log.c:4258
              copy_items.isra.0+0xbfb/0xed0 fs/btrfs/tree-log.c:4403
              copy_inode_items_to_log+0x13d6/0x1d90 fs/btrfs/tree-log.c:5873
              btrfs_log_inode+0xb19/0x4680 fs/btrfs/tree-log.c:6495
              btrfs_log_inode_parent+0x890/0x2a20 fs/btrfs/tree-log.c:6982
              btrfs_log_dentry_safe+0x59/0x80 fs/btrfs/tree-log.c:7083
              btrfs_sync_file+0xa41/0x13c0 fs/btrfs/file.c:1921
              vfs_fsync_range+0x13e/0x230 fs/sync.c:188
              generic_write_sync include/linux/fs.h:2856 [inline]
              iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
              btrfs_direct_write fs/btrfs/file.c:1536 [inline]
              btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
              call_write_iter include/linux/fs.h:2160 [inline]
              do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
              do_iter_write+0x182/0x700 fs/read_write.c:861
              vfs_iter_write+0x74/0xa0 fs/read_write.c:902
              iter_file_splice_write+0x745/0xc90 fs/splice.c:686
              do_splice_from fs/splice.c:764 [inline]
              direct_splice_actor+0x114/0x180 fs/splice.c:931
              splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
              do_splice_direct+0x1ab/0x280 fs/splice.c:974
              do_sendfile+0xb19/0x1270 fs/read_write.c:1255
              __do_sys_sendfile64 fs/read_write.c:1323 [inline]
              __se_sys_sendfile64 fs/read_write.c:1309 [inline]
              __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
              do_syscall_x64 arch/x86/entry/common.c:50 [inline]
              do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
              entry_SYSCALL_64_after_hwframe+0x63/0xcd
    
       -> #1 (btrfs-tree-00){++++}-{3:3}:
              __lock_release kernel/locking/lockdep.c:5382 [inline]
              lock_release+0x371/0x810 kernel/locking/lockdep.c:5688
              up_write+0x2a/0x520 kernel/locking/rwsem.c:1614
              btrfs_tree_unlock_rw fs/btrfs/locking.h:189 [inline]
              btrfs_unlock_up_safe+0x1e3/0x290 fs/btrfs/locking.c:238
              search_leaf fs/btrfs/ctree.c:1832 [inline]
              btrfs_search_slot+0x265e/0x2c70 fs/btrfs/ctree.c:2074
              btrfs_insert_empty_items+0xbd/0x1c0 fs/btrfs/ctree.c:4133
              btrfs_insert_delayed_item+0x826/0xfa0 fs/btrfs/delayed-inode.c:746
              btrfs_insert_delayed_items fs/btrfs/delayed-inode.c:824 [inline]
              __btrfs_commit_inode_delayed_items fs/btrfs/delayed-inode.c:1111 [inline]
              __btrfs_run_delayed_items+0x280/0x590 fs/btrfs/delayed-inode.c:1153
              flush_space+0x147/0xe90 fs/btrfs/space-info.c:728
              btrfs_async_reclaim_metadata_space+0x541/0xc10 fs/btrfs/space-info.c:1086
              process_one_work+0x9bf/0x1710 kernel/workqueue.c:2289
              worker_thread+0x669/0x1090 kernel/workqueue.c:2436
              kthread+0x2e8/0x3a0 kernel/kthread.c:376
              ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
    
       -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
              check_prev_add kernel/locking/lockdep.c:3097 [inline]
              check_prevs_add kernel/locking/lockdep.c:3216 [inline]
              validate_chain kernel/locking/lockdep.c:3831 [inline]
              __lock_acquire+0x2a43/0x56d0 kernel/locking/lockdep.c:5055
              lock_acquire kernel/locking/lockdep.c:5668 [inline]
              lock_acquire+0x1e3/0x630 kernel/locking/lockdep.c:5633
              __mutex_lock_common kernel/locking/mutex.c:603 [inline]
              __mutex_lock+0x12f/0x1360 kernel/locking/mutex.c:747
              __btrfs_release_delayed_node.part.0+0xa1/0xf30 fs/btrfs/delayed-inode.c:256
              __btrfs_release_delayed_node fs/btrfs/delayed-inode.c:251 [inline]
              btrfs_release_delayed_node fs/btrfs/delayed-inode.c:281 [inline]
              btrfs_remove_delayed_node+0x52/0x60 fs/btrfs/delayed-inode.c:1285
              btrfs_evict_inode+0x511/0xf30 fs/btrfs/inode.c:5554
              evict+0x2ed/0x6b0 fs/inode.c:664
              dispose_list+0x117/0x1e0 fs/inode.c:697
              prune_icache_sb+0xeb/0x150 fs/inode.c:896
              super_cache_scan+0x391/0x590 fs/super.c:106
              do_shrink_slab+0x464/0xce0 mm/vmscan.c:843
              shrink_slab_memcg mm/vmscan.c:912 [inline]
              shrink_slab+0x388/0x660 mm/vmscan.c:991
              shrink_node_memcgs mm/vmscan.c:6088 [inline]
              shrink_node+0x93d/0x1f30 mm/vmscan.c:6117
              shrink_zones mm/vmscan.c:6355 [inline]
              do_try_to_free_pages+0x3b4/0x17a0 mm/vmscan.c:6417
              try_to_free_mem_cgroup_pages+0x3a4/0xa70 mm/vmscan.c:6732
              reclaim_high.constprop.0+0x182/0x230 mm/memcontrol.c:2393
              mem_cgroup_handle_over_high+0x190/0x520 mm/memcontrol.c:2578
              try_charge_memcg+0xe0c/0x12f0 mm/memcontrol.c:2816
              try_charge mm/memcontrol.c:2827 [inline]
              charge_memcg+0x90/0x3b0 mm/memcontrol.c:6889
              __mem_cgroup_charge+0x2b/0x90 mm/memcontrol.c:6910
              mem_cgroup_charge include/linux/memcontrol.h:667 [inline]
              __filemap_add_folio+0x615/0xf80 mm/filemap.c:852
              filemap_add_folio+0xaf/0x1e0 mm/filemap.c:934
              __filemap_get_folio+0x389/0xd80 mm/filemap.c:1976
              pagecache_get_page+0x2e/0x280 mm/folio-compat.c:104
              find_or_create_page include/linux/pagemap.h:612 [inline]
              alloc_extent_buffer+0x2b9/0x1580 fs/btrfs/extent_io.c:4588
              btrfs_init_new_buffer fs/btrfs/extent-tree.c:4869 [inline]
              btrfs_alloc_tree_block+0x2e1/0x1320 fs/btrfs/extent-tree.c:4988
              __btrfs_cow_block+0x3b2/0x1420 fs/btrfs/ctree.c:440
              btrfs_cow_block+0x2fa/0x950 fs/btrfs/ctree.c:595
              btrfs_search_slot+0x11b0/0x2c70 fs/btrfs/ctree.c:2038
              btrfs_update_root+0xdb/0x630 fs/btrfs/root-tree.c:137
              update_log_root fs/btrfs/tree-log.c:2841 [inline]
              btrfs_sync_log+0xbfb/0x2870 fs/btrfs/tree-log.c:3064
              btrfs_sync_file+0xdb9/0x13c0 fs/btrfs/file.c:1947
              vfs_fsync_range+0x13e/0x230 fs/sync.c:188
              generic_write_sync include/linux/fs.h:2856 [inline]
              iomap_dio_complete+0x73a/0x920 fs/iomap/direct-io.c:128
              btrfs_direct_write fs/btrfs/file.c:1536 [inline]
              btrfs_do_write_iter+0xba2/0x1470 fs/btrfs/file.c:1668
              call_write_iter include/linux/fs.h:2160 [inline]
              do_iter_readv_writev+0x20b/0x3b0 fs/read_write.c:735
              do_iter_write+0x182/0x700 fs/read_write.c:861
              vfs_iter_write+0x74/0xa0 fs/read_write.c:902
              iter_file_splice_write+0x745/0xc90 fs/splice.c:686
              do_splice_from fs/splice.c:764 [inline]
              direct_splice_actor+0x114/0x180 fs/splice.c:931
              splice_direct_to_actor+0x335/0x8a0 fs/splice.c:886
              do_splice_direct+0x1ab/0x280 fs/splice.c:974
              do_sendfile+0xb19/0x1270 fs/read_write.c:1255
              __do_sys_sendfile64 fs/read_write.c:1323 [inline]
              __se_sys_sendfile64 fs/read_write.c:1309 [inline]
              __x64_sys_sendfile64+0x259/0x2c0 fs/read_write.c:1309
              do_syscall_x64 arch/x86/entry/common.c:50 [inline]
              do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
              entry_SYSCALL_64_after_hwframe+0x63/0xcd
    
       other info that might help us debug this:
    
       Chain exists of:
         &delayed_node->mutex --> btrfs-tree-00 --> btrfs-log-00
    
       Possible unsafe locking scenario:
    
              CPU0                    CPU1
              ----                    ----
         lock(btrfs-log-00);
                                      lock(btrfs-tree-00);
                                      lock(btrfs-log-00);
         lock(&delayed_node->mutex);
    
    Holding a read lock on a leaf from a fs/subvolume tree creates a nasty
    lock dependency when we are COWing extent buffers for the log tree and we
    have two tasks modifying the log tree, with each one in one of the
    following 2 scenarios:
    
    1) Modifying the log tree triggers an extent buffer allocation while
       holding a write lock on a parent extent buffer from the log tree.
       Allocating the pages for an extent buffer, or the extent buffer
       struct, can trigger inode eviction and finally the inode eviction
       will trigger a release/remove of a delayed node, which requires
       taking the delayed node's mutex;
    
    2) Allocating a metadata extent for a log tree can trigger the async
       reclaim thread and make us wait for it to release enough space and
       unblock our reservation ticket. The reclaim thread can start flushing
       delayed items, and that in turn results in the need to lock delayed
       node mutexes and in the need to write lock extent buffers of a
       subvolume tree - all this while holding a write lock on the parent
       extent buffer in the log tree.
    
    So one task in scenario 1) running in parallel with another task in
    scenario 2) could lead to a deadlock, one wanting to lock a delayed node
    mutex while having a read lock on a leaf from the subvolume, while the
    other is holding the delayed node's mutex and wants to write lock the same
    subvolume leaf for flushing delayed items.
    
    Fix this by cloning the leaf of the fs/subvolume tree, release/unlock the
    fs/subvolume leaf and use the clone leaf instead.
    
    Reported-by: syzbot+9b7c21f486f5e7f8d029@syzkaller.appspotmail.com
    Link: https://lore.kernel.org/linux-btrfs/000000000000ccc93c05edc4d8cf@google.com/
    CC: stable@vger.kernel.org # 6.0+
    Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
    796787c9
tree-log.c 210 KB