    Btrfs: fix deadlock when finalizing block group creation · d9a0540a
    Filipe Manana authored
    Josef ran into a deadlock while a transaction handle was finalizing the
    creation of its block groups, which produced the following trace:
    
      [260445.593112] fio             D ffff88022a9df468     0  8924   4518 0x00000084
      [260445.593119]  ffff88022a9df468 ffffffff81c134c0 ffff880429693c00 ffff88022a9df488
      [260445.593126]  ffff88022a9e0000 ffff8803490d7b00 ffff8803490d7b18 ffff88022a9df4b0
      [260445.593132]  ffff8803490d7af8 ffff88022a9df488 ffffffff8175a437 ffff8803490d7b00
      [260445.593137] Call Trace:
      [260445.593145]  [<ffffffff8175a437>] schedule+0x37/0x80
      [260445.593189]  [<ffffffffa0850f37>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
      [260445.593197]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
      [260445.593225]  [<ffffffffa07eac44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
      [260445.593253]  [<ffffffffa07eff6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
      [260445.593295]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
      [260445.593324]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
      [260445.593351]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
      [260445.593394]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
      [260445.593427]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
      [260445.593459]  [<ffffffffa0800964>] do_chunk_alloc+0x2a4/0x2e0 [btrfs]
      [260445.593491]  [<ffffffffa0803815>] find_free_extent+0xa55/0xd90 [btrfs]
      [260445.593524]  [<ffffffffa0803c22>] btrfs_reserve_extent+0xd2/0x220 [btrfs]
      [260445.593532]  [<ffffffff8119fe5d>] ? account_page_dirtied+0xdd/0x170
      [260445.593564]  [<ffffffffa0803e78>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
      [260445.593597]  [<ffffffffa080c9de>] ? btree_set_page_dirty+0xe/0x10 [btrfs]
      [260445.593626]  [<ffffffffa07eb5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
      [260445.593654]  [<ffffffffa07ebbff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
      [260445.593682]  [<ffffffffa07ef8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
      [260445.593724]  [<ffffffffa08389df>] ? free_extent_buffer+0x4f/0x90 [btrfs]
      [260445.593752]  [<ffffffffa07f1a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
      [260445.593830]  [<ffffffffa07ea94a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
      [260445.593905]  [<ffffffffa08403b9>] btrfs_finish_chunk_alloc+0x1c9/0x570 [btrfs]
      [260445.593946]  [<ffffffffa08002ab>] btrfs_create_pending_block_groups+0x11b/0x200 [btrfs]
      [260445.593990]  [<ffffffffa0815798>] btrfs_commit_transaction+0xa8/0xb40 [btrfs]
      [260445.594042]  [<ffffffffa085abcd>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
      [260445.594089]  [<ffffffffa082bc84>] btrfs_sync_file+0x294/0x350 [btrfs]
      [260445.594115]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
      [260445.594133]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
      [260445.594149]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
      [260445.594169]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
      [260445.594187]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
      [260445.594204]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
    
    This happened because the same transaction handle created a large number
    of block groups. While finalizing their creation (inserting new items
    and updating existing items in the chunk and device trees), a new
    metadata extent had to be allocated and no free space was found in the
    current metadata block groups, which made find_free_extent() attempt to
    allocate a new block group via do_chunk_alloc(). However, in
    do_chunk_alloc() we ended up allocating a new system chunk too and
    exceeded the threshold of 2MB of reserved chunk bytes, which makes
    do_chunk_alloc() enter the final part of block group creation again (at
    btrfs_create_pending_block_groups()) and attempt to lock the root of the
    chunk tree again while it is already write locked by the same task.
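
    The re-entrant locking described above can be modelled in user space.
    This is only a sketch under stated assumptions: struct tree_lock,
    struct handle, try_write_lock(), alloc_chunk() and
    finalize_block_groups() are hypothetical stand-ins, not btrfs code, and
    the ">=" comparison against the 2MB threshold is assumed. A failed
    try_write_lock() marks the point where the real, blocking tree lock
    would sleep forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical non-recursive write lock standing in for a tree root lock. */
struct tree_lock {
    bool write_locked;
};

/* Returns false where a real blocking lock would wait forever on itself. */
static bool try_write_lock(struct tree_lock *l)
{
    if (l->write_locked)
        return false;
    l->write_locked = true;
    return true;
}

static void write_unlock(struct tree_lock *l)
{
    l->write_locked = false;
}

/* Assumed threshold of reserved chunk bytes (2MB, as in the text above). */
#define RESERVED_THRESHOLD (2UL * 1024 * 1024)

/* Hypothetical stand-in for a transaction handle. */
struct handle {
    struct tree_lock *chunk_root;
    unsigned long chunk_bytes_reserved;
    bool self_deadlock;  /* set when the task re-locks what it already holds */
};

static void finalize_block_groups(struct handle *h, unsigned long nested_bytes);

/* Model of chunk allocation: crossing the threshold re-enters finalization. */
static void alloc_chunk(struct handle *h, unsigned long new_bytes)
{
    h->chunk_bytes_reserved += new_bytes;
    if (h->chunk_bytes_reserved >= RESERVED_THRESHOLD)
        finalize_block_groups(h, 0);
}

/* Model of finalization: holds the chunk root lock while inserting items. */
static void finalize_block_groups(struct handle *h, unsigned long nested_bytes)
{
    if (!try_write_lock(h->chunk_root)) {
        h->self_deadlock = true;
        return;
    }
    /* Inserting an item needed a new tree block, hence a new chunk. */
    if (nested_bytes > 0)
        alloc_chunk(h, nested_bytes);
    write_unlock(h->chunk_root);
}
```

    Calling finalize_block_groups() with a nested allocation that reaches
    the threshold sets self_deadlock, while a smaller nested allocation
    completes normally.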
    
    Similarly, we can deadlock on extent tree nodes/leaves if, while running
    delayed references, we end up creating a new metadata block group in
    order to allocate a new node/leaf for the extent tree (as part of a CoW
    operation or growing the tree), as btrfs_create_pending_block_groups()
    inserts items into the extent tree as well. In this case we get the
    following trace:
    
      [14242.773581] fio             D ffff880428ca3418     0  3615   3100 0x00000084
      [14242.773588]  ffff880428ca3418 ffff88042d66b000 ffff88042a03c800 ffff880428ca3438
      [14242.773594]  ffff880428ca4000 ffff8803e4b20190 ffff8803e4b201a8 ffff880428ca3460
      [14242.773600]  ffff8803e4b20188 ffff880428ca3438 ffffffff8175a437 ffff8803e4b20190
      [14242.773606] Call Trace:
      [14242.773613]  [<ffffffff8175a437>] schedule+0x37/0x80
      [14242.773656]  [<ffffffffa057ff07>] btrfs_tree_lock+0xa7/0x1f0 [btrfs]
      [14242.773664]  [<ffffffff810db7c0>] ? prepare_to_wait_event+0xf0/0xf0
      [14242.773692]  [<ffffffffa0519c44>] btrfs_lock_root_node+0x34/0x50 [btrfs]
      [14242.773720]  [<ffffffffa051ef6b>] btrfs_search_slot+0x88b/0xa00 [btrfs]
      [14242.773750]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
      [14242.773758]  [<ffffffff811ef4a2>] ? kmem_cache_alloc+0x1d2/0x200
      [14242.773786]  [<ffffffffa0520ad1>] btrfs_insert_item+0x71/0xf0 [btrfs]
      [14242.773818]  [<ffffffffa052f292>] btrfs_create_pending_block_groups+0x102/0x200 [btrfs]
      [14242.773850]  [<ffffffffa052f96e>] do_chunk_alloc+0x2ae/0x2f0 [btrfs]
      [14242.773934]  [<ffffffffa0532825>] find_free_extent+0xa55/0xd90 [btrfs]
      [14242.773998]  [<ffffffffa0532c22>] btrfs_reserve_extent+0xc2/0x1d0 [btrfs]
      [14242.774041]  [<ffffffffa0532e38>] btrfs_alloc_tree_block+0x108/0x4a0 [btrfs]
      [14242.774078]  [<ffffffffa051a5cd>] __btrfs_cow_block+0x12d/0x5b0 [btrfs]
      [14242.774118]  [<ffffffffa051abff>] btrfs_cow_block+0x11f/0x1c0 [btrfs]
      [14242.774155]  [<ffffffffa051e8c7>] btrfs_search_slot+0x1e7/0xa00 [btrfs]
      [14242.774194]  [<ffffffffa0528021>] ? __btrfs_free_extent.isra.70+0x2e1/0xcb0 [btrfs]
      [14242.774235]  [<ffffffffa0520a06>] btrfs_insert_empty_items+0x66/0xc0 [btrfs]
      [14242.774274]  [<ffffffffa051994a>] ? btrfs_alloc_path+0x1a/0x20 [btrfs]
      [14242.774318]  [<ffffffffa052c433>] __btrfs_run_delayed_refs+0xbb3/0x1020 [btrfs]
      [14242.774358]  [<ffffffffa052f404>] btrfs_run_delayed_refs.part.78+0x74/0x280 [btrfs]
      [14242.774391]  [<ffffffffa052f627>] btrfs_run_delayed_refs+0x17/0x20 [btrfs]
      [14242.774432]  [<ffffffffa05be236>] commit_cowonly_roots+0x8d/0x2bd [btrfs]
      [14242.774474]  [<ffffffffa059d07f>] ? __btrfs_run_delayed_items+0x1cf/0x210 [btrfs]
      [14242.774516]  [<ffffffffa05adac3>] ? btrfs_qgroup_account_extents+0x83/0x130 [btrfs]
      [14242.774558]  [<ffffffffa0544c40>] btrfs_commit_transaction+0x590/0xb40 [btrfs]
      [14242.774599]  [<ffffffffa0589b9d>] ? btrfs_log_dentry_safe+0x6d/0x80 [btrfs]
      [14242.774642]  [<ffffffffa055ac54>] btrfs_sync_file+0x294/0x350 [btrfs]
      [14242.774650]  [<ffffffff8123e29b>] vfs_fsync_range+0x3b/0xa0
      [14242.774657]  [<ffffffff81023891>] ? syscall_trace_enter_phase1+0x131/0x180
      [14242.774663]  [<ffffffff8123e35d>] do_fsync+0x3d/0x70
      [14242.774669]  [<ffffffff81023bb8>] ? syscall_trace_leave+0xb8/0x110
      [14242.774675]  [<ffffffff8123e600>] SyS_fsync+0x10/0x20
      [14242.774681]  [<ffffffff8175de6e>] entry_SYSCALL_64_fastpath+0x12/0x71
    
    Fix this by never recursing into the finalization phase of block group
    creation and making sure we never trigger the finalization of block group
    creation while running delayed references.
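
    The idea of the fix can be sketched with a guard on the transaction
    handle. This is a minimal user-space model, not the actual patch: struct
    trans_handle, its fields, chunk_alloc() and
    create_pending_block_groups() below are hypothetical stand-ins. New
    block groups created while the guard is set are only queued, and the
    finalization loop that is already running picks them up later.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transaction handle, not the kernel struct. */
struct trans_handle {
    int pending_block_groups;    /* created but not yet finalized */
    int finalized_block_groups;  /* finalization completed */
    int nested_allocs;           /* chunk allocations triggered while finalizing */
    bool finalizing;             /* hypothetical guard: finalization in progress */
    bool running_delayed_refs;   /* hypothetical guard: delayed refs are running */
};

static void create_pending_block_groups(struct trans_handle *t);

/* Model of chunk allocation: queue the block group, finalize only if safe. */
static void chunk_alloc(struct trans_handle *t)
{
    t->pending_block_groups++;
    if (!t->finalizing && !t->running_delayed_refs)
        create_pending_block_groups(t);
}

/* Model of finalization: never entered recursively thanks to the guard. */
static void create_pending_block_groups(struct trans_handle *t)
{
    assert(!t->finalizing);
    t->finalizing = true;
    while (t->pending_block_groups > 0) {
        t->pending_block_groups--;
        t->finalized_block_groups++;
        /* Finalizing may itself need a new chunk; it is queued, not nested. */
        if (t->nested_allocs > 0) {
            t->nested_allocs--;
            chunk_alloc(t);
        }
    }
    t->finalizing = false;
}
```

    With nested_allocs set to 3, a single chunk_alloc() call finalizes four
    block groups without nesting. With running_delayed_refs set,
    chunk_alloc() leaves the new block group pending until finalization is
    called explicitly later.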

    Reported-by: Josef Bacik <jbacik@fb.com>
    Fixes: 00d80e34 ("Btrfs: fix quick exhaustion of the system array in the superblock")
    Signed-off-by: Filipe Manana <fdmanana@suse.com>