1. 07 Oct, 2020 40 commits
    • btrfs: reschedule if necessary when logging directory items · bb56f02f
      Filipe Manana authored
      Logging directories with many entries can take a significant amount of
      time, and in some cases monopolize a cpu/core for a long time if the
      logging task doesn't happen to block often enough.
      
      Johannes and Lu Fengqi reported test case generic/041 triggering a soft
      lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test
      case we log an inode with 3002 hard links, and because the test removed
      one hard link before fsyncing the file, the inode logging causes the
      parent directory to be logged as well, which has 6004 directory items to
      log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items),
      so it can take a significant amount of time and trigger the soft lockup.
      
      So just make tree-log.c:log_dir_items() reschedule when necessary,
      releasing the current search path before doing so and then resuming
      from where it was before the reschedule.
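
      The release/yield/resume pattern described above can be sketched in
      userspace C. This is only an illustration, not the kernel change: the
      struct cursor type, search_from(), log_items() and the fixed batch size
      are hypothetical, and sched_yield() stands in for the kernel's
      reschedule point.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical stand-in for a search path: a slot in a sorted key array. */
struct cursor {
	size_t slot;
	int held;	/* 1 while the "path" is held */
};

/* Position the cursor at the first key >= min_key (linear scan for brevity). */
static void search_from(struct cursor *c, const unsigned long *keys, size_t nr,
			unsigned long min_key)
{
	size_t i = 0;

	while (i < nr && keys[i] < min_key)
		i++;
	c->slot = i;
	c->held = 1;
}

/*
 * Log every key, giving up the CPU after each `batch` items. Returns the
 * number of items logged and stores the number of yields in *yields.
 */
static size_t log_items(const unsigned long *keys, size_t nr, size_t batch,
			size_t *yields)
{
	struct cursor c;
	size_t logged = 0;
	unsigned long resume_key = 0;

	*yields = 0;
	search_from(&c, keys, nr, resume_key);
	while (c.slot < nr) {
		assert(c.held);			/* never touch items without the path */
		resume_key = keys[c.slot] + 1;	/* where to continue after a yield */
		logged++;
		c.slot++;
		if (c.slot < nr && logged % batch == 0) {
			c.held = 0;		/* release the path first */
			sched_yield();		/* userspace stand-in for rescheduling */
			(*yields)++;
			search_from(&c, keys, nr, resume_key);	/* resume */
		}
	}
	return logged;
}
```

      The sketch yields on a fixed item count only to stay deterministic; the
      commit message says the real code reschedules "when necessary".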
      
      The stack trace produced when the soft lockup happens is the following:
      
      [10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172]
      [10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...)
      [10480.284915] irq event stamp: 29646366
      [10480.285987] hardirqs last  enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60
      [10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50
      [10480.290856] softirqs last  enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c
      [10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20
      [10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248
      [10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014
      [10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60
      [10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...)
      [10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282
      [10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006
      [10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66
      [10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001
      [10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000
      [10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70
      [10480.321104] FS:  00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000
      [10480.323559] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0
      [10480.327259] Call Trace:
      [10480.328286]  ? overwrite_item+0x1f0/0x5a0 [btrfs]
      [10480.329784]  __kmalloc+0x831/0xa20
      [10480.331009]  ? btrfs_get_32+0xb0/0x1d0 [btrfs]
      [10480.332464]  overwrite_item+0x1f0/0x5a0 [btrfs]
      [10480.333948]  log_dir_items+0x2ee/0x570 [btrfs]
      [10480.335413]  log_directory_changes+0x82/0xd0 [btrfs]
      [10480.336926]  btrfs_log_inode+0xc9b/0xda0 [btrfs]
      [10480.338374]  ? init_once+0x20/0x20 [btrfs]
      [10480.339711]  btrfs_log_inode_parent+0x8d3/0xd10 [btrfs]
      [10480.341257]  ? dget_parent+0x97/0x2e0
      [10480.342480]  btrfs_log_dentry_safe+0x3a/0x50 [btrfs]
      [10480.343977]  btrfs_sync_file+0x24b/0x5e0 [btrfs]
      [10480.345381]  do_fsync+0x38/0x70
      [10480.346483]  __x64_sys_fsync+0x10/0x20
      [10480.347703]  do_syscall_64+0x2d/0x70
      [10480.348891]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [10480.350444] RIP: 0033:0x7fa1dc80970b
      [10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...)
      [10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
      [10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b
      [10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003
      [10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40
      [10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001
      [10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60
      
      Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/
      Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      CC: stable@vger.kernel.org # 4.4+
      Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: do not create raid sysfs entries under any locks · 49ea112d
      Josef Bacik authored
      While running xfstests btrfs/177 I got the following lockdep splat
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #5 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_dir_ns+0x7a/0xb0
      	 sysfs_create_dir_ns+0x60/0xb0
      	 kobject_add_internal+0xc0/0x2c0
      	 kobject_add+0x6e/0x90
      	 btrfs_sysfs_add_block_group_type+0x102/0x160
      	 btrfs_make_block_group+0x167/0x230
      	 btrfs_alloc_chunk+0x54f/0xb80
      	 btrfs_chunk_alloc+0x18e/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_new_inode+0x225/0x730
      	 btrfs_create+0xab/0x1f0
      	 lookup_open.isra.0+0x52d/0x690
      	 path_openat+0x2a7/0x9e0
      	 do_filp_open+0x75/0x100
      	 do_sys_openat2+0x7b/0x130
      	 __x64_sys_openat+0x46/0x70
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_lookup_inode+0x2a/0x8f
      	 __btrfs_update_delayed_inode+0x80/0x240
      	 btrfs_commit_inode_delayed_inode+0x119/0x120
      	 btrfs_evict_inode+0x357/0x500
      	 evict+0xcf/0x1f0
      	 do_unlinkat+0x1a9/0x2b0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because when we link in a block group with a new raid index
      type we'll create the corresponding sysfs entries for it.  This is
      problematic because while restriping we're holding the chunk_mutex, and
      while mounting we're holding the tree locks.
      
      Fixing this isn't pretty, we move the call to the sysfs stuff into the
      btrfs_create_pending_block_groups() work, where we're not holding any
      locks.  This creates a slight race where other threads could see that
      there's no sysfs kobj for that raid type, and race to create the
      sysfs dir.  Fix this by wrapping the creation in space_info->lock, so we
      only get one thread calling kobject_add() for the new directory.  We
      don't worry about the lock on cleanup as it only gets deleted on
      unmount.
      
      On mount it's more straightforward: we already loop through the
      space_infos, so just check every raid index in each space_info and add
      the sysfs entries for the corresponding block groups.
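
      The "only one thread creates the directory" idea can be sketched in
      userspace C as a check-and-set done under a lock. Everything here is an
      assumption made for illustration: struct space_info_model,
      ensure_raid_dir() and NR_RAID_TYPES are hypothetical, and a pthread
      mutex stands in for space_info->lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NR_RAID_TYPES 4		/* hypothetical; not the kernel's value */

/* Hypothetical stand-in for a space_info with one slot per raid type. */
struct space_info_model {
	pthread_mutex_t lock;
	void *kobjs[NR_RAID_TYPES];
	int dirs_created;	/* how many "directories" were really added */
};

/*
 * Make sure the entry for `index` exists. Returns 1 if this caller created
 * it, 0 if another caller already had, and -1 on allocation failure.
 */
static int ensure_raid_dir(struct space_info_model *si, int index)
{
	void *kobj = malloc(1);		/* allocate before taking the lock */
	int created = 0;

	if (!kobj)
		return -1;

	pthread_mutex_lock(&si->lock);
	if (!si->kobjs[index]) {
		si->kobjs[index] = kobj;	/* this caller won the race */
		si->dirs_created++;
		created = 1;
	}
	pthread_mutex_unlock(&si->lock);

	if (!created)
		free(kobj);		/* lost the race: drop the unused copy */
	return created;
}
```

      Allocating before taking the lock keeps the critical section down to
      the check and the assignment, so nothing that may sleep runs inside it.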
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: kill the RCU protection for fs_info->space_info · 72804905
      Josef Bacik authored
      We have this thing wrapped in an RCU lock, but it's really not needed.
      We create all the space_info's on mount, and we destroy them on unmount.
      The list never changes and we're protected from messing with it by the
      normal mount/umount path, so kill the RCU stuff around it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: improve error message in setup_items_for_insert · 7269ddd2
      Nikolay Borisov authored
      Reword and update formats to match variable types.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update formats ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sink total_data parameter in setup_items_for_insert · fc0d82e1
      Nikolay Borisov authored
      That parameter can easily be derived from the "data_size" and "nr"
      parameters. Exploit this fact to simplify the function's signature. No
      functional changes.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: eliminate total_size parameter from setup_items_for_insert · 3dc9dc89
      Nikolay Borisov authored
      The value of this argument can be derived from the total_data as it's
      simply the value of the data size + size of btrfs_items being touched.
      Move the parameter calculation inside the function. This results in a
      simpler interface and also a minor size reduction:
      
      ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34)
      Function                                     old     new   delta
      btrfs_duplicate_item                         260     259      -1
      setup_items_for_insert                      1200    1190     -10
      btrfs_insert_empty_items                     177     154     -23
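
      As a rough illustration of why callers no longer need to pass these
      values, both can be recomputed from the per-item sizes. This is a
      sketch, not the kernel code: calc_total_data(), calc_total_size() and
      ITEM_HEADER_SIZE are hypothetical names, and the constant stands in for
      sizeof(struct btrfs_item).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-item header size; stands in for sizeof(struct btrfs_item). */
#define ITEM_HEADER_SIZE 25u

/* Sum of the payload sizes of the nr items being inserted. */
static uint32_t calc_total_data(const uint32_t *data_size, int nr)
{
	uint32_t total = 0;

	for (int i = 0; i < nr; i++)
		total += data_size[i];
	return total;
}

/* Payload bytes plus one item header per inserted item. */
static uint32_t calc_total_size(const uint32_t *data_size, int nr)
{
	return calc_total_data(data_size, nr) + (uint32_t)nr * ITEM_HEADER_SIZE;
}
```

      With three items of 160, 53 and 12 bytes, the sketch gives 225 bytes of
      data and 300 bytes in total.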
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: re-arrange statements in setup_items_for_insert · fc0716c2
      Nikolay Borisov authored
      Rearrange statements calculating the offset of the newly added items so
      that the calculation has to be done only once. No functional change.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: sysfs: export supported send stream version · 7573df55
      Omar Sandoval authored
      This reports the latest send stream version supported by the kernel as
      the feature in /sys/fs/btrfs/features/send_stream_version .
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: use btrfs_file_extent_end() in send_write_or_clone() · c9a949af
      Omar Sandoval authored
      send_write_or_clone() basically has an open-coded copy of
      btrfs_file_extent_end() except that it (incorrectly) aligns to PAGE_SIZE
      instead of sectorsize. Fix and simplify the code by using
      btrfs_file_extent_end().
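
      The difference between the two alignments shows up whenever the page
      size is larger than the sector size. A minimal sketch, assuming a 4 KiB
      sectorsize and a 64 KiB page size; align_up() and extent_end() are
      hypothetical helpers, not the kernel functions:

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* End offset of an extent at `start` holding `len` bytes, rounded up to `align`. */
static uint64_t extent_end(uint64_t start, uint64_t len, uint64_t align)
{
	return align_up(start + len, align);
}
```

      For 100 bytes of data at offset 0, sector alignment ends the extent at
      4096 while page alignment ends it at 65536, so the page-aligned end
      overshoots whenever pages are bigger than sectors.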
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: avoid copying file data · 8c7d9fe0
      Omar Sandoval authored
      send_write() currently copies from the page cache to sctx->read_buf, and
      then from sctx->read_buf to sctx->send_buf. Similarly, send_hole()
      zeroes sctx->read_buf and then copies from sctx->read_buf to
      sctx->send_buf. However, if we write the TLV header manually, we can
      copy to sctx->send_buf directly and get rid of sctx->read_buf.
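
      Writing the TLV header by hand and then copying the payload straight
      into the send buffer could look roughly like the following userspace
      sketch. It assumes a header made of a 16-bit type and a 16-bit length,
      both little-endian; struct send_ctx_model, put_data_tlv(), ATTR_DATA
      and SEND_BUF_SIZE are hypothetical names and values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SEND_BUF_SIZE 256	/* hypothetical; small on purpose */
#define ATTR_DATA 1		/* hypothetical attribute type number */

/* Hypothetical stand-in for the send context: one output buffer. */
struct send_ctx_model {
	uint8_t send_buf[SEND_BUF_SIZE];
	size_t send_size;	/* bytes used so far */
};

/* Store a 16-bit value in little-endian byte order. */
static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = (uint8_t)(v & 0xff);
	p[1] = (uint8_t)(v >> 8);
}

/*
 * Append one TLV whose payload is copied straight from `src` (standing in
 * for page cache contents) into send_buf. Returns 0, or -1 if it won't fit.
 */
static int put_data_tlv(struct send_ctx_model *sctx, const void *src,
			uint16_t len)
{
	const size_t hdr_len = 4;	/* 2-byte type + 2-byte length */
	uint8_t *p = sctx->send_buf + sctx->send_size;

	if (sctx->send_size + hdr_len + len > SEND_BUF_SIZE)
		return -1;
	put_le16(p, ATTR_DATA);
	put_le16(p + 2, len);
	memcpy(p + hdr_len, src, len);	/* the only copy of the payload */
	sctx->send_size += hdr_len + len;
	return 0;
}
```

      The payload is copied once, from its source into the output buffer,
      with no intermediate read buffer in between.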
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: get rid of i_size logic in send_write() · a9b2e0de
      Omar Sandoval authored
      send_write()/fill_read_buf() have some logic for avoiding reading past
      i_size. However, everywhere that we call
      send_write()/send_extent_data(), we've already clamped the length down
      to i_size. Get rid of the i_size handling, which simplifies the next
      change.
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_insert_clone_extent() to a more generic name · 0cbb5bdf
      Filipe Manana authored
      Now that we use the same mechanism to replace all the extents in a file
      range with either a hole, an existing extent (when cloning) or a new
      extent (when using fallocate), the name of btrfs_insert_clone_extent()
      no longer reflects its genericity.
      
      So rename it to btrfs_insert_replace_extent(), since what it does is
      to either insert an existing extent or a new extent into a file range.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename btrfs_punch_hole_range() to a more generic name · 306bfec0
      Filipe Manana authored
      The function btrfs_punch_hole_range() is now used to replace all the file
      extents in a given file range with an extent described in the given struct
      btrfs_replace_extent_info argument. This extent can either be an existing
      extent that is being cloned or it can be a new extent (namely a prealloc
      extent). When that argument is NULL it only punches a hole (drops all the
      existing extents) in the file range.
      
      So rename the function to btrfs_replace_file_extents().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: rename struct btrfs_clone_extent_info to a more generic name · bf385648
      Filipe Manana authored
      Now that we can use btrfs_clone_extent_info to convey information for a
      new prealloc extent as well, and not just for existing extents that are
      being cloned, rename it to btrfs_replace_extent_info, which reflects the
      fact that this is now more generic and it is used to replace all existing
      extents in a file range with the extent described by the structure.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove item_size member of struct btrfs_clone_extent_info · fb870f6c
      Filipe Manana authored
      The value of item_size of struct btrfs_clone_extent_info is always set to
      the size of a non-inline file extent item, and in fact the infrastructure
      that uses this structure (btrfs_punch_hole_range()) does not work with
      inline file extents at all (and it is not supposed to).
      
      So just remove that field from the structure and use directly
      sizeof(struct btrfs_file_extent_item) instead. Also assert that the
      file extent type is not inline at btrfs_insert_clone_extent().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix metadata reservation for fallocate that leads to transaction aborts · 8fccebfa
      Filipe Manana authored
      When doing an fallocate(), especially a zero range operation, we assume
      that reserving 3 units of metadata space is enough, that at most we touch
      one leaf in subvolume/fs tree for removing existing file extent items and
      inserting a new file extent item. This assumption is generally true for
      most common use cases. However when we end up needing to remove file extent
      items from multiple leaves, we can end up failing with -ENOSPC and abort
      the current transaction, turning the filesystem to RO mode. When this
      happens a stack trace like the following is dumped in dmesg/syslog:
      
      [ 1500.620934] ------------[ cut here ]------------
      [ 1500.620938] BTRFS: Transaction aborted (error -28)
      [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs]
      [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...)
      [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G        W         5.9.0-rc3-btrfs-next-67 #1
      [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs]
      [ 1500.621026] Code: 8b 40 50 f0 48 (...)
      [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286
      [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000
      [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
      [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001
      [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000
      [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60
      [ 1500.621039] FS:  00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000
      [ 1500.621041] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0
      [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1500.621049] Call Trace:
      [ 1500.621076]  btrfs_prealloc_file_range+0x10/0x20 [btrfs]
      [ 1500.621087]  btrfs_fallocate+0xccd/0x1280 [btrfs]
      [ 1500.621108]  vfs_fallocate+0x14d/0x290
      [ 1500.621112]  ksys_fallocate+0x3a/0x70
      [ 1500.621117]  __x64_sys_fallocate+0x1a/0x20
      [ 1500.621120]  do_syscall_64+0x33/0x80
      [ 1500.621123]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 1500.621126] RIP: 0033:0x7fb5b248c477
      [ 1500.621128] Code: 89 7c 24 08 (...)
      [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
      [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477
      [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003
      [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000
      [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010
      [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003
      [ 1500.621151] irq event stamp: 1026217
      [ 1500.621154] hardirqs last  enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0
      [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0
      [ 1500.621159] softirqs last  enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606
      [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20
      [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]---
      [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left
      
      When we use fallocate() internally, for reserving an extent for a space
      cache, inode cache or relocation, we can't hit this problem since either
      there aren't any file extent items to remove from the subvolume tree or
      there is at most one.
      
      When using plain fallocate() it's very unlikely, since that would require
      having many file extent items representing holes for the target range and
      crossing multiple leaves - we attempt to increase the range (merge) of such
      file extent items when punching holes, so at most we end up with 2 file
      extent items for holes at leaf boundaries.
      
      However when using the zero range operation of fallocate() for a large
      range (100+ MiB for example) that's fairly easy to trigger. The following
      example reproducer triggers the issue:
      
        $ cat reproducer.sh
        #!/bin/bash
      
        umount /dev/sdj &> /dev/null
        mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null
        mount /dev/sdj /mnt/sdj
      
        # Create a 100M file with many file extent items. Punch a hole every 8K
        # just to speedup the file creation - we could do 4K sequential writes
        # followed by fsync (or O_SYNC) as well, but that takes a lot of time.
        file_size=$((100 * 1024 * 1024))
        xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar
        for ((i = 0; i < $file_size; i += 8192)); do
            xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar
        done
      
        # Force a transaction commit, so the zero range operation will be forced
        # to COW all metadata extents it needs to touch.
        sync
      
        xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar
      
        umount /mnt/sdj
      
        $ ./reproducer.sh
        wrote 104857600/104857600 bytes at offset 0
        100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec)
        fallocate: No space left on device
      
        $ dmesg
        <shows the same stack trace pasted before>
      
      To fix this use the existing infrastructure that hole punching and
      extent cloning use for replacing a file range with another extent. This
      deals with doing the removal of file extent items and inserting the new
      one using an incremental approach, reserving more space when needed and
      always ensuring we don't leave an implicit hole in the range in case
      we need to do multiple iterations and a crash happens between iterations.
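
      The incremental approach can be pictured with a toy model in which the
      work is done in rounds and each round starts with a fresh reservation.
      This is a deliberately simplified sketch under stated assumptions: each
      dropped file extent item is assumed to cost one unit of reserved space,
      replace_range_incremental() is a hypothetical name, and the
      crash-safety handling mentioned above is not modelled at all.

```c
#include <assert.h>

/*
 * Toy model: `nr_items` file extent items must be dropped, each costing one
 * unit of reserved metadata space, and each reservation grants `units` units.
 * Returns the number of reservations (rounds) needed, or -1 on bad input.
 */
static int replace_range_incremental(int nr_items, int units)
{
	int rounds = 0;

	if (nr_items < 0 || units <= 0)
		return -1;
	do {
		int budget = units;	/* reserve space for this round */

		rounds++;
		/* Drop as many items as the current reservation allows. */
		while (nr_items > 0 && budget > 0) {
			nr_items--;
			budget--;
		}
		/* Work left but budget exhausted: loop and reserve again. */
	} while (nr_items > 0);
	return rounds;
}
```

      A single fixed reservation corresponds to stopping after the first
      round, which is enough only when the range holds few items.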
      
      A test case for fstests will follow up soon.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unused function calc_global_rsv_need_space() · a31a5876
      YueHaibing authored
      It is not used since commit 0096420a ("btrfs: do not
      account global reserve in can_overcommit").
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: move btrfs_dev_replace_update_device_in_mapping_tree to drop declaration · 0725c0c9
      Anand Jain authored
      The function is short and simple, we can get rid of the declaration as
      it's not necessary for a static function. Move it before its first
      caller.  No functional changes.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: simplify gotos in open_seed_device · c83b60c0
      Anand Jain authored
      The function does not have a common exit block and returns immediately
      so there's no point having the goto. Remove the two cases.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device() · e493e8f9
      Anand Jain authored
      We can check the argument value directly, no need for the temporary
      variable.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: remove tmp variable for list traversal in btrfs_init_dev_replace_tgtdev · 1888709d
      Anand Jain authored
      In the function btrfs_init_dev_replace_tgtdev(), the local variable
      devices is used only once, we can remove it.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use sprout device_list_mutex in btrfs_init_devices_late · e17125b5
      Anand Jain authored
      On a mounted sprout filesystem, all threads now are using the
      sprout::device_list_mutex, and this is the only code using the
      seed::device_list_mutex. This patch converts it to use the sprout's
      fs_info->fs_devices->device_list_mutex.
      
      The same reasoning holds true here, that device delete is holding
      the sprout::device_list_mutex.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e17125b5
    • Anand Jain's avatar
      btrfs: reada: lock all seed/sprout devices in __reada_start_machine · 2fca0db0
      Anand Jain authored
      On an fs mounted using a sprout device, the seed fs_devices are
      maintained in a linked list under fs_info->fs_devices. Each seed's
      fs_devices also has a device_list_mutex initialized to protect against the
      potential race with delete threads. But the delete thread (at
      btrfs_rm_device()) holds the fs_info::fs_devices::device_list_mutex,
      which is the sprout's device_list_mutex rather than the seed's
      device_list_mutex. Moreover, there aren't any significant benefits in
      using the seed::device_list_mutex instead of sprout::device_list_mutex.
      
      So this patch converts them from using the seed::device_list_mutex to
      sprout::device_list_mutex.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      2fca0db0
    • Anand Jain's avatar
      btrfs: handle errors in btrfs_sysfs_add_fs_devices · 7ad3912a
      Anand Jain authored
      btrfs_sysfs_add_fs_devices() is called by btrfs_sysfs_add_mounted().
      btrfs_sysfs_add_mounted() assumes that btrfs_sysfs_add_fs_devices() will
      either add sysfs entries for all the devices or none. So this patch lives up
      to its caller's expectation and cleans up the created sysfs entries if it
      has to fail at some device in the list.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      7ad3912a
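The all-or-nothing behaviour described in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions: `demo_device`, `demo_add_device()`, `demo_remove_device()` and `demo_add_all_devices()` are hypothetical stand-ins, not the kernel's sysfs helpers, and the `fail_add` field exists purely to force a failure.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a device; not the kernel's struct btrfs_device. */
struct demo_device {
	int devid;
	int in_sysfs;	/* 1 once its entries have been "created" */
	int fail_add;	/* knob used only to simulate a failing add */
};

/* Hypothetical per-device add; fails on request to exercise the rollback. */
static int demo_add_device(struct demo_device *dev)
{
	if (dev->fail_add)
		return -ENOMEM;
	dev->in_sysfs = 1;
	return 0;
}

/* Hypothetical per-device remove. */
static void demo_remove_device(struct demo_device *dev)
{
	dev->in_sysfs = 0;
}

/* Add entries for every device, or leave none behind on failure. */
static int demo_add_all_devices(struct demo_device *devs, size_t count)
{
	size_t i;
	int ret;

	for (i = 0; i < count; i++) {
		ret = demo_add_device(&devs[i]);
		if (ret)
			goto rollback;
	}
	return 0;

rollback:
	/* Undo only the entries created before the failing device. */
	while (i-- > 0)
		demo_remove_device(&devs[i]);
	return ret;
}
```

A caller that sees a non-zero return can then assume that no entry from this call is left behind, which is the expectation the commit message attributes to btrfs_sysfs_add_mounted().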
    • Anand Jain's avatar
      btrfs: initialize sysfs devid and device link for seed device · 30b0e4e0
      Anand Jain authored
      We don't initialize the sysfs devid kobject and device-link yet for the
      seed devices in a sprouted filesystem.
      So this patch initializes the seed device devid kobject and the device
      link in the sysfs.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      30b0e4e0
    • Anand Jain's avatar
      btrfs: split and refactor btrfs_sysfs_remove_devices_dir · 53f8a74c
      Anand Jain authored
      Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split
      btrfs_sysfs_remove_devices_dir() so that we don't have to use the device
      argument to indicate whether to free all devices or just one device.
      
      Export btrfs_sysfs_remove_device() as device operations outside of
      sysfs.c now call this instead of btrfs_sysfs_remove_devices_dir().
      
      btrfs_sysfs_remove_devices_dir() is renamed to
      btrfs_sysfs_remove_fs_devices() to suit its new role.
      
      Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices()
      so it is redeclared as static. And the same function had to be moved
      before its first caller.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      53f8a74c
    • Anand Jain's avatar
      btrfs: simplify parameters of btrfs_sysfs_add_devices_dir · cd36da2e
      Anand Jain authored
      When we add a device we need to add it to sysfs, so instead of using the
      btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to
      add a device or all of fs_devices, call the helper function
      btrfs_sysfs_add_device() directly and thus make it non-static.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      cd36da2e
    • Anand Jain's avatar
      btrfs: make btrfs_sysfs_remove_devices_dir return void · 6a416a01
      Anand Jain authored
      The return value of btrfs_sysfs_remove_devices_dir() is unused, so declare
      it as void.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6a416a01
    • Anand Jain's avatar
      btrfs: add btrfs_sysfs_remove_device helper · 985e233e
      Anand Jain authored
      btrfs_sysfs_remove_devices_dir() removes device link and devid kobject
      (sysfs entries) for a device or all the devices in the btrfs_fs_devices.
      In preparation to remove these sysfs entries for the seed as well, add
      a btrfs_sysfs_remove_device() helper function and avoid code
      duplication.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      985e233e
    • Anand Jain's avatar
      btrfs: add btrfs_sysfs_add_device helper · 178a16c9
      Anand Jain authored
      btrfs_sysfs_add_devices_dir() adds device link and devid kobject
      (sysfs entries) for a device or all the devices in the btrfs_fs_devices.
      In preparation to add these sysfs entries for the seed as well, add
      a btrfs_sysfs_add_device() helper function and avoid code duplication.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      178a16c9
    • Anand Jain's avatar
      btrfs: fix replace of seed device · c6a5d954
      Anand Jain authored
      If you replace a seed device in a sprouted fs, it appears to have
      successfully replaced the seed device, but if you look closely, it
      didn't.  Here is an example.
      
        $ mkfs.btrfs /dev/sda
        $ btrfstune -S1 /dev/sda
        $ mount /dev/sda /btrfs
        $ btrfs device add /dev/sdb /btrfs
        $ umount /btrfs
        $ btrfs device scan --forget
        $ mount -o device=/dev/sda /dev/sdb /btrfs
        $ btrfs replace start -f /dev/sda /dev/sdc /btrfs
        $ echo $?
        0
      
        BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc started
        BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc finished
      
        $ btrfs fi show
        Label: none  uuid: ab2c88b7-be81-4a7e-9849-c3666e7f9f4f
      	  Total devices 2 FS bytes used 256.00KiB
      	  devid    1 size 3.00GiB used 520.00MiB path /dev/sdc
      	  devid    2 size 3.00GiB used 896.00MiB path /dev/sdb
      
        Label: none  uuid: 10bd3202-0415-43af-96a8-d5409f310a7e
      	  Total devices 1 FS bytes used 128.00KiB
      	  devid    1 size 3.00GiB used 536.00MiB path /dev/sda
      
      So as per the replace start command and kernel log, the replace was successful.
      Now let's try a clean mount.
      
        $ umount /btrfs
        $ btrfs device scan --forget
      
        $ mount -o device=/dev/sdc /dev/sdb /btrfs
        mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error.
      
        [  636.157517] BTRFS error (device sdc): failed to read chunk tree: -2
        [  636.180177] BTRFS error (device sdc): open_ctree failed
      
      That's because, as per the dev items, it is still looking for the original
      seed device.
      
       $ btrfs inspect-internal dump-tree -d /dev/sdb
      
      	item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98
      		devid 1 total_bytes 3221225472 bytes_used 545259520
      		io_align 4096 io_width 4096 sector_size 4096 type 0
      		generation 6 start_offset 0 dev_group 0
      		seek_speed 0 bandwidth 0
      		uuid 59368f50-9af2-4b17-91da-8a783cc418d4  <--- seed uuid
      		fsid 10bd3202-0415-43af-96a8-d5409f310a7e  <--- seed fsid
      	item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98
      		devid 2 total_bytes 3221225472 bytes_used 939524096
      		io_align 4096 io_width 4096 sector_size 4096 type 0
      		generation 0 start_offset 0 dev_group 0
      		seek_speed 0 bandwidth 0
      		uuid 56a0a6bc-4630-4998-8daf-3c3030c4256a  <- sprout uuid
      		fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f <- sprout fsid
      
      But the replaced target has the following uuid+fsid in its superblock
      which doesn't match with the expected uuid+fsid in its devitem.
      
        $ btrfs in dump-super /dev/sdc | egrep '^generation|dev_item.uuid|dev_item.fsid|devid'
        generation	20
        dev_item.uuid	59368f50-9af2-4b17-91da-8a783cc418d4
        dev_item.fsid	ab2c88b7-be81-4a7e-9849-c3666e7f9f4f [match]
        dev_item.devid	1
      
      So if you provide the original seed device the mount shall be
      successful, which is what has been happening so far in the test case
      btrfs/163.
      
        $ btrfs device scan --forget
        $ mount -o device=/dev/sda /dev/sdb /btrfs
      
      Fix in this patch:
      If a seed is not sprouted then there is no replacement of it, because
      it is a read-only filesystem with a read-only device. Similarly, in the case
      of a sprouted filesystem, the seed device is still read-only. So, make
      it a rule that you can't replace a seed device; you can only add a new
      device and then delete the seed device. If a replace is attempted, it
      returns -EINVAL.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c6a5d954
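The rule stated in the fix above (a seed device cannot be the source of a replace) amounts to an early check before any replace work starts. A minimal sketch, assuming a hypothetical `demo_device` with a `seeding` flag and a hypothetical `demo_replace_check_source()`; where exactly the kernel performs this check is not shown in the commit message and is not claimed here.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device; only the fields the check needs. */
struct demo_device {
	int devid;
	bool seeding;	/* true if the device belongs to a seed filesystem */
};

/*
 * Reject a seed device as the replace source. The supported path is to add
 * a new device and then delete the seed device instead.
 */
static int demo_replace_check_source(const struct demo_device *src)
{
	if (src->seeding)
		return -EINVAL;
	return 0;
}
```

With such a check in place, the `btrfs replace start` call from the example would fail up front instead of reporting success and leaving a filesystem that cannot be mounted without the original seed.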
    • Anand Jain's avatar
      btrfs: improve device scanning messages · 79dae17d
      Anand Jain authored
      Systems booting without the initramfs seem to scan an unusual kind
      of device path (/dev/root). And at a later time, the device is updated
      to the correct path. We generally print the process name and PID of the
      process scanning the device but we don't capture the same information if
      the device path is rescanned with a different pathname.
      
      The current message is too long, so drop the unnecessary UUID and add
      process name and PID.
      
      While at it, also update the duplicate device warning to include the
      process name and PID so the messages are consistent.
      
      CC: stable@vger.kernel.org # 4.19+
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      79dae17d
    • Josef Bacik's avatar
      btrfs: pretty print leaked root name · 457f1864
      Josef Bacik authored
      I'm an actual human being so am incapable of converting u64 to s64 in my
      head, so add a helper to get the pretty name of a root objectid and use
      that helper to spit out the name for any special roots for leaked roots,
      so I don't have to scratch my head and figure out which root I messed up
      the refs for.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      457f1864
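The u64-versus-s64 problem mentioned above comes from special roots whose objectids are defined as small negative numbers stored in an unsigned 64-bit field, so they print as values close to 2^64. The sketch below shows one way such a lookup helper might look. `demo_root_name()` and the table are hypothetical, and the objectid values are believed to match the btrfs on-disk definitions but should be treated as assumptions and checked against the headers.

```c
#include <stddef.h>
#include <stdint.h>

/* One table entry: a special root objectid and a readable name. */
struct demo_root_name {
	uint64_t objectid;
	const char *name;
};

/* Assumed values; verify against the btrfs on-disk format headers. */
static const struct demo_root_name demo_root_names[] = {
	{ 1ULL, "ROOT_TREE" },
	{ 2ULL, "EXTENT_TREE" },
	{ 3ULL, "CHUNK_TREE" },
	{ 4ULL, "DEV_TREE" },
	{ 5ULL, "FS_TREE" },
	{ 7ULL, "CSUM_TREE" },
	{ (uint64_t)-6, "TREE_LOG" },		/* prints as 18446744073709551610 */
	{ (uint64_t)-8, "TREE_RELOC" },
	{ (uint64_t)-9, "DATA_RELOC_TREE" },
};

/* Return a readable name for a special root, or NULL for any other root. */
static const char *demo_root_name(uint64_t objectid)
{
	size_t i;

	for (i = 0; i < sizeof(demo_root_names) / sizeof(demo_root_names[0]); i++) {
		if (demo_root_names[i].objectid == objectid)
			return demo_root_names[i].name;
	}
	return NULL;
}
```

A leak message can then print the name when the lookup succeeds and fall back to the raw number otherwise, so ordinary subvolume roots still show their numeric id.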
    • Goldwyn Rodrigues's avatar
      btrfs: sysfs: export currently running exclusive operation · 66a2823c
      Goldwyn Rodrigues authored
      /sys/fs/<fsid>/exclusive_operation contains the currently executing
      exclusive operation. Add a sysfs_notify() when the operation ends, so
      userspace can be notified that the exclusive operation has finished.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      66a2823c
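From userspace, an attribute that is signalled with sysfs_notify() is usually consumed by reading it once, waiting for an exceptional condition on the file descriptor, then seeking back to offset 0 and reading again. The sketch below uses poll() with POLLPRI rather than the select() call named in the related commit; both helper names are hypothetical, and the attribute path is whatever the caller opened.

```c
#include <poll.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Re-read a small attribute from offset 0 and strip the trailing newline. */
static int demo_read_attr(int fd, char *buf, size_t len)
{
	ssize_t n;

	if (len == 0 || lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, len - 1);
	if (n < 0)
		return -1;
	buf[n] = '\0';
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}

/*
 * Wait for a change notification on an already-read sysfs attribute.
 * Returns 1 if an event was reported, 0 on timeout, -1 on error.
 */
static int demo_wait_for_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return ret > 0 ? 1 : 0;
}
```

A tool that wants to queue an operation could open the attribute, call demo_read_attr() to see what is running, call demo_wait_for_change() with a negative timeout to block, and then read again before retrying.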
    • Goldwyn Rodrigues's avatar
      btrfs: enumerate the type of exclusive operation in progress · c3e1f96c
      Goldwyn Rodrigues authored
      Instead of using a flag bit for exclusive operation, use a variable to
      store which exclusive operation is being performed.  Introduce an API
      to start and finish an exclusive operation.
      
      This would enable another way for tools to check which operation is
      running or why starting an exclusive operation failed. The followup
      patch adds a sysfs_notify() to alert userspace when the state changes, so
      userspace can perform select() on it to get notified of the change.
      
      This would enable us to enqueue a command which will wait for current
      exclusive operation to complete before issuing the next exclusive
      operation. This has been done synchronously as opposed to a background
      process, or else error collection (if any) will become difficult.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
      c3e1f96c
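Replacing a single "busy" flag bit with a variable that records which operation is running can be sketched as a start/finish pair around one shared value. The names below (`demo_exclop`, `demo_fs`, `demo_exclop_start()`, `demo_exclop_finish()`, `demo_exclop_running()`) are hypothetical, and the use of C11 atomics is an assumption made so the sketch runs in userspace; it says nothing about how the kernel serializes access.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Which exclusive operation, if any, is currently running. */
enum demo_exclop {
	DEMO_EXCLOP_NONE = 0,
	DEMO_EXCLOP_BALANCE,
	DEMO_EXCLOP_DEV_ADD,
	DEMO_EXCLOP_RESIZE,
};

/* Hypothetical filesystem state holding the current operation. */
struct demo_fs {
	_Atomic int exclusive_operation;
};

/* Claim the exclusive slot; fails if any operation is already running. */
static bool demo_exclop_start(struct demo_fs *fs, enum demo_exclop type)
{
	int expected = DEMO_EXCLOP_NONE;

	return atomic_compare_exchange_strong(&fs->exclusive_operation,
					      &expected, (int)type);
}

/* Release the slot so the next operation can start. */
static void demo_exclop_finish(struct demo_fs *fs)
{
	atomic_store(&fs->exclusive_operation, DEMO_EXCLOP_NONE);
}

/* Report what is running, e.g. to explain why a start attempt failed. */
static enum demo_exclop demo_exclop_running(struct demo_fs *fs)
{
	return (enum demo_exclop)atomic_load(&fs->exclusive_operation);
}
```

Compared with a flag bit, a failed start can now be explained by naming the operation that holds the slot, which is the extra information the commit message wants to expose to tools.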
    • Josef Bacik's avatar
      btrfs: sysfs: init devices outside of the chunk_mutex · ca10845a
      Josef Bacik authored
      While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the
      following lockdep splat:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        5.9.0-rc3+ #4 Not tainted
        ------------------------------------------------------
        kswapd0/100 is trying to acquire lock:
        ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330
      
        but task is already holding lock:
        ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (fs_reclaim){+.+.}-{0:0}:
      	 fs_reclaim_acquire+0x65/0x80
      	 slab_pre_alloc_hook.constprop.0+0x20/0x200
      	 kmem_cache_alloc+0x37/0x270
      	 alloc_inode+0x82/0xb0
      	 iget_locked+0x10d/0x2c0
      	 kernfs_get_inode+0x1b/0x130
      	 kernfs_get_tree+0x136/0x240
      	 sysfs_get_tree+0x16/0x40
      	 vfs_get_tree+0x28/0xc0
      	 path_mount+0x434/0xc00
      	 __x64_sys_mount+0xe3/0x120
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #2 (kernfs_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 kernfs_add_one+0x23/0x150
      	 kernfs_create_link+0x63/0xa0
      	 sysfs_do_create_link_sd+0x5e/0xd0
      	 btrfs_sysfs_add_devices_dir+0x81/0x130
      	 btrfs_init_new_device+0x67f/0x1250
      	 btrfs_ioctl+0x1ef/0x2e20
      	 __x64_sys_ioctl+0x83/0xb0
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}:
      	 __mutex_lock+0x7e/0x7e0
      	 btrfs_chunk_alloc+0x125/0x3a0
      	 find_free_extent+0xdf6/0x1210
      	 btrfs_reserve_extent+0xb3/0x1b0
      	 btrfs_alloc_tree_block+0xb0/0x310
      	 alloc_tree_block_no_bg_flush+0x4a/0x60
      	 __btrfs_cow_block+0x11a/0x530
      	 btrfs_cow_block+0x104/0x220
      	 btrfs_search_slot+0x52e/0x9d0
      	 btrfs_insert_empty_items+0x64/0xb0
      	 btrfs_insert_delayed_items+0x90/0x4f0
      	 btrfs_commit_inode_delayed_items+0x93/0x140
      	 btrfs_log_inode+0x5de/0x2020
      	 btrfs_log_inode_parent+0x429/0xc90
      	 btrfs_log_new_name+0x95/0x9b
      	 btrfs_rename2+0xbb9/0x1800
      	 vfs_rename+0x64f/0x9f0
      	 do_renameat2+0x320/0x4e0
      	 __x64_sys_rename+0x1f/0x30
      	 do_syscall_64+0x33/0x40
      	 entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        -> #0 (&delayed_node->mutex){+.+.}-{3:3}:
      	 __lock_acquire+0x119c/0x1fc0
      	 lock_acquire+0xa7/0x3d0
      	 __mutex_lock+0x7e/0x7e0
      	 __btrfs_release_delayed_node.part.0+0x3f/0x330
      	 btrfs_evict_inode+0x24c/0x500
      	 evict+0xcf/0x1f0
      	 dispose_list+0x48/0x70
      	 prune_icache_sb+0x44/0x50
      	 super_cache_scan+0x161/0x1e0
      	 do_shrink_slab+0x178/0x3c0
      	 shrink_slab+0x17c/0x290
      	 shrink_node+0x2b2/0x6d0
      	 balance_pgdat+0x30a/0x670
      	 kswapd+0x213/0x4c0
      	 kthread+0x138/0x160
      	 ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Chain exists of:
          &delayed_node->mutex --> kernfs_mutex --> fs_reclaim
      
         Possible unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(fs_reclaim);
      				 lock(kernfs_mutex);
      				 lock(fs_reclaim);
          lock(&delayed_node->mutex);
      
         *** DEADLOCK ***
      
        3 locks held by kswapd0/100:
         #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30
         #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290
         #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0
      
        stack backtrace:
        CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        Call Trace:
         dump_stack+0x8b/0xb8
         check_noncircular+0x12d/0x150
         __lock_acquire+0x119c/0x1fc0
         lock_acquire+0xa7/0x3d0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         __mutex_lock+0x7e/0x7e0
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? __btrfs_release_delayed_node.part.0+0x3f/0x330
         ? lock_acquire+0xa7/0x3d0
         ? find_held_lock+0x2b/0x80
         __btrfs_release_delayed_node.part.0+0x3f/0x330
         btrfs_evict_inode+0x24c/0x500
         evict+0xcf/0x1f0
         dispose_list+0x48/0x70
         prune_icache_sb+0x44/0x50
         super_cache_scan+0x161/0x1e0
         do_shrink_slab+0x178/0x3c0
         shrink_slab+0x17c/0x290
         shrink_node+0x2b2/0x6d0
         balance_pgdat+0x30a/0x670
         kswapd+0x213/0x4c0
         ? _raw_spin_unlock_irqrestore+0x41/0x50
         ? add_wait_queue_exclusive+0x70/0x70
         ? balance_pgdat+0x670/0x670
         kthread+0x138/0x160
         ? kthread_create_worker_on_cpu+0x40/0x40
         ret_from_fork+0x1f/0x30
      
      This happens because we are holding the chunk_mutex at the time of
      adding in a new device.  However we only need to hold the
      device_list_mutex, as we're going to iterate over the fs_devices
      devices.  Move the sysfs init stuff outside of the chunk_mutex to get
      rid of this lockdep splat.
      
      CC: stable@vger.kernel.org # 4.4.x: f3cd2c58: btrfs: sysfs, rename device_link add/remove functions
      CC: stable@vger.kernel.org # 4.4.x
      Reported-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      ca10845a
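The fix above is a change of lock scope: the sysfs work moves out of the region covered by chunk_mutex while staying under device_list_mutex. The sketch below models that ordering with toy locks that only record whether they are held. Every name in it is hypothetical, and it illustrates the ordering only, not the kernel code path.

```c
#include <stdbool.h>

/* Toy lock that only tracks whether it is held; stands in for a mutex. */
struct demo_lock {
	bool held;
};

static void demo_lock_acquire(struct demo_lock *l) { l->held = true; }
static void demo_lock_release(struct demo_lock *l) { l->held = false; }

struct demo_fs {
	struct demo_lock device_list_mutex;
	struct demo_lock chunk_mutex;
	int nr_devices;			/* state guarded by chunk_mutex here */
	int sysfs_entries;
	bool sysfs_ran_under_chunk_mutex;
};

/* Stand-in for the sysfs init; records if chunk_mutex was still held. */
static void demo_sysfs_add_device(struct demo_fs *fs)
{
	if (fs->chunk_mutex.held)
		fs->sysfs_ran_under_chunk_mutex = true;
	fs->sysfs_entries++;
}

/* Add a device: allocation state under both locks, sysfs under one. */
static void demo_init_new_device(struct demo_fs *fs)
{
	demo_lock_acquire(&fs->device_list_mutex);
	demo_lock_acquire(&fs->chunk_mutex);
	fs->nr_devices++;
	demo_lock_release(&fs->chunk_mutex);

	/* sysfs work only needs the device list to stay stable. */
	demo_sysfs_add_device(fs);
	demo_lock_release(&fs->device_list_mutex);
}
```

Because the sysfs step no longer runs with chunk_mutex held, the chunk_mutex to kernfs_mutex dependency recorded in the splat (chain entry #2) should no longer be created from this path.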
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar
    • Nikolay Borisov's avatar