1. 18 Dec, 2020 10 commits
    • Filipe Manana's avatar
      btrfs: fix transaction leak and crash after cleaning up orphans on RO mount · 638331fa
      Filipe Manana authored
      When we delete a root (subvolume or snapshot), at the very end of the
      operation, we attempt to remove the root's orphan item from the root tree,
      at btrfs_drop_snapshot(), by calling btrfs_del_orphan_item(). We ignore any
      error from btrfs_del_orphan_item() since it is not a serious problem and
      the next time the filesystem is mounted we remove such stray orphan items
      at btrfs_find_orphan_roots().
      
      However if the filesystem is mounted RO and we have stray orphan items for
      any previously deleted root, we can end up leaking a transaction and other
      data structures when unmounting the filesystem, as well as crashing if we
      do not have hardware acceleration for crc32c available.
      
      The steps that lead to the transaction leak are the following:
      
      1) The filesystem is mounted in RW mode;
      
      2) A subvolume is deleted;
      
      3) When the cleaner kthread runs btrfs_drop_snapshot() to delete the root,
         it gets a failure at btrfs_del_orphan_item(), which is ignored, due to
         an ENOMEM when allocating a path for example. So the orphan item for
         the root remains in the root tree;
      
      4) The filesystem is unmounted;
      
      5) The filesystem is mounted RO (-o ro). During the mount path we call
         btrfs_find_orphan_roots(), which iterates the root tree searching for
         orphan items. It finds the orphan item for our deleted root, and since
         it can not find the root, it starts a transaction to delete the orphan
         item (by calling btrfs_del_orphan_item());
      
      6) The RO mount completes;
      
      7) Before the transaction kthread commits the transaction created for
         deleting the orphan item (i.e. less than 30 seconds elapsed since the
         mount, the default commit interval), a filesystem unmount operation is
         started;
      
      8) At close_ctree(), we stop the transaction kthread, but we still have a
         transaction open with at least one dirty extent buffer, a leaf for the
         tree root which was COWed when deleting the orphan item;
      
      9) We then proceed to destroy the work queues, free the roots and block
         groups, etc. After that we drop the last reference on the btree inode by
         calling iput() on it. Since there are dirty pages for the btree inode,
         corresponding to the COWed extent buffer, btree_write_cache_pages() is
         invoked to flush those dirty pages. This results in creating a bio and
         submitting it, which makes us end up at btrfs_submit_metadata_bio();
      
      10) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
          that calls btrfs_wq_submit_bio(), because check_async_write() returned
          a value of 1. This value of 1 is because we did not have hardware
          acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
          set in fs_info->flags;
      
      11) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
          workqueue at fs_info->workers, which was already freed before by the
          call to btrfs_stop_all_workers() at close_ctree(). This results in an
          invalid memory access due to a use-after-free, leading to a crash.
      
      When this happens, before the crash there are several warnings triggered,
      since we have reserved metadata space in a block group, the delayed refs
      reservation, etc:
      
       ------------[ cut here ]------------
       WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
       Code: f0 01 00 00 48 39 c2 75 (...)
       RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
       RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
       RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
       RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
       R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
       FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
        close_ctree+0x2ba/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f15ee221ee7
       Code: ff 0b 00 f7 d8 64 89 01 48 (...)
       RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
       RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
       R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace dd74718fef1ed5c6 ]---
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
       Code: 48 83 bb b0 03 00 00 00 (...)
       RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
       RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
       RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
       R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
       FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
        close_ctree+0x2ba/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f15ee221ee7
       Code: ff 0b 00 f7 d8 64 89 01 (...)
       RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
       RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
       R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace dd74718fef1ed5c7 ]---
       ------------[ cut here ]------------
       WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
       Code: ad de 49 be 22 01 00 (...)
       RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
       RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
       RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
       R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
       FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        close_ctree+0x2ba/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f15ee221ee7
       Code: ff 0b 00 f7 d8 64 89 (...)
       RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
       RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
       R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
       irq event stamp: 0
       hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
       softirqs last disabled at (0): [<0000000000000000>] 0x0
       ---[ end trace dd74718fef1ed5c8 ]---
       BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
       BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
       BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
       BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
      
      And the crash, which only happens when we do not have crc32c hardware
      acceleration, produces the following trace immediately after those
      warnings:
      
       stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
       CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
       Code: 54 55 53 48 89 f3 (...)
       RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
       RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
       RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
       RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
       R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
       FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
        btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
        submit_one_bio+0x61/0x70 [btrfs]
        btree_write_cache_pages+0x414/0x450 [btrfs]
        ? kobject_put+0x9a/0x1d0
        ? trace_hardirqs_on+0x1b/0xf0
        ? _raw_spin_unlock_irqrestore+0x3c/0x60
        ? free_debug_processing+0x1e1/0x2b0
        do_writepages+0x43/0xe0
        ? lock_acquired+0x199/0x490
        __writeback_single_inode+0x59/0x650
        writeback_single_inode+0xaf/0x120
        write_inode_now+0x94/0xd0
        iput+0x187/0x2b0
        close_ctree+0x2c6/0x2fa [btrfs]
        generic_shutdown_super+0x6c/0x100
        kill_anon_super+0x14/0x30
        btrfs_kill_super+0x12/0x20 [btrfs]
        deactivate_locked_super+0x31/0x70
        cleanup_mnt+0x100/0x160
        task_work_run+0x68/0xb0
        exit_to_user_mode_prepare+0x1bb/0x1c0
        syscall_exit_to_user_mode+0x4b/0x260
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f3cfebabee7
       Code: ff 0b 00 f7 d8 64 89 01 (...)
       RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
       RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
       RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
       R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
       R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
       Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
       ---[ end trace dd74718fef1ed5cc ]---
      
      Finally when we remove the btrfs module (rmmod btrfs), there are several
      warnings about objects that were allocated from our slabs but were never
      freed, consequence of the transaction that was never committed and got
      leaked:
       =============================================================================
       BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
       -----------------------------------------------------------------------------
      
       INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        slab_err+0xb7/0xdc
        ? lock_acquired+0x199/0x490
        __kmem_cache_shutdown+0x1ac/0x3c0
        ? lock_release+0x20e/0x4c0
        kmem_cache_destroy+0x55/0x120
        btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       INFO: Object 0x0000000050cbdd61 @offset=12104
       INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
       INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              sync_filesystem+0x74/0x90
              generic_shutdown_super+0x22/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       INFO: Object 0x0000000086e9b0ff @offset=12776
       INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
       INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
              commit_cowonly_roots+0x248/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        kmem_cache_destroy+0x119/0x120
        btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       =============================================================================
       BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
       -----------------------------------------------------------------------------
      
       INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
       CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        slab_err+0xb7/0xdc
        ? lock_acquired+0x199/0x490
        __kmem_cache_shutdown+0x1ac/0x3c0
        ? lock_release+0x20e/0x4c0
        kmem_cache_destroy+0x55/0x120
        btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       INFO: Object 0x000000001a340018 @offset=4408
       INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_free_tree_block+0x128/0x360 [btrfs]
              __btrfs_cow_block+0x489/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
       INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              btrfs_commit_transaction+0x60/0xc40 [btrfs]
              create_subvol+0x56a/0x990 [btrfs]
              btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
              __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
              btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
              btrfs_ioctl+0x1a92/0x36f0 [btrfs]
              __x64_sys_ioctl+0x83/0xb0
              do_syscall_64+0x33/0x80
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       INFO: Object 0x000000002b46292a @offset=13648
       INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
              btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
       INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        kmem_cache_destroy+0x119/0x120
        btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       =============================================================================
       BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
       -----------------------------------------------------------------------------
      
       INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
       CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        slab_err+0xb7/0xdc
        ? lock_acquired+0x199/0x490
        __kmem_cache_shutdown+0x1ac/0x3c0
        ? __mutex_unlock_slowpath+0x45/0x2a0
        kmem_cache_destroy+0x55/0x120
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 f5 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       INFO: Object 0x000000004cf95ea8 @offset=6264
       INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
              __slab_alloc.isra.0+0x109/0x1c0
              kmem_cache_alloc+0x7bb/0x830
              btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
              alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
              __btrfs_cow_block+0x12d/0x5f0 [btrfs]
              btrfs_cow_block+0xf7/0x220 [btrfs]
              btrfs_search_slot+0x62a/0xc40 [btrfs]
              btrfs_del_orphan_item+0x65/0xd0 [btrfs]
              btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
              open_ctree+0x125a/0x18a0 [btrfs]
              btrfs_mount_root.cold+0x13/0xed [btrfs]
              legacy_get_tree+0x30/0x60
              vfs_get_tree+0x28/0xe0
              fc_mount+0xe/0x40
              vfs_kern_mount.part.0+0x71/0x90
              btrfs_mount+0x13b/0x3e0 [btrfs]
       INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
              kmem_cache_free+0x34c/0x3c0
              __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
              btrfs_run_delayed_refs+0x81/0x210 [btrfs]
              commit_cowonly_roots+0xfb/0x300 [btrfs]
              btrfs_commit_transaction+0x367/0xc40 [btrfs]
              close_ctree+0x113/0x2fa [btrfs]
              generic_shutdown_super+0x6c/0x100
              kill_anon_super+0x14/0x30
              btrfs_kill_super+0x12/0x20 [btrfs]
              deactivate_locked_super+0x31/0x70
              cleanup_mnt+0x100/0x160
              task_work_run+0x68/0xb0
              exit_to_user_mode_prepare+0x1bb/0x1c0
              syscall_exit_to_user_mode+0x4b/0x260
              entry_SYSCALL_64_after_hwframe+0x44/0xa9
       kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
       CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Call Trace:
        dump_stack+0x8d/0xb5
        kmem_cache_destroy+0x119/0x120
        exit_btrfs_fs+0xa/0x59 [btrfs]
        __x64_sys_delete_module+0x194/0x260
        ? fpregs_assert_state_consistent+0x1e/0x40
        ? exit_to_user_mode_prepare+0x55/0x1c0
        ? trace_hardirqs_on+0x1b/0xf0
        do_syscall_64+0x33/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7f693e305897
       Code: 73 01 c3 48 8b 0d f9 (...)
       RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
       RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
       RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
       RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
       R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
       R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
       BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
      
      So fix this by calling btrfs_find_orphan_roots() in the mount path only if
      we are mounting the filesystem in RW mode. It's pointless to have it called
      for RO mounts anyway, since despite adding any deleted roots to the list of
      dead roots, we will never have the roots deleted until the filesystem is
      remounted in RW mode, as the cleaner kthread does nothing when we are
      mounted in RO - btrfs_need_cleaner_sleep() always returns true and the
      cleaner spends all time sleeping, never cleaning dead roots.
      
      This is accomplished by moving the call to btrfs_find_orphan_roots() from
      open_ctree() to btrfs_start_pre_rw_mount(), which also guarantees that
      if later the filesystem is remounted RW, we populate the list of dead
      roots and have the cleaner task delete the dead roots.
      Tested-by: default avatarFabian Vogt <fvogt@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      638331fa
    • Filipe Manana's avatar
      btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan · cb13eea3
      Filipe Manana authored
      If we remount a filesystem in RO mode while the qgroup rescan worker is
      running, we can end up having it still running after the remount is done,
      and at unmount time we may end up with an open transaction that ends up
      never getting committed. If that happens we end up with several memory
      leaks and can crash when hardware acceleration is unavailable for crc32c.
      Possibly it can lead to other nasty surprises too, due to use-after-free
      issues.
      
      The following steps explain how the problem happens.
      
      1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
         running;
      
      2) We remount the filesystem in RO mode, and never stop/pause the rescan
         worker, so after the remount the rescan worker is still running. The
         important detail here is that the rescan task is still running after
         the remount operation committed any ongoing transaction through its
         call to btrfs_commit_super();
      
      3) The rescan is still running, and after the remount completed, the
         rescan worker started a transaction, after it finished iterating all
         leaves of the extent tree, to update the qgroup status item in the
         quotas tree. It does not commit the transaction, it only releases its
         handle on the transaction;
      
      4) A filesystem unmount operation starts shortly after;
      
      5) The unmount task, at close_ctree(), stops the transaction kthread,
         which had not had a chance to commit the open transaction since it was
         sleeping and the commit interval (default of 30 seconds) has not yet
         elapsed since the last time it committed a transaction;
      
      6) So after stopping the transaction kthread we still have the transaction
         used to update the qgroup status item open. At close_ctree(), when the
         filesystem is in RO mode and no transaction abort happened (or the
         filesystem is in error mode), we do not expect to have any transaction
         open, so we do not call btrfs_commit_super();
      
      7) We then proceed to destroy the work queues, free the roots and block
         groups, etc. After that we drop the last reference on the btree inode
         by calling iput() on it. Since there are dirty pages for the btree
         inode, corresponding to the COWed extent buffer for the quotas btree,
         btree_write_cache_pages() is invoked to flush those dirty pages. This
         results in creating a bio and submitting it, which makes us end up at
         btrfs_submit_metadata_bio();
      
      8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
         that calls btrfs_wq_submit_bio(), because check_async_write() returned
         a value of 1. This value of 1 is because we did not have hardware
         acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
         set in fs_info->flags;
      
      9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
         workqueue at fs_info->workers, which was already freed before by the
         call to btrfs_stop_all_workers() at close_ctree(). This results in an
         invalid memory access due to a use-after-free, leading to a crash.
      
      When this happens, before the crash there are several warnings triggered,
      since we have reserved metadata space in a block group, the delayed refs
      reservation, etc:
      
        ------------[ cut here ]------------
        WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 4 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
        Code: f0 01 00 00 48 39 c2 75 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
        RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
        RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 48 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c6 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 2 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
        Code: 48 83 bb b0 03 00 00 00 (...)
        RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
        RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
        RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c7 ]---
        ------------[ cut here ]------------
        WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        CPU: 5 PID: 1729896 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
        Code: ad de 49 be 22 01 00 (...)
        RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
        RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
        RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
        R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
        FS:  00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         close_ctree+0x2ba/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f15ee221ee7
        Code: ff 0b 00 f7 d8 64 89 (...)
        RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
        RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
        R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
        irq event stamp: 0
        hardirqs last  enabled at (0): [<0000000000000000>] 0x0
        hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last  enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
        softirqs last disabled at (0): [<0000000000000000>] 0x0
        ---[ end trace dd74718fef1ed5c8 ]---
        BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
        BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
        BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
        BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
      
      And the crash, which only happens when we do not have crc32c hardware
      acceleration, produces the following trace immediately after those
      warnings:
      
        stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
        CPU: 2 PID: 1749129 Comm: umount Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
        Code: 54 55 53 48 89 f3 (...)
        RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
        RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
        RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
        R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
        FS:  00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
         btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
         submit_one_bio+0x61/0x70 [btrfs]
         btree_write_cache_pages+0x414/0x450 [btrfs]
         ? kobject_put+0x9a/0x1d0
         ? trace_hardirqs_on+0x1b/0xf0
         ? _raw_spin_unlock_irqrestore+0x3c/0x60
         ? free_debug_processing+0x1e1/0x2b0
         do_writepages+0x43/0xe0
         ? lock_acquired+0x199/0x490
         __writeback_single_inode+0x59/0x650
         writeback_single_inode+0xaf/0x120
         write_inode_now+0x94/0xd0
         iput+0x187/0x2b0
         close_ctree+0x2c6/0x2fa [btrfs]
         generic_shutdown_super+0x6c/0x100
         kill_anon_super+0x14/0x30
         btrfs_kill_super+0x12/0x20 [btrfs]
         deactivate_locked_super+0x31/0x70
         cleanup_mnt+0x100/0x160
         task_work_run+0x68/0xb0
         exit_to_user_mode_prepare+0x1bb/0x1c0
         syscall_exit_to_user_mode+0x4b/0x260
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f3cfebabee7
        Code: ff 0b 00 f7 d8 64 89 01 (...)
        RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
        RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
        RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
        RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
        R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
        R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
        Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
        ---[ end trace dd74718fef1ed5cc ]---
      
      Finally when we remove the btrfs module (rmmod btrfs), there are several
      warnings about objects that were allocated from our slabs but were never
      freed, consequence of the transaction that was never committed and got
      leaked:
      
        =============================================================================
        BUG btrfs_delayed_ref_head (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x0000000050cbdd61 @offset=12104
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
      	btrfs_free_tree_block+0x128/0x360 [btrfs]
      	__btrfs_cow_block+0x489/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
      	btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	commit_cowonly_roots+0xfb/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	sync_filesystem+0x74/0x90
      	generic_shutdown_super+0x22/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x0000000086e9b0ff @offset=12776
        INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
      	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
      	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
      	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
      	commit_cowonly_roots+0x248/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	close_ctree+0x113/0x2fa [btrfs]
      	generic_shutdown_super+0x6c/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_tree_ref (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? lock_release+0x20e/0x4c0
         kmem_cache_destroy+0x55/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000001a340018 @offset=4408
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
      	btrfs_free_tree_block+0x128/0x360 [btrfs]
      	__btrfs_cow_block+0x489/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
      	btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	btrfs_commit_transaction+0x60/0xc40 [btrfs]
      	create_subvol+0x56a/0x990 [btrfs]
      	btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
      	__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
      	btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
      	btrfs_ioctl+0x1a92/0x36f0 [btrfs]
      	__x64_sys_ioctl+0x83/0xb0
      	do_syscall_64+0x33/0x80
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        INFO: Object 0x000000002b46292a @offset=13648
        INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
      	btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
      	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
      	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
        INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	commit_cowonly_roots+0xfb/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	close_ctree+0x113/0x2fa [btrfs]
      	generic_shutdown_super+0x6c/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        =============================================================================
        BUG btrfs_delayed_extent_op (Tainted: G    B   W        ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
        CPU: 5 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         slab_err+0xb7/0xdc
         ? lock_acquired+0x199/0x490
         __kmem_cache_shutdown+0x1ac/0x3c0
         ? __mutex_unlock_slowpath+0x45/0x2a0
         kmem_cache_destroy+0x55/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 f5 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        INFO: Object 0x000000004cf95ea8 @offset=6264
        INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
      	__slab_alloc.isra.0+0x109/0x1c0
      	kmem_cache_alloc+0x7bb/0x830
      	btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
      	alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
      	__btrfs_cow_block+0x12d/0x5f0 [btrfs]
      	btrfs_cow_block+0xf7/0x220 [btrfs]
      	btrfs_search_slot+0x62a/0xc40 [btrfs]
      	btrfs_del_orphan_item+0x65/0xd0 [btrfs]
      	btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
      	open_ctree+0x125a/0x18a0 [btrfs]
      	btrfs_mount_root.cold+0x13/0xed [btrfs]
      	legacy_get_tree+0x30/0x60
      	vfs_get_tree+0x28/0xe0
      	fc_mount+0xe/0x40
      	vfs_kern_mount.part.0+0x71/0x90
      	btrfs_mount+0x13b/0x3e0 [btrfs]
        INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
      	kmem_cache_free+0x34c/0x3c0
      	__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
      	btrfs_run_delayed_refs+0x81/0x210 [btrfs]
      	commit_cowonly_roots+0xfb/0x300 [btrfs]
      	btrfs_commit_transaction+0x367/0xc40 [btrfs]
      	close_ctree+0x113/0x2fa [btrfs]
      	generic_shutdown_super+0x6c/0x100
      	kill_anon_super+0x14/0x30
      	btrfs_kill_super+0x12/0x20 [btrfs]
      	deactivate_locked_super+0x31/0x70
      	cleanup_mnt+0x100/0x160
      	task_work_run+0x68/0xb0
      	exit_to_user_mode_prepare+0x1bb/0x1c0
      	syscall_exit_to_user_mode+0x4b/0x260
      	entry_SYSCALL_64_after_hwframe+0x44/0xa9
        kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
        CPU: 3 PID: 1729921 Comm: rmmod Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
        Call Trace:
         dump_stack+0x8d/0xb5
         kmem_cache_destroy+0x119/0x120
         exit_btrfs_fs+0xa/0x59 [btrfs]
         __x64_sys_delete_module+0x194/0x260
         ? fpregs_assert_state_consistent+0x1e/0x40
         ? exit_to_user_mode_prepare+0x55/0x1c0
         ? trace_hardirqs_on+0x1b/0xf0
         do_syscall_64+0x33/0x80
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f693e305897
        Code: 73 01 c3 48 8b 0d f9 (...)
        RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
        RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
        RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
        RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
        R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
        R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
        BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
      
      Fix this issue by having the remount path stop the qgroup rescan worker
      when we are remounting RO and teach the rescan worker to stop when a
      remount is in progress. If later a remount in RW mode happens, we are
      already resuming the qgroup rescan worker through the call to
      btrfs_qgroup_rescan_resume(), so we do not need to worry about that.
      Tested-by: default avatarFabian Vogt <fvogt@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      cb13eea3
    • Pavel Begunkov's avatar
      btrfs: merge critical sections of discard lock in workfn · 8fc05859
      Pavel Begunkov authored
      btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
      a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
      ktime.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8fc05859
    • Pavel Begunkov's avatar
      btrfs: fix racy access to discard_ctl data · 1ea2872f
      Pavel Begunkov authored
      Because only one discard worker may be running at any given point, it
      could have been safe to modify ->prev_discard, etc. without
      synchronization, if not for @override flag in
      btrfs_discard_schedule_work() and delayed_work_pending() returning false
      while workfn is running.
      
      That may lead to torn reads of u64 for some architectures, but that's
      not a big problem as only slightly affects the discard rate.
      Suggested-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1ea2872f
    • Pavel Begunkov's avatar
      btrfs: fix async discard stall · ea9ed87c
      Pavel Begunkov authored
      Might happen that bg->discard_eligible_time was changed without
      rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
      time, peek_discard_list() returns NULL, and all work halts and goes to
      sleep without further rescheduling even there are block groups to
      discard.
      
      It happens pretty often, but not so visible from the userspace because
      after some time it usually will be kicked off anyway by someone else
      calling btrfs_discard_reschedule_work().
      
      Fix it by continue rescheduling if block group discard lists are not
      empty.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ea9ed87c
    • Josef Bacik's avatar
      btrfs: tests: initialize test inodes location · 675a4fc8
      Josef Bacik authored
      I noticed that sometimes the module failed to load because the self
      tests failed like this:
      
        BTRFS: selftest: fs/btrfs/tests/inode-tests.c:963 miscount, wanted 1, got 0
      
      This turned out to be because sometimes the btrfs ino would be the btree
      inode number, and thus we'd skip calling the set extent delalloc bit
      helper, and thus not adjust ->outstanding_extents.
      
      Fix this by making sure we initialize test inodes with a valid inode
      number so that we don't get random failures during self tests.
      Signed-off-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      675a4fc8
    • Filipe Manana's avatar
      btrfs: send: fix wrong file path when there is an inode with a pending rmdir · 0b3f407e
      Filipe Manana authored
      When doing an incremental send, if we have a new inode that happens to
      have the same number that an old directory inode had in the base snapshot
      and that old directory has a pending rmdir operation, we end up computing
      a wrong path for the new inode, causing the receiver to fail.
      
      Example reproducer:
      
        $ cat test-send-rmdir.sh
        #!/bin/bash
      
        DEV=/dev/sdi
        MNT=/mnt/sdi
      
        mkfs.btrfs -f $DEV >/dev/null
        mount $DEV $MNT
      
        mkdir $MNT/dir
        touch $MNT/dir/file1
        touch $MNT/dir/file2
        touch $MNT/dir/file3
      
        # Filesystem looks like:
        #
        # .                                     (ino 256)
        # |----- dir/                           (ino 257)
        #         |----- file1                  (ino 258)
        #         |----- file2                  (ino 259)
        #         |----- file3                  (ino 260)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap1
        btrfs send -f /tmp/snap1.send $MNT/snap1
      
        # Now remove our directory and all its files.
        rm -fr $MNT/dir
      
        # Unmount the filesystem and mount it again. This is to ensure that
        # the next inode that is created ends up with the same inode number
        # that our directory "dir" had, 257, which is the first free "objectid"
        # available after mounting again the filesystem.
        umount $MNT
        mount $DEV $MNT
      
        # Now create a new file (it could be a directory as well).
        touch $MNT/newfile
      
        # Filesystem now looks like:
        #
        # .                                     (ino 256)
        # |----- newfile                        (ino 257)
        #
      
        btrfs subvolume snapshot -r $MNT $MNT/snap2
        btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2
      
        # Now unmount the filesystem, create a new one, mount it and try to apply
        # both send streams to recreate both snapshots.
        umount $DEV
      
        mkfs.btrfs -f $DEV >/dev/null
      
        mount $DEV $MNT
      
        btrfs receive -f /tmp/snap1.send $MNT
        btrfs receive -f /tmp/snap2.send $MNT
      
        umount $MNT
      
      When running the test, the receive operation for the incremental stream
      fails:
      
        $ ./test-send-rmdir.sh
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
        At subvol /mnt/sdi/snap1
        Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
        At subvol /mnt/sdi/snap2
        At subvol snap1
        At snapshot snap2
        ERROR: chown o257-9-0 failed: No such file or directory
      
      So fix this by tracking directories that have a pending rmdir by inode
      number and generation number, instead of only inode number.
      
      A test case for fstests follows soon.
      Reported-by: default avatarMassimo B. <massimo.b@gmx.net>
      Tested-by: default avatarMassimo B. <massimo.b@gmx.net>
      Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
      CC: stable@vger.kernel.org # 4.19+
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      0b3f407e
    • Qu Wenruo's avatar
      btrfs: qgroup: don't try to wait flushing if we're already holding a transaction · ae5e070e
      Qu Wenruo authored
      There is a chance of racing for qgroup flushing which may lead to
      deadlock:
      
      	Thread A		|	Thread B
         (not holding trans handle)	|  (holding a trans handle)
      --------------------------------+--------------------------------
      __btrfs_qgroup_reserve_meta()   | __btrfs_qgroup_reserve_meta()
      |- try_flush_qgroup()		| |- try_flush_qgroup()
         |- QGROUP_FLUSHING bit set   |    |
         |				|    |- test_and_set_bit()
         |				|    |- wait_event()
         |- btrfs_join_transaction()	|
         |- btrfs_commit_transaction()|
      
      			!!! DEAD LOCK !!!
      
      Since thread A wants to commit transaction, but thread B is holding a
      transaction handle, blocking the commit.
      At the same time, thread B is waiting for thread A to finish its commit.
      
      This is just a hot fix, and would lead to more EDQUOT when we're near
      the qgroup limit.
      
      The proper fix would be to make all metadata/data reservations happen
      without holding a transaction handle.
      
      CC: stable@vger.kernel.org # 5.9+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      ae5e070e
    • ethanwu's avatar
      btrfs: correctly calculate item size used when item key collision happens · 9a664971
      ethanwu authored
      Item key collision is allowed for some item types, like dir item and
      inode refs, but the overall item size is limited by the nodesize.
      
      item size(ins_len) passed from btrfs_insert_empty_items to
      btrfs_search_slot already contains size of btrfs_item.
      
      When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
      The check incorrectly reports that split leaf is required, because
      it treats the space required by the newly inserted item as
      btrfs_item + item data. But in item key collision case, only item data
      is actually needed, the newly inserted item could merge into the existing
      one. No new btrfs_item will be inserted.
      
      And split_leaf return EOVERFLOW from following code:
      
        if (extend && data_size + btrfs_item_size_nr(l, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
            return -EOVERFLOW;
      
      In most cases, when callers receive EOVERFLOW, they either return
      this error or handle in different ways. For example, in normal dir item
      creation the userspace will get errno EOVERFLOW; in inode ref case
      INODE_EXTREF is used instead.
      
      However, this is not the case for rename. To avoid the unrecoverable
      situation in rename, btrfs_check_dir_item_collision is called in
      early phase of rename. In this function, when item key collision is
      detected leaf space is checked:
      
        data_size = sizeof(*di) + name_len;
        if (data_size + btrfs_item_size_nr(leaf, slot) +
            sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
      
      the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
      refers to existing item size, the condition here correctly calculates
      the needed size for collision case rather than the wrong case above.
      
      The consequence of inconsistent condition check between
      btrfs_check_dir_item_collision and btrfs_search_slot when item key
      collision happens is that we might pass check here but fail
      later at btrfs_search_slot. Rename fails and volume is forced readonly
      
        [436149.586170] ------------[ cut here ]------------
        [436149.586173] BTRFS: Transaction aborted (error -75)
        [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G      D           4.18.0-rc5+ #1
        [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
        [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
        [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
        [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
        [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
        [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
        [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
        [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
        [436149.586260] FS:  00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
        [436149.586261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
        [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [436149.586296] Call Trace:
        [436149.586311]  vfs_rename+0x383/0x920
        [436149.586313]  ? vfs_rename+0x383/0x920
        [436149.586315]  do_renameat2+0x4ca/0x590
        [436149.586317]  __x64_sys_rename+0x20/0x30
        [436149.586324]  do_syscall_64+0x5a/0x120
        [436149.586330]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [436149.586332] RIP: 0033:0x7fa3133b1d37
        [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
        [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
        [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
        [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
      
      Thanks to Hans van Kranenburg for information about crc32 hash collision
      tools, I was able to reproduce the dir item collision with following
      python script.
      https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
      it under a btrfs volume will trigger the abort transaction.  It simply
      creates files and rename them to forged names that leads to
      hash collision.
      
      There are two ways to fix this. One is to simply revert the patch
      878f2d2c ("Btrfs: fix max dir item size calculation") to make the
      condition consistent although that patch is correct about the size.
      
      The other way is to handle the leaf space check correctly when
      collision happens. I prefer the second one since it correct leaf
      space check in collision case. This fix will not account
      sizeof(struct btrfs_item) when the item already exists.
      There are two places where ins_len doesn't contain
      sizeof(struct btrfs_item), however.
      
        1. extent-tree.c: lookup_inline_extent_backref
        2. file-item.c: btrfs_csum_file_blocks
      
      to make the logic of btrfs_search_slot more clear, we add a flag
      search_for_extension in btrfs_path.
      
      This flag indicates that ins_len passed to btrfs_search_slot doesn't
      contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot
      will use the actual size needed to calculate the required leaf space.
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarethanwu <ethanwu@synology.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9a664971
    • Filipe Manana's avatar
      btrfs: fix deadlock when cloning inline extent and low on free metadata space · 3d45f221
      Filipe Manana authored
      When cloning an inline extent there are cases where we can not just copy
      the inline extent from the source range to the target range (e.g. when the
      target range starts at an offset greater than zero). In such cases we copy
      the inline extent's data into a page of the destination inode and then
      dirty that page. However, after that we will need to start a transaction
      for each processed extent and, if we are ever low on available metadata
      space, we may need to flush existing delalloc for all dirty inodes in an
      attempt to release metadata space - if that happens we may deadlock:
      
      * the async reclaim task queued a delalloc work to flush delalloc for
        the destination inode of the clone operation;
      
      * the task executing that delalloc work gets blocked waiting for the
        range with the dirty page to be unlocked, which is currently locked
        by the task doing the clone operation;
      
      * the async reclaim task blocks waiting for the delalloc work to complete;
      
      * the cloning task is waiting on the waitqueue of its reservation ticket
        while holding the range with the dirty page locked in the inode's
        io_tree;
      
      * if metadata space is not released by some other task (like delalloc for
        some other inode completing for example), the clone task waits forever
        and as a consequence the delalloc work and async reclaim tasks will hang
        forever as well. Releasing more space on the other hand may require
        starting a transaction, which will hang as well when trying to reserve
        metadata space, resulting in a deadlock between all these tasks.
      
      When this happens, traces like the following show up in dmesg/syslog:
      
        [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
        [87452.323644]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.324852] task:kworker/u16:11  state:D stack:    0 pid:1810830 ppid:     2 flags:0x00004000
        [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
        [87452.326136] Call Trace:
        [87452.326737]  __schedule+0x5d1/0xcf0
        [87452.327390]  schedule+0x45/0xe0
        [87452.328174]  lock_extent_bits+0x1e6/0x2d0 [btrfs]
        [87452.328894]  ? finish_wait+0x90/0x90
        [87452.329474]  btrfs_invalidatepage+0x32c/0x390 [btrfs]
        [87452.330133]  ? __mod_memcg_state+0x8e/0x160
        [87452.330738]  __extent_writepage+0x2d4/0x400 [btrfs]
        [87452.331405]  extent_write_cache_pages+0x2b2/0x500 [btrfs]
        [87452.332007]  ? lock_release+0x20e/0x4c0
        [87452.332557]  ? trace_hardirqs_on+0x1b/0xf0
        [87452.333127]  extent_writepages+0x43/0x90 [btrfs]
        [87452.333653]  ? lock_acquire+0x1a3/0x490
        [87452.334177]  do_writepages+0x43/0xe0
        [87452.334699]  ? __filemap_fdatawrite_range+0xa4/0x100
        [87452.335720]  __filemap_fdatawrite_range+0xc5/0x100
        [87452.336500]  btrfs_run_delalloc_work+0x17/0x40 [btrfs]
        [87452.337216]  btrfs_work_helper+0xf1/0x600 [btrfs]
        [87452.337838]  process_one_work+0x24e/0x5e0
        [87452.338437]  worker_thread+0x50/0x3b0
        [87452.339137]  ? process_one_work+0x5e0/0x5e0
        [87452.339884]  kthread+0x153/0x170
        [87452.340507]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.341153]  ret_from_fork+0x22/0x30
        [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
        [87452.342487]       Tainted: G    B   W         5.10.0-rc4-btrfs-next-73 #1
        [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        [87452.344049] task:kworker/u16:1   state:D stack:    0 pid:2426217 ppid:     2 flags:0x00004000
        [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
        [87452.345655] Call Trace:
        [87452.346305]  __schedule+0x5d1/0xcf0
        [87452.346947]  ? kvm_clock_read+0x14/0x30
        [87452.347676]  ? wait_for_completion+0x81/0x110
        [87452.348389]  schedule+0x45/0xe0
        [87452.349077]  schedule_timeout+0x30c/0x580
        [87452.349718]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [87452.350340]  ? lock_acquire+0x1a3/0x490
        [87452.351006]  ? try_to_wake_up+0x7a/0xa20
        [87452.351541]  ? lock_release+0x20e/0x4c0
        [87452.352040]  ? lock_acquired+0x199/0x490
        [87452.352517]  ? wait_for_completion+0x81/0x110
        [87452.353000]  wait_for_completion+0xab/0x110
        [87452.353490]  start_delalloc_inodes+0x2af/0x390 [btrfs]
        [87452.353973]  btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
        [87452.354455]  flush_space+0x24f/0x660 [btrfs]
        [87452.355063]  btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
        [87452.355565]  process_one_work+0x24e/0x5e0
        [87452.356024]  worker_thread+0x20f/0x3b0
        [87452.356487]  ? process_one_work+0x5e0/0x5e0
        [87452.356973]  kthread+0x153/0x170
        [87452.357434]  ? kthread_mod_delayed_work+0xc0/0xc0
        [87452.357880]  ret_from_fork+0x22/0x30
        (...)
        < stack traces of several tasks waiting for the locks of the inodes of the
          clone operation >
        (...)
        [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
        [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
        [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
        [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
        [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
        [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
        [92867.447361] task:fsstress        state:D stack:    0 pid:2508238 ppid:2508153 flags:0x00004000
        [92867.447920] Call Trace:
        [92867.448435]  __schedule+0x5d1/0xcf0
        [92867.448934]  ? _raw_spin_unlock_irqrestore+0x3c/0x60
        [92867.449423]  schedule+0x45/0xe0
        [92867.449916]  __reserve_bytes+0x4a4/0xb10 [btrfs]
        [92867.450576]  ? finish_wait+0x90/0x90
        [92867.451202]  btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
        [92867.451815]  btrfs_block_rsv_add+0x1f/0x50 [btrfs]
        [92867.452412]  start_transaction+0x2d1/0x760 [btrfs]
        [92867.453216]  clone_copy_inline_extent+0x333/0x490 [btrfs]
        [92867.453848]  ? lock_release+0x20e/0x4c0
        [92867.454539]  ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
        [92867.455218]  btrfs_clone+0x569/0x7e0 [btrfs]
        [92867.455952]  btrfs_clone_files+0xf6/0x150 [btrfs]
        [92867.456588]  btrfs_remap_file_range+0x324/0x3d0 [btrfs]
        [92867.457213]  do_clone_file_range+0xd4/0x1f0
        [92867.457828]  vfs_clone_file_range+0x4d/0x230
        [92867.458355]  ? lock_release+0x20e/0x4c0
        [92867.458890]  ioctl_file_clone+0x8f/0xc0
        [92867.459377]  do_vfs_ioctl+0x342/0x750
        [92867.459913]  __x64_sys_ioctl+0x62/0xb0
        [92867.460377]  do_syscall_64+0x33/0x80
        [92867.460842]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        (...)
        < stack traces of more tasks blocked on metadata reservation like the clone
          task above, because the async reclaim task has deadlocked >
        (...)
      
      Another thing to notice is that the worker task that is deadlocked when
      trying to flush the destination inode of the clone operation is at
      btrfs_invalidatepage(). This is simply because the clone operation has a
      destination offset greater than the i_size and we only update the i_size
      of the destination file after cloning an extent (just like we do in the
      buffered write path).
      
      Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
      the flushing of delalloc for all inodes that have delalloc, add a runtime
      flag to an inode to signal it should not be flushed, and for inodes with
      that flag set, start_delalloc_inodes() will simply skip them. When the
      cloning code needs to dirty a page to copy an inline extent, set that flag
      on the inode and then clear it when the clone operation finishes.
      
      This could be sporadically triggered with test case generic/269 from
      fstests, which exercises many fsstress processes running in parallel with
      several dd processes filling up the entire filesystem.
      
      CC: stable@vger.kernel.org # 5.9+
      Fixes: 05a5a762 ("Btrfs: implement full reflink support for inline extents")
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      3d45f221
  2. 09 Dec, 2020 30 commits
    • Qu Wenruo's avatar
      btrfs: scrub: allow scrub to work with subpage sectorsize · b42fe98c
      Qu Wenruo authored
      Since btrfs scrub is utilizing its own infrastructure to submit
      read/write, scrub is independent from all other routines.
      
      This brings one very neat feature, allow us to read 4K data into offset
      0 of a 64K page.  So is the writeback routine.
      
      This makes scrub on subpage sector size much easier to implement, and
      thanks to previous commits which just changed the implementation to
      always do scrub based on sector size, now scrub can handle subpage
      filesystem without any problem.
      
      This patch will just remove the restriction on
      (sectorsize != PAGE_SIZE), to make scrub finally work on subpage
      filesystems.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b42fe98c
    • Qu Wenruo's avatar
      btrfs: scrub: support subpage data scrub · b29dca44
      Qu Wenruo authored
      Btrfs scrub is more flexible than buffered data write path, as we can
      read an unaligned subpage data into page offset 0.
      
      This ability makes subpage support much easier, we just need to check
      each scrub_page::page_len and ensure we only calculate hash for [0,
      page_len) of a page.
      
      There is a small thing to notice: for subpage case, we still do sector
      by sector scrub.  This means we will submit a read bio for each sector
      to scrub, resulting in the same amount of read bios, just like on the 4K
      page systems.
      
      This behavior can be considered as a good thing, if we want everything
      to be the same as 4K page systems.  But this also means, we're wasting
      the possibility to submit larger bio using 64K page size.  This is
      another problem to consider in the future.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      b29dca44
    • Qu Wenruo's avatar
      btrfs: scrub: support subpage tree block scrub · 53f3251d
      Qu Wenruo authored
      To support subpage tree block scrub, scrub_checksum_tree_block() only
      needs to learn 2 new tricks:
      
      - Follow sector size
        Now scrub_page only represents one sector, we need to follow it
        properly.
      
      - Run checksum on all sectors
        Since scrub_page only represents one sector, we need to run checksum
        on all sectors, not only (nodesize >> PAGE_SIZE).
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      53f3251d
    • Qu Wenruo's avatar
      btrfs: scrub: always allocate one full page for one sector for RAID56 · d0a7a9c0
      Qu Wenruo authored
      For scrub_pages() and scrub_pages_for_parity(), we currently allocate
      one scrub_page structure for one page.
      
      This is fine if we only read/write one sector one time.  But for cases
      like scrubbing RAID56, we need to read/write the full stripe, which is
      in 64K size for now.
      
      For subpage size, we will submit the read in just one page, which is
      normally a good thing, but for RAID56 case, it only expects to see one
      sector, not the full stripe in its endio function.
      This could lead to wrong parity checksum for RAID56 on subpage.
      
      To make the existing code work well for subpage case, here we take a
      shortcut by always allocating a full page for one sector.
      
      This should provide the base to make RAID56 work for subpage case.
      
      The cost is pretty obvious now, for one RAID56 stripe now we always need
      16 pages. For support subpage situation (64K page size, 4K sector size),
      this means we need full one megabyte to scrub just one RAID56 stripe.
      
      And for data scrub, each 4K sector will also need one 64K page.
      
      This is mostly just a workaround, the proper fix for this is a much
      larger project, using scrub_block to replace scrub_page, and allow
      scrub_block to handle multi pages, csums, and csum_bitmap to avoid
      allocating one page for each sector.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      d0a7a9c0
    • Qu Wenruo's avatar
      btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits · fa485d21
      Qu Wenruo authored
      Btrfs on-disk format chose to use u64 for almost everything, but there
      are a other restrictions that won't let us use more than u32 for things
      like extent length (the maximum length is 128MiB for non-hole extents),
      or stripe length (we have device number limit).
      
      This means if we don't have extra handling to convert u64 to u32, we
      will always have some questionable operations like
      "u32 = u64 >> sectorsize_bits" in the code.
      
      This patch will try to address the problem by reducing the width for the
      following members/parameters:
      
      - scrub_parity::stripe_len
      - @len of scrub_pages()
      - @extent_len of scrub_remap_extent()
      - @len of scrub_parity_mark_sectors_error()
      - @len of scrub_parity_mark_sectors_data()
      - @len of scrub_extent()
      - @len of scrub_pages_for_parity()
      - @len of scrub_extent_for_parity()
      
      For members extracted from on-disk structure, like map->stripe_len, they
      will be kept as is. Since that modification would require on-disk format
      change.
      
      There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call
      sites, extra ASSERT() is added to be extra safe for debug builds.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      fa485d21
    • Qu Wenruo's avatar
      btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs · 6275193e
      Qu Wenruo authored
      Refactor btrfs_lookup_bio_sums() by:
      
      - Remove the @file_offset parameter
        There are two factors making the @file_offset parameter useless:
      
        * For csum lookup in csum tree, file offset makes no sense
          We only need disk_bytenr, which is unrelated to file_offset
      
        * page_offset (file offset) of each bvec is not contiguous.
          Pages can be added to the same bio as long as their on-disk bytenr
          is contiguous, meaning we could have pages at different file offsets
          in the same bio.
      
        Thus passing file_offset makes no sense any more.
        The only user of file_offset is for data reloc inode, we will use
        a new function, search_file_offset_in_bio(), to handle it.
      
      - Extract the csum tree lookup into search_csum_tree()
        The new function will handle the csum search in csum tree.
        The return value is the same as btrfs_find_ordered_sum(), returning
        the number of found sectors which have checksum.
      
      - Change how we do the main loop
        The only needed info from bio is:
        * the on-disk bytenr
        * the length
      
        After extracting the above info, we can do the search without bio
        at all, which makes the main loop much simpler:
      
      	for (cur_disk_bytenr = orig_disk_bytenr;
      	     cur_disk_bytenr < orig_disk_bytenr + orig_len;
      	     cur_disk_bytenr += count * sectorsize) {
      
      		/* Lookup csum tree */
      		count = search_csum_tree(fs_info, path, cur_disk_bytenr,
      					 search_len, csum_dst);
      		if (!count) {
      			/* Csum hole handling */
      		}
      	}
      
      - Use single variable as the source to calculate all other offsets
        Instead of all different type of variables, we use only one main
        variable, cur_disk_bytenr, which represents the current disk bytenr.
      
        All involved values can be calculated from that variable, and
        all those variable will only be visible in the inner loop.
      
      The above refactoring makes btrfs_lookup_bio_sums() way more robust than
      it used to be, especially related to the file offset lookup.  Now
      file_offset lookup is only related to data reloc inode, otherwise we
      don't need to bother file_offset at all.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      6275193e
    • Qu Wenruo's avatar
      btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums · 9e46458a
      Qu Wenruo authored
      The function btrfs_lookup_bio_sums() is only called for read bios.
      While btrfs_find_ordered_sum() is to search ordered extent sums, which
      is only for write path.
      
      This means to read a page we either:
      
      - Submit read bio if it's not uptodate
        This means we only need to search csum tree for checksums.
      
      - The page is already uptodate
        It can be marked uptodate for previous read, or being marked dirty.
        As we always mark page uptodate for dirty page.
        In that case, we don't need to submit read bio at all, thus no need
        to search any checksums.
      
      Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums().
      And since btrfs_lookup_bio_sums() is the only caller for
      btrfs_find_ordered_sum(), also remove the implementation.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      9e46458a
    • Qu Wenruo's avatar
      btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors · 884b07d0
      Qu Wenruo authored
      To support sectorsize < PAGE_SIZE case, we need to take extra care of
      extent buffer accessors.
      
      Since sectorsize is smaller than PAGE_SIZE, one page can contain
      multiple tree blocks, we must use eb->start to determine the real offset
      to read/write for extent buffer accessors.
      
      This patch introduces two helpers to do this:
      
      - get_eb_page_index()
        This is to calculate the index to access extent_buffer::pages.
        It's just a simple wrapper around "start >> PAGE_SHIFT".
      
        For sectorsize == PAGE_SIZE case, nothing is changed.
        For sectorsize < PAGE_SIZE case, we always get index as 0, and
        the existing page shift also works.
      
      - get_eb_offset_in_page()
        This is to calculate the offset to access extent_buffer::pages.
        This needs to take extent_buffer::start into consideration.
      
        For sectorsize == PAGE_SIZE case, extent_buffer::start is always
        aligned to PAGE_SIZE, thus adding extent_buffer::start to
        offset_in_page() won't change the result.
        For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives
        us the correct offset to access.
      
      This patch will touch the following parts to cover all extent buffer
      accessors:
      
      - BTRFS_SETGET_HEADER_FUNCS()
      - read_extent_buffer()
      - read_extent_buffer_to_user()
      - memcmp_extent_buffer()
      - write_extent_buffer_chunk_tree_uuid()
      - write_extent_buffer_fsid()
      - write_extent_buffer()
      - memzero_extent_buffer()
      - copy_extent_buffer_full()
      - copy_extent_buffer()
      - memcpy_extent_buffer()
      - memmove_extent_buffer()
      - btrfs_get_token_##bits()
      - btrfs_get_##bits()
      - btrfs_set_token_##bits()
      - btrfs_set_##bits()
      - generic_bin_search()
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      884b07d0
    • Qu Wenruo's avatar
      btrfs: update num_extent_pages to support subpage sized extent buffer · 4a3dc938
      Qu Wenruo authored
      For subpage sized extent buffer, we have ensured no extent buffer will
      cross page boundary, thus we would only need one page for any extent
      buffer.
      
      Update function num_extent_pages to handle such case.  Now
      num_extent_pages() returns 1 for subpage sized extent buffer.
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      4a3dc938
    • Qu Wenruo's avatar
      btrfs: don't allow tree block to cross page boundary for subpage support · 1aaac38c
      Qu Wenruo authored
      As a preparation for subpage sector size support (allowing filesystem
      with sector size smaller than page size to be mounted) if the sector
      size is smaller than page size, we don't allow tree block to be read if
      it crosses 64K(*) boundary.
      
      The 64K is selected because:
      
      - we are only going to support 64K page size for subpage for now
      - 64K is also the maximum supported node size
      
      This ensures that tree blocks are always contained in one page for a
      system with 64K page size, which can greatly simplify the handling.
      
      Otherwise we would have to do complex multi-page handling of tree
      blocks.  Currently there is no way to create such tree blocks.
      
      In kernel we have avoided such tree blocks allocation even on 4K page
      size, as it can lead to RAID56 stripe scrubbing.
      
      While btrfs-progs have fixed its chunk allocator since 2016 for convert,
      and has extra checks to do the same behavior as the kernel.
      
      Just add such graceful checks in case of an ancient filesystem.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1aaac38c
    • Qu Wenruo's avatar
      btrfs: calculate inline extent buffer page size based on page size · deb67895
      Qu Wenruo authored
      Btrfs only support 64K as maximum node size, thus for 4K page system, we
      would have at most 16 pages for one extent buffer.
      
      For a system using 64K page size, we would really have just one page.
      
      While we always use 16 pages for extent_buffer::pages, this means for
      systems using 64K pages, we are wasting memory for 15 page pointers
      which will never be used.
      
      Calculate the array size based on page size and the node size maximum.
      
      - for systems using 4K page size, it will stay 16 pages
      - for systems using 64K page size, it will be 1 page
      
      Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to
      avoid circular inclusion of ctree.h.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      deb67895
    • Qu Wenruo's avatar
      btrfs: factor out btree page submission code to a helper · f91e0d0c
      Qu Wenruo authored
      In btree_write_cache_pages() we have a btree page submission routine
      buried deeply in a nested loop.
      
      This patch will extract that part of code into a helper function,
      submit_eb_page(), to do the same work.
      
      Since submit_eb_page() now can return >0 for successful extent
      buffer submission, remove the "ASSERT(ret <= 0);" line.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f91e0d0c
    • Qu Wenruo's avatar
      btrfs: make btrfs_verify_data_csum follow sector size · f44cf410
      Qu Wenruo authored
      Currently btrfs_verify_data_csum() just passes the whole page to
      check_data_csum(), which is fine since we only support sectorsize ==
      PAGE_SIZE.
      
      To support subpage, we need to properly honor per-sector
      checksum verification, just like what we did in dio read path.
      
      This patch will do the csum verification in a for loop, starts with
      pg_off == start - page_offset(page), with sectorsize increase for
      each loop.
      
      For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we
      will only loop once.
      
      For subpage case, we do the iterate over each sector and if we found any
      error, we return error.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      f44cf410
    • Qu Wenruo's avatar
      btrfs: pass bio_offset to check_data_csum() directly · 7ffd27e3
      Qu Wenruo authored
      Parameter icsum for check_data_csum() is a little hard to understand.
      So is the phy_offset for btrfs_verify_data_csum().
      
      Both parameters are calculated values for csum lookup.
      
      Instead of some calculated value, just pass bio_offset and let the
      final and only user, check_data_csum(), calculate whatever it needs.
      
      Since we are here, also make the bio_offset parameter and some related
      variables to be u32 (unsigned int).
      As bio size is limited by its bi_size, which is unsigned int, and has
      extra size limit check during various bio operations.
      Thus we are ensured that bio_offset won't overflow u32.
      
      Thus for all involved functions, not only rename the parameter from
      @phy_offset to @bio_offset, but also reduce its width to u32, so we
      won't have suspicious "u32 = u64 >> sector_bits;" lines anymore.
      Reviewed-by: default avatarJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: default avatarNikolay Borisov <nborisov@suse.com>
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      7ffd27e3
    • Qu Wenruo's avatar
      btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset · 1941b64b
      Qu Wenruo authored
      The parameter bio_offset of extent_submit_bio_start_t is very confusing.
      If it's really bio_offset (offset to bio), then it should be u32.  But
      in fact, it's only utilized by dio read, and that member is used as file
      offset, which must be u64.
      
      Rename it to dio_file_offset since the only user uses it as file offset,
      and add comment for who is using it.
      Signed-off-by: default avatarQu Wenruo <wqu@suse.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      1941b64b
    • Boris Burkov's avatar
      btrfs: fix lockdep warning when creating free space tree · 8a6a87cd
      Boris Burkov authored
      A lock dependency loop exists between the root tree lock, the extent tree
      lock, and the free space tree lock.
      
      The root tree lock depends on the free space tree lock because
      btrfs_create_tree holds the new tree's lock while adding it to the root
      tree.
      
      The extent tree lock depends on the root tree lock because during
      umount, we write out space cache v1, which writes inodes in the root
      tree, which results in holding the root tree lock while doing a lookup
      in the extent tree.
      
      Finally, the free space tree depends on the extent tree because
      populate_free_space_tree holds a locked path in the extent tree and then
      does a lookup in the free space tree to add the new item.
      
      The simplest of the three to break is the one during tree creation: we
      unlock the leaf before inserting the tree node into the root tree, which
      fixes the lockdep warning.
      
        [30.480136] ======================================================
        [30.480830] WARNING: possible circular locking dependency detected
        [30.481457] 5.9.0-rc8+ #76 Not tainted
        [30.481897] ------------------------------------------------------
        [30.482500] mount/520 is trying to acquire lock:
        [30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.484054]
      	      but task is already holding lock:
        [30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.485581]
      	      which lock already depends on the new lock.
      
        [30.486397]
      	      the existing dependency chain (in reverse order) is:
        [30.487205]
      	      -> #2 (btrfs-extent-01#2){++++}-{3:3}:
        [30.487825]        down_read_nested+0x43/0x150
        [30.488306]        __btrfs_tree_read_lock+0x39/0x180
        [30.488868]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.489477]        btrfs_search_slot+0x464/0x9b0
        [30.490009]        check_committed_ref+0x59/0x1d0
        [30.490603]        btrfs_cross_ref_exist+0x65/0xb0
        [30.491108]        run_delalloc_nocow+0x405/0x930
        [30.491651]        btrfs_run_delalloc_range+0x60/0x6b0
        [30.492203]        writepage_delalloc+0xd4/0x150
        [30.492688]        __extent_writepage+0x18d/0x3a0
        [30.493199]        extent_write_cache_pages+0x2af/0x450
        [30.493743]        extent_writepages+0x34/0x70
        [30.494231]        do_writepages+0x31/0xd0
        [30.494642]        __filemap_fdatawrite_range+0xad/0xe0
        [30.495194]        btrfs_fdatawrite_range+0x1b/0x50
        [30.495677]        __btrfs_write_out_cache+0x40d/0x460
        [30.496227]        btrfs_write_out_cache+0x8b/0x110
        [30.496716]        btrfs_start_dirty_block_groups+0x211/0x4e0
        [30.497317]        btrfs_commit_transaction+0xc0/0xba0
        [30.497861]        sync_filesystem+0x71/0x90
        [30.498303]        btrfs_remount+0x81/0x433
        [30.498767]        reconfigure_super+0x9f/0x210
        [30.499261]        path_mount+0x9d1/0xa30
        [30.499722]        do_mount+0x55/0x70
        [30.500158]        __x64_sys_mount+0xc4/0xe0
        [30.500616]        do_syscall_64+0x33/0x40
        [30.501091]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.501629]
      	      -> #1 (btrfs-root-00){++++}-{3:3}:
        [30.502241]        down_read_nested+0x43/0x150
        [30.502727]        __btrfs_tree_read_lock+0x39/0x180
        [30.503291]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.503903]        btrfs_search_slot+0x464/0x9b0
        [30.504405]        btrfs_insert_empty_items+0x60/0xa0
        [30.504973]        btrfs_insert_item+0x60/0xd0
        [30.505412]        btrfs_create_tree+0x1b6/0x210
        [30.505913]        btrfs_create_free_space_tree+0x54/0x110
        [30.506460]        btrfs_mount_rw+0x15d/0x20f
        [30.506937]        btrfs_remount+0x356/0x433
        [30.507369]        reconfigure_super+0x9f/0x210
        [30.507868]        path_mount+0x9d1/0xa30
        [30.508264]        do_mount+0x55/0x70
        [30.508668]        __x64_sys_mount+0xc4/0xe0
        [30.509186]        do_syscall_64+0x33/0x40
        [30.509652]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.510271]
      	      -> #0 (btrfs-free-space-00){++++}-{3:3}:
        [30.510972]        __lock_acquire+0x11ad/0x1b60
        [30.511432]        lock_acquire+0xa2/0x360
        [30.511917]        down_read_nested+0x43/0x150
        [30.512383]        __btrfs_tree_read_lock+0x39/0x180
        [30.512947]        __btrfs_read_lock_root_node+0x3a/0x50
        [30.513455]        btrfs_search_slot+0x464/0x9b0
        [30.513947]        search_free_space_info+0x45/0x90
        [30.514465]        __add_to_free_space_tree+0x92/0x39d
        [30.515010]        btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
        [30.515639]        btrfs_mount_rw+0x15d/0x20f
        [30.516142]        btrfs_remount+0x356/0x433
        [30.516538]        reconfigure_super+0x9f/0x210
        [30.517065]        path_mount+0x9d1/0xa30
        [30.517438]        do_mount+0x55/0x70
        [30.517824]        __x64_sys_mount+0xc4/0xe0
        [30.518293]        do_syscall_64+0x33/0x40
        [30.518776]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.519335]
      	      other info that might help us debug this:
      
        [30.520210] Chain exists of:
      		btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2
      
        [30.521407]  Possible unsafe locking scenario:
      
        [30.522037]        CPU0                    CPU1
        [30.522456]        ----                    ----
        [30.522941]   lock(btrfs-extent-01#2);
        [30.523311]                                lock(btrfs-root-00);
        [30.523952]                                lock(btrfs-extent-01#2);
        [30.524620]   lock(btrfs-free-space-00);
        [30.525068]
      	       *** DEADLOCK ***
      
        [30.525669] 5 locks held by mount/520:
        [30.526116]  #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30
        [30.527056]  #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0
        [30.527960]  #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d
        [30.529118]  #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180
        [30.530113]  #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100
        [30.531124]
      	      stack backtrace:
        [30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76
        [30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014
        [30.533215] Call Trace:
        [30.533452]  dump_stack+0x8d/0xc0
        [30.533797]  check_noncircular+0x13c/0x150
        [30.534233]  __lock_acquire+0x11ad/0x1b60
        [30.534667]  lock_acquire+0xa2/0x360
        [30.535063]  ? __btrfs_tree_read_lock+0x39/0x180
        [30.535525]  down_read_nested+0x43/0x150
        [30.535939]  ? __btrfs_tree_read_lock+0x39/0x180
        [30.536400]  __btrfs_tree_read_lock+0x39/0x180
        [30.536862]  __btrfs_read_lock_root_node+0x3a/0x50
        [30.537304]  btrfs_search_slot+0x464/0x9b0
        [30.537713]  ? trace_hardirqs_on+0x1c/0xf0
        [30.538148]  search_free_space_info+0x45/0x90
        [30.538572]  __add_to_free_space_tree+0x92/0x39d
        [30.539071]  ? printk+0x48/0x4a
        [30.539367]  btrfs_create_free_space_tree.cold.22+0x1ee/0x45d
        [30.539972]  btrfs_mount_rw+0x15d/0x20f
        [30.540350]  btrfs_remount+0x356/0x433
        [30.540773]  ? shrink_dcache_sb+0xd9/0x100
        [30.541203]  reconfigure_super+0x9f/0x210
        [30.541642]  path_mount+0x9d1/0xa30
        [30.542040]  do_mount+0x55/0x70
        [30.542366]  __x64_sys_mount+0xc4/0xe0
        [30.542822]  do_syscall_64+0x33/0x40
        [30.543197]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
        [30.543691] RIP: 0033:0x7f109f7ab93a
        [30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
        [30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a
        [30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0
        [30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0
        [30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000
        [30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8a6a87cd
    • Boris Burkov's avatar
      btrfs: skip space_cache v1 setup when not using it · af456a2c
      Boris Burkov authored
      If we are not using space cache v1, we should not create the free space
      object or free space inodes. This comes up when we delete the existing
      free space objects/inodes when migrating to v2, only to see them get
      recreated for every dirtied block group.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      af456a2c
    • Boris Burkov's avatar
      btrfs: remove free space items when disabling space cache v1 · 36b216c8
      Boris Burkov authored
      When the filesystem transitions from space cache v1 to v2 or to
      nospace_cache, it removes the old cached data, but does not remove
      the FREE_SPACE items nor the free space inodes they point to. This
      doesn't cause any issues besides being a bit inefficient, since these
      items no longer do anything useful.
      
      To fix it, when we are mounting, and plan to disable the space cache,
      destroy each block group's free space item and free space inode.
      The code to remove the items is lifted from the existing use case of
      removing the block group, with a light adaptation to handle whether or
      not we have already looked up the free space inode.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      36b216c8
    • Boris Burkov's avatar
      btrfs: warn when remount will not change the free space tree · 2838d255
      Boris Burkov authored
      If the remount is ro->ro, rw->ro, or rw->rw, we will not create or
      clear the free space tree. This can be surprising, so print a warning
      to dmesg to make the failure more visible. It is also important to
      ensure that the space cache options (SPACE_CACHE, FREE_SPACE_TREE) are
      consistent, so ensure those are set to properly match the current on
      disk state (which won't be changing).
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      2838d255
    • Boris Burkov's avatar
      btrfs: use superblock state to print space_cache mount option · 04c41559
      Boris Burkov authored
      To make the contents of /proc/mounts better match the actual state of
      the filesystem, base the display of the space cache mount options off
      the contents of the super block rather than the last mount options
      passed in. Since there are many scenarios where the mount will ignore a
      space cache option, simply showing the passed in option is misleading.
      
      For example, if we mount with -o remount,space_cache=v2 on a read-write
      file system without an existing free space tree, we won't build a free
      space tree, but /proc/mounts will read space_cache=v2 (until we mount
      again and it goes away)
      
      cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff
      space_cache=v2, and if neither is the case, we print nospace_cache.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      04c41559
    • Boris Burkov's avatar
      btrfs: keep sb cache_generation consistent with space_cache · 94846229
      Boris Burkov authored
      When mounting, btrfs uses the cache_generation in the super block to
      determine if space cache v1 is in use. However, by mounting with
      nospace_cache or space_cache=v2, it is possible to disable space cache
      v1, which does not result in un-setting cache_generation back to 0.
      
      In order to base some logic, like mount option printing in /proc/mounts,
      on the current state of the space cache rather than just the values of
      the mount option, keep the value of cache_generation consistent with the
      status of space cache v1.
      
      We ensure that cache_generation > 0 iff the file system is using
      space_cache v1. This requires committing a transaction on any mount
      which changes whether we are using v1. (v1->nospace_cache, v1->v2,
      nospace_cache->v1, v2->v1).
      
      Since the mechanism for writing out the cache generation is transaction
      commit, but we want some finer grained control over when we un-set it,
      we can't just rely on the SPACE_CACHE mount option, and introduce an
      fs_info flag that mount can use when it wants to unset the generation.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      94846229
    • Boris Burkov's avatar
      btrfs: clear free space tree on ro->rw remount · 8b228324
      Boris Burkov authored
      A user might want to revert to v1 or nospace_cache on a root filesystem,
      and much like turning on the free space tree, that can only be done
      remounting from ro->rw. Support clearing the free space tree on such
      mounts by moving it into the shared remount logic.
      
      Since the CLEAR_CACHE option sticks around across remounts, this change
      would result in clearing the tree for ever on every remount, which is
      not desirable. To fix that, add CLEAR_CACHE to the oneshot options we
      clear at mount end, which has the other bonus of not cluttering the
      /proc/mounts output with clear_cache.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8b228324
    • Boris Burkov's avatar
      btrfs: clear oneshot options on mount and remount · 8cd29088
      Boris Burkov authored
      Some options only apply during mount time and are cleared at the end
      of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also
      fits the bill, and this is a preparation patch for also clearing that
      option.
      
      One subtlety is that the current code only resets USEBACKUPROOT on rw
      mounts, but the option is meaningfully "consumed" by a ro mount, so it
      feels appropriate to clear in that case as well. A subsequent read-write
      remount would not go through open_ctree, which is the only place that
      checks the option, so the change should be benign.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8cd29088
    • Boris Burkov's avatar
      btrfs: create free space tree on ro->rw remount · 5011139a
      Boris Burkov authored
      When a user attempts to remount a btrfs filesystem with
      'mount -o remount,space_cache=v2', that operation silently succeeds.
      Unfortunately, this is misleading, because the remount does not create
      the free space tree. /proc/mounts will incorrectly show space_cache=v2,
      but on the next mount, the file system will revert to the old
      space_cache.
      
      For now, we handle only the easier case, where the existing mount is
      read-only and the new mount is read-write. In that case, we can create
      the free space tree without contending with the block groups changing
      as we go.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      5011139a
    • Boris Burkov's avatar
      btrfs: only mark bg->needs_free_space if free space tree is on · 997e3e2e
      Boris Burkov authored
      If we attempt to create a free space tree while any block groups have
      needs_free_space set, we will double add the new free space item
      and hit EEXIST. Previously, we only created the free space tree on a new
      mount, so we never hit the case, but if we try to create it on a
      remount, such block groups could exist and trip us up.
      
      We don't do anything with this field unless the free space tree is
      enabled, so there is no harm in not setting it.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      997e3e2e
    • Boris Burkov's avatar
      btrfs: start orphan cleanup on ro->rw remount · 8f1c21d7
      Boris Burkov authored
      When we mount a rw filesystem, we start the orphan cleanup process in
      tree root and filesystem tree. However, when we remount a ro file system
      rw, we only clean the former. Move the calls to btrfs_orphan_cleanup()
      on tree_root and fs_root to the shared rw mount routine to effectively
      add them on ro->rw remount.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      8f1c21d7
    • Boris Burkov's avatar
      btrfs: lift read-write mount setup from mount and remount · 44c0ca21
      Boris Burkov authored
      Mounting rw and remounting from ro to rw naturally share invariants and
      functionality which result in a correctly setup rw filesystem. Luckily,
      there is even a strong unity in the code which implements them. In
      mount's open_ctree, these operations mostly happen after an early return
      for ro file systems, and in remount, they happen in a section devoted to
      remounting ro->rw, after some remount specific validation passes.
      
      However, there are unfortunately a few differences. There are small
      deviations in the order of some of the operations, remount does not
      start orphan cleanup in root_tree or fs_tree, remount does not create
      the free space tree, and remount does not handle "one-shot" mount
      options like clear_cache and uuid tree rescan.
      
      Since we want to add building the free space tree to remount, and also
      to start the same orphan cleanup process on a filesystem mounted as ro
      then remounted rw, we would benefit from unifying the logic between the
      two code paths.
      
      This patch only lifts the existing common functionality, and leaves a
      natural path for fixing the discrepancies.
      Reviewed-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: default avatarBoris Burkov <boris@bur.io>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      44c0ca21
    • Filipe Manana's avatar
      btrfs: do not block inode logging for so long during transaction commit · 47876f7c
      Filipe Manana authored
      Early on during a transaction commit we acquire the tree_log_mutex and
      hold it until after we write the super blocks. But before writing the
      extent buffers dirtied by the transaction and the super blocks we unblock
      the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
      fs_info->running_transaction to NULL.
      
      This means that after that and before writing the super blocks, new
      transactions can start. However if any transaction wants to log an inode,
      it will block waiting for the transaction commit to write its dirty
      extent buffers and the super blocks because the tree_log_mutex is only
      released after those operations are complete, and starting a new log
      transaction blocks on that mutex (at start_log_trans()).
      
      Writing the dirty extent buffers and the super blocks can take a very
      significant amount of time to complete, but we could allow the tasks
      wanting to log an inode to proceed with most of their steps:
      
      1) create the log trees
      2) log metadata in the trees
      3) write their dirty extent buffers
      
      They only need to wait for the previous transaction commit to complete
      (write its super blocks) before they attempt to write their super blocks,
      otherwise we could end up with a corrupt filesystem after a crash.
      
      So change start_log_trans() to use the root tree's log_mutex to serialize
      for the creation of the log root tree instead of using the tree_log_mutex,
      and make btrfs_sync_log() acquire the tree_log_mutex before writing the
      super blocks. This allows for inode logging to wait much less time when
      there is a previous transaction that is still committing, often not having
      to wait at all, as by the time when we try to sync the log the previous
      transaction already wrote its super blocks.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      The following script that uses dbench was used to measure the impact of
      the whole patchset:
      
        $ cat test-dbench.sh
        #!/bin/bash
      
        DEV=/dev/nvme0n1
        MNT=/mnt/btrfs
        MOUNT_OPTIONS="-o ssd"
      
        echo "performance" | \
            tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
      
        mkfs.btrfs -f -m single -d single $DEV
        mount $MOUNT_OPTIONS $DEV $MNT
      
        dbench -D $MNT -t 300 64
      
        umount $MNT
      
      The test was run on a machine with 12 cores, 64G of ram, using a NVMe
      device and a non-debug kernel configuration (Debian's default).
      
      Before patch set:
      
       Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    11277211    0.250    85.340
       Close        8283172     0.002     6.479
       Rename        477515     1.935    86.026
       Unlink       2277936     0.770    87.071
       Deltree          256    15.732    81.379
       Mkdir            128     0.003     0.009
       Qpathinfo    10221180    0.056    44.404
       Qfileinfo    1789967     0.002     4.066
       Qfsinfo      1874399     0.003     9.176
       Sfileinfo     918589     0.061    10.247
       Find         3951758     0.341    54.040
       WriteX       5616547     0.047    85.079
       ReadX        17676028    0.005     9.704
       LockX          36704     0.003     1.800
       UnlockX        36704     0.002     0.687
       Flush         790541    14.115   676.236
      
      Throughput 1179.19 MB/sec  64 clients  64 procs  max_latency=676.240 ms
      
      After patch set:
      
      Operation      Count    AvgLat    MaxLat
       ----------------------------------------
       NTCreateX    12687926    0.171    86.526
       Close        9320780     0.002     8.063
       Rename        537253     1.444    78.576
       Unlink       2561827     0.559    87.228
       Deltree          374    11.499    73.549
       Mkdir            187     0.003     0.005
       Qpathinfo    11500300    0.061    36.801
       Qfileinfo    2017118     0.002     7.189
       Qfsinfo      2108641     0.003     4.825
       Sfileinfo    1033574     0.008     8.065
       Find         4446553     0.408    47.835
       WriteX       6335667     0.045    84.388
       ReadX        19887312    0.003     9.215
       LockX          41312     0.003     1.394
       UnlockX        41312     0.002     1.425
       Flush         889233    13.014   623.259
      
      Throughput 1339.32 MB/sec  64 clients  64 procs  max_latency=623.265 ms
      
      +12.7% throughput, -8.2% max latency
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47876f7c
    • Filipe Manana's avatar
      btrfs: fix race leading to unnecessary transaction commit when logging inode · 639bd575
      Filipe Manana authored
      When logging an inode we may often have to fallback to a full transaction
      commit, either because a new block group was allocated, there is some case
      we can not deal with without a transaction commit or some error like an
      ENOMEM happened. However after we fallback to a transaction commit, we
      have a time window where we can make the next attempt to log any inode
      commit the next transaction unnecessarily, adding additional overhead and
      increasing latency.
      
      A sequence of steps that leads to this issue is the following:
      
      1) The current open transaction has a generation of 1000;
      
      2) A new block group is allocated, and as a consequence we must make sure
         any attempts to commit a log fallback to a transaction commit, so
         btrfs_set_log_full_commit() is called from btrfs_make_block_group().
         This sets fs_info->last_trans_log_full_commit to 1000;
      
      3) Task A is holding a handle on transaction 1000 and tries to log inode X.
         Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
         which returns true, since fs_info->last_trans_log_full_commit has a
         value of 1000. So we end up returning EAGAIN and propagating it up to
         btrfs_sync_file(), where we commit transaction 1000;
      
      4) The transaction commit task (task A) sets the transaction state to
         unblocked (TRANS_STATE_UNBLOCKED);
      
      5) Some other task, task B, starts a new transaction with a generation of
         1001;
      
      6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
      
      7) Transaction 1000 has not fully committed yet, we are still writing all
         the extent buffers it created;
      
      8) Some new task, task C, starts an fsync of inode Y, gets a handle for
         transaction 1001, and it gets to btrfs_log_inode_parent() which does
         the following check:
      
           if (fs_info->last_trans_log_full_commit > last_committed) {
               ret = 1;
               goto end_no_trans;
           }
      
         At that point last_trans_log_full_commit has a value of 1000 and
         last_committed (value of fs_info->last_trans_committed) has a value of
         999, since transaction 1000 has not yet committed - it is either still
         writing out dirty extent buffers, its super blocks or unpinning
         extents.
      
         As a consequence we return 1, which gets propagated up to
         btrfs_sync_file(), which will then call btrfs_commit_transaction()
         for transaction 1001.
      
         As a consequence we have an unnecessary second transaction commit, we
         previously committed transaction 1000 and now commit transaction 1001
         as well, resulting in more overhead and increased latency.
      
      So fix this double transaction commit issue simply by removing that check,
      because all we need to do is wait for the previous transaction to finish
      its commit, which we already do later when starting the log transaction at
      start_log_trans(), because there we acquire the tree_log_mutex lock, which
      is held by a transaction commit and only released after the transaction
      commits its super blocks.
      
      Another issue that check has is that it reads last_trans_log_full_commit
      without using READ_ONCE(), which is incorrect since that member of
      struct btrfs_fs_info is always updated with WRITE_ONCE() through the
      helper btrfs_set_log_full_commit().
      
      This double transaction commit issue can actually be triggered quite often
      in long runs of dbench, since besides the creation of new block groups
      that force inode logging to fallback to a transaction commit, there are
      cases where dbench asks to fsync a directory which had files in it that
      were previously renamed or subdirectories that were removed, resulting in
      the inode logging to fallback to a full transaction commit.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      639bd575
    • Filipe Manana's avatar
      btrfs: fix race that makes inode logging fallback to transaction commit · 47d3db41
      Filipe Manana authored
      When logging an inode and the previous transaction is still committing, we
      have a time window where we can end up incorrectly think an inode has its
      last_unlink_trans field with a value greater than the last transaction
      committed, which results in the logging to fallback to a full transaction
      commit, which is usually much more expensive than doing a log commit.
      
      The race is described by the following steps:
      
      1) We are at transaction 1000;
      
      2) We modify an inode X (a directory) using transaction 1000 and set its
         last_unlink_trans field to 1000, because for example we removed one
         of its subdirectories;
      
      3) We create a new inode Y with a dentry in inode X using transaction 1000,
         so its generation field is set to 1000;
      
      4) The commit for transaction 1000 is started by task A;
      
      5) The task committing transaction 1000 sets the transaction state to
         unblocked, writes the dirty extent buffers and the super blocks, then
         unlocks tree_log_mutex;
      
      6) Some task starts a new transaction with a generation of 1001;
      
      7) We do some modification to inode Y (using transaction 1001);
      
      8) The transaction 1000 commit starts unpinning extents. At this point
         fs_info->last_trans_committed still has a value of 999;
      
      9) Task B starts an fsync on inode Y, and gets a handle for transaction
         1001. When it gets to check_parent_dirs_for_sync() it does the checking
         of the ancestor dentries because the following check does not evaluate
         to true:
      
             if (S_ISREG(inode->vfs_inode.i_mode) &&
                 inode->generation <= last_committed &&
                 inode->last_unlink_trans <= last_committed)
                     goto out;
      
         The generation value for inode Y is 1000 and last_committed, which has
         the value read from fs_info->last_trans_committed, has a value of 999,
         so that check evaluates to false and we proceed to check the ancestor
         inodes.
      
         Once we get to the first ancestor, inode X, we call
         btrfs_must_commit_transaction() on it, which evaluates to true:
      
         static bool btrfs_must_commit_transaction(...)
         {
             struct btrfs_fs_info *fs_info = inode->root->fs_info;
             bool ret = false;
      
             mutex_lock(&inode->log_mutex);
             if (inode->last_unlink_trans > fs_info->last_trans_committed) {
                 /*
                  * Make sure any commits to the log are forced to be full
                  * commits.
                  */
                  btrfs_set_log_full_commit(trans);
                  ret = true;
             }
          (...)
      
          because inode's X last_unlink_trans has a value of 1000 and
          fs_info->last_trans_committed still has a value of 999, it returns
          true to check_parent_dirs_for_sync(), making it return 1 which is
          propagated up to btrfs_sync_file(), causing it to fallback to a full
          transaction commit of transaction 1001.
      
          We should have not fallen back to commit transaction 1001, since inode
          X had last_unlink_trans set to 1000 and the super blocks for
          transaction 1000 were already written. So while not resulting in a
          functional problem, it leads to a lot more work and higher latencies
          for a fsync since committing a transaction is usually more expensive
          than committing a log (if other filesystem changes happened under that
          transaction).
      
      Similar problem happens when logging directories, for the same reason as
      btrfs_must_commit_transaction() returns true on an inode with its
      last_unlink_trans having the generation of the previous transaction and
      that transaction is still committing, unpinning its freed extents.
      
      So fix this by comparing last_unlink_trans with the id of the current
      transaction instead of fs_info->last_trans_committed.
      
      This case is often hit when running dbench for a long enough duration, as
      it does lots of rename and rmdir operations (both update the field
      last_unlink_trans of an inode) and fsyncs of files and directories.
      
      This patch belongs to a patch set that is comprised of the following
      patches:
      
        btrfs: fix race causing unnecessary inode logging during link and rename
        btrfs: fix race that results in logging old extents during a fast fsync
        btrfs: fix race that causes unnecessary logging of ancestor inodes
        btrfs: fix race that makes inode logging fallback to transaction commit
        btrfs: fix race leading to unnecessary transaction commit when logging inode
        btrfs: do not block inode logging for so long during transaction commit
      
      Performance results are mentioned in the change log of the last patch.
      Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
      Signed-off-by: default avatarDavid Sterba <dsterba@suse.com>
      47d3db41