• Filipe Manana's avatar
    Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl · 79179a68
    Filipe Manana authored
    [ Upstream commit 0c0fe3b0 ]
    
    While doing some tests I ran into an hang on an extent buffer's rwlock
    that produced the following trace:
    
    [39389.800012] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [fdm-stress:32166]
    [39389.800016] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [fdm-stress:32165]
    [39389.800016] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
    [39389.800016] irq event stamp: 0
    [39389.800016] hardirqs last  enabled at (0): [<          (null)>]           (null)
    [39389.800016] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
    [39389.800016] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
    [39389.800016] softirqs last disabled at (0): [<          (null)>]           (null)
    [39389.800016] CPU: 14 PID: 32165 Comm: fdm-stress Not tainted 4.4.0-rc6-btrfs-next-18+ #1
    [39389.800016] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [39389.800016] task: ffff880175b1ca40 ti: ffff8800a185c000 task.ti: ffff8800a185c000
    [39389.800016] RIP: 0010:[<ffffffff810902af>]  [<ffffffff810902af>] queued_spin_lock_slowpath+0x57/0x158
    [39389.800016] RSP: 0018:ffff8800a185fb80  EFLAGS: 00000202
    [39389.800016] RAX: 0000000000000101 RBX: ffff8801710c4e9c RCX: 0000000000000101
    [39389.800016] RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000001
    [39389.800016] RBP: ffff8800a185fb98 R08: 0000000000000001 R09: 0000000000000000
    [39389.800016] R10: ffff8800a185fb68 R11: 6db6db6db6db6db7 R12: ffff8801710c4e98
    [39389.800016] R13: ffff880175b1ca40 R14: ffff8800a185fc10 R15: ffff880175b1ca40
    [39389.800016] FS:  00007f6d37fff700(0000) GS:ffff8802be9c0000(0000) knlGS:0000000000000000
    [39389.800016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [39389.800016] CR2: 00007f6d300019b8 CR3: 0000000037c93000 CR4: 00000000001406e0
    [39389.800016] Stack:
    [39389.800016]  ffff8801710c4e98 ffff8801710c4e98 ffff880175b1ca40 ffff8800a185fbb0
    [39389.800016]  ffffffff81091e11 ffff8801710c4e98 ffff8800a185fbc8 ffffffff81091895
    [39389.800016]  ffff8801710c4e98 ffff8800a185fbe8 ffffffff81486c5c ffffffffa067288c
    [39389.800016] Call Trace:
    [39389.800016]  [<ffffffff81091e11>] queued_read_lock_slowpath+0x46/0x60
    [39389.800016]  [<ffffffff81091895>] do_raw_read_lock+0x3e/0x41
    [39389.800016]  [<ffffffff81486c5c>] _raw_read_lock+0x3d/0x44
    [39389.800016]  [<ffffffffa067288c>] ? btrfs_tree_read_lock+0x54/0x125 [btrfs]
    [39389.800016]  [<ffffffffa067288c>] btrfs_tree_read_lock+0x54/0x125 [btrfs]
    [39389.800016]  [<ffffffffa0622ced>] ? btrfs_find_item+0xa7/0xd2 [btrfs]
    [39389.800016]  [<ffffffffa069363f>] btrfs_ref_to_path+0xd6/0x174 [btrfs]
    [39389.800016]  [<ffffffffa0693730>] inode_to_path+0x53/0xa2 [btrfs]
    [39389.800016]  [<ffffffffa0693e2e>] paths_from_inode+0x117/0x2ec [btrfs]
    [39389.800016]  [<ffffffffa0670cff>] btrfs_ioctl+0xd5b/0x2793 [btrfs]
    [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
    [39389.800016]  [<ffffffff81276727>] ? __this_cpu_preempt_check+0x13/0x15
    [39389.800016]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
    [39389.800016]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
    [39389.800016]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
    [39389.800016]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
    [39389.800016]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
    [39389.800016]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
    [39389.800016] Code: b9 01 01 00 00 f7 c6 00 ff ff ff 75 32 83 fe 01 89 ca 89 f0 0f 45 d7 f0 0f b1 13 39 f0 74 04 89 c6 eb e2 ff ca 0f 84 fa 00 00 00 <8b> 03 84 c0 74 04 f3 90 eb f6 66 c7 03 01 00 e9 e6 00 00 00 e8
    [39389.800012] Modules linked in: btrfs dm_mod ppdev xor sha256_generic hmac raid6_pq drbg ansi_cprng aesni_intel i2c_piix4 acpi_cpufreq aes_x86_64 ablk_helper tpm_tis parport_pc i2c_core sg cryptd evdev psmouse lrw tpm parport gf128mul serio_raw pcspkr glue_helper processor button loop autofs4 ext4 crc16 mbcache jbd2 sd_mod sr_mod cdrom ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring crc32c_intel scsi_mod e1000 virtio floppy [last unloaded: btrfs]
    [39389.800012] irq event stamp: 0
    [39389.800012] hardirqs last  enabled at (0): [<          (null)>]           (null)
    [39389.800012] hardirqs last disabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
    [39389.800012] softirqs last  enabled at (0): [<ffffffff8104e58d>] copy_process+0x638/0x1a35
    [39389.800012] softirqs last disabled at (0): [<          (null)>]           (null)
    [39389.800012] CPU: 15 PID: 32166 Comm: fdm-stress Tainted: G             L  4.4.0-rc6-btrfs-next-18+ #1
    [39389.800012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
    [39389.800012] task: ffff880179294380 ti: ffff880034a60000 task.ti: ffff880034a60000
    [39389.800012] RIP: 0010:[<ffffffff81091e8d>]  [<ffffffff81091e8d>] queued_write_lock_slowpath+0x62/0x72
    [39389.800012] RSP: 0018:ffff880034a639f0  EFLAGS: 00000206
    [39389.800012] RAX: 0000000000000101 RBX: ffff8801710c4e98 RCX: 0000000000000000
    [39389.800012] RDX: 00000000000000ff RSI: 0000000000000000 RDI: ffff8801710c4e9c
    [39389.800012] RBP: ffff880034a639f8 R08: 0000000000000001 R09: 0000000000000000
    [39389.800012] R10: ffff880034a639b0 R11: 0000000000001000 R12: ffff8801710c4e98
    [39389.800012] R13: 0000000000000001 R14: ffff880172cbc000 R15: ffff8801710c4e00
    [39389.800012] FS:  00007f6d377fe700(0000) GS:ffff8802be9e0000(0000) knlGS:0000000000000000
    [39389.800012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [39389.800012] CR2: 00007f6d3d3c1000 CR3: 0000000037c93000 CR4: 00000000001406e0
    [39389.800012] Stack:
    [39389.800012]  ffff8801710c4e98 ffff880034a63a10 ffffffff81091963 ffff8801710c4e98
    [39389.800012]  ffff880034a63a30 ffffffff81486f1b ffffffffa0672cb3 ffff8801710c4e00
    [39389.800012]  ffff880034a63a78 ffffffffa0672cb3 ffff8801710c4e00 ffff880034a63a58
    [39389.800012] Call Trace:
    [39389.800012]  [<ffffffff81091963>] do_raw_write_lock+0x72/0x8c
    [39389.800012]  [<ffffffff81486f1b>] _raw_write_lock+0x3a/0x41
    [39389.800012]  [<ffffffffa0672cb3>] ? btrfs_tree_lock+0x119/0x251 [btrfs]
    [39389.800012]  [<ffffffffa0672cb3>] btrfs_tree_lock+0x119/0x251 [btrfs]
    [39389.800012]  [<ffffffffa061aeba>] ? rcu_read_unlock+0x5b/0x5d [btrfs]
    [39389.800012]  [<ffffffffa061ce13>] ? btrfs_root_node+0xda/0xe6 [btrfs]
    [39389.800012]  [<ffffffffa061ce83>] btrfs_lock_root_node+0x22/0x42 [btrfs]
    [39389.800012]  [<ffffffffa062046b>] btrfs_search_slot+0x1b8/0x758 [btrfs]
    [39389.800012]  [<ffffffff810fc6b0>] ? time_hardirqs_on+0x15/0x28
    [39389.800012]  [<ffffffffa06365db>] btrfs_lookup_inode+0x31/0x95 [btrfs]
    [39389.800012]  [<ffffffff8108d62f>] ? trace_hardirqs_on+0xd/0xf
    [39389.800012]  [<ffffffff8148482b>] ? mutex_lock_nested+0x397/0x3bc
    [39389.800012]  [<ffffffffa068821b>] __btrfs_update_delayed_inode+0x59/0x1c0 [btrfs]
    [39389.800012]  [<ffffffffa068858e>] __btrfs_commit_inode_delayed_items+0x194/0x5aa [btrfs]
    [39389.800012]  [<ffffffff81486ab7>] ? _raw_spin_unlock+0x31/0x44
    [39389.800012]  [<ffffffffa0688a48>] __btrfs_run_delayed_items+0xa4/0x15c [btrfs]
    [39389.800012]  [<ffffffffa0688d62>] btrfs_run_delayed_items+0x11/0x13 [btrfs]
    [39389.800012]  [<ffffffffa064048e>] btrfs_commit_transaction+0x234/0x96e [btrfs]
    [39389.800012]  [<ffffffffa0618d10>] btrfs_sync_fs+0x145/0x1ad [btrfs]
    [39389.800012]  [<ffffffffa0671176>] btrfs_ioctl+0x11d2/0x2793 [btrfs]
    [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
    [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
    [39389.800012]  [<ffffffff81140261>] ? __might_fault+0x4c/0xa7
    [39389.800012]  [<ffffffff8108a8b0>] ? arch_local_irq_save+0x9/0xc
    [39389.800012]  [<ffffffff8118b3d4>] ? rcu_read_unlock+0x3e/0x5d
    [39389.800012]  [<ffffffff811822f8>] do_vfs_ioctl+0x42b/0x4ea
    [39389.800012]  [<ffffffff8118b4f3>] ? __fget_light+0x62/0x71
    [39389.800012]  [<ffffffff8118240e>] SyS_ioctl+0x57/0x79
    [39389.800012]  [<ffffffff814872d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
    [39389.800012] Code: f0 0f b1 13 85 c0 75 ef eb 2a f3 90 8a 03 84 c0 75 f8 f0 0f b0 13 84 c0 75 f0 ba ff 00 00 00 eb 0a f0 0f b1 13 ff c8 74 0b f3 90 <8b> 03 83 f8 01 75 f7 eb ed c6 43 04 00 5b 5d c3 0f 1f 44 00 00
    
    This happens because in the code path executed by the inode_paths ioctl we
    end up nesting two calls to read lock a leaf's rwlock when after the first
    call to read_lock() and before the second call to read_lock(), another
    task (running the delayed items as part of a transaction commit) has
    already called write_lock() against the leaf's rwlock. This situation is
    illustrated by the following diagram:
    
             Task A                       Task B
    
      btrfs_ref_to_path()               btrfs_commit_transaction()
        read_lock(&eb->lock);
    
                                          btrfs_run_delayed_items()
                                            __btrfs_commit_inode_delayed_items()
                                              __btrfs_update_delayed_inode()
                                                btrfs_lookup_inode()
    
                                                  write_lock(&eb->lock);
                                                    --> task waits for lock
    
        read_lock(&eb->lock);
        --> makes this task hang
            forever (and task B too
    	of course)
    
    So fix this by avoiding doing the nested read lock, which is easily
    avoidable. This issue does not happen if task B calls write_lock() after
    task A does the second call to read_lock(), however there does not seem
    to exist anything in the documentation that mentions what is the expected
    behaviour for recursive locking of rwlocks (leaving the idea that doing
    so is not a good usage of rwlocks).
    
    Also, as a side effect necessary for this fix, make sure we do not
    needlessly read lock extent buffers when the input path has skip_locking
    set (used when called from send).
    
    Cc: stable@vger.kernel.org
    Signed-off-by: default avatarFilipe Manana <fdmanana@suse.com>
    Signed-off-by: default avatarSasha Levin <sasha.levin@oracle.com>
    79179a68
backref.c 50.5 KB