    nvme-tcp: fix possible hang caused during ctrl deletion
    When we delete a controller, we execute the following:
    1. nvme_stop_ctrl() - stop some work elements that may be
    	inflight or scheduled (specifically also .stop_ctrl
    	which cancels ctrl error recovery work)
    2. nvme_remove_namespaces() - which first flushes scan_work
    	to avoid competing ns addition/removal
    3. continue to teardown the controller
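    In code form, the flow looks roughly like this (illustrative sketch
    only, using the function names mentioned above; not verbatim upstream
    code):

    	nvme_stop_ctrl(ctrl);         /* (1) invokes ctrl->ops->stop_ctrl(),
    	                               *     which today cancels err_work */
    	nvme_remove_namespaces(ctrl); /* (2) flush_work(&ctrl->scan_work)
    	                               *     before removing namespaces */
    	/* (3) teardown of queues and ctrl continues from here */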
    
    However, err_work is designed to cancel any inflight I/O, particularly
    I/O originating from ns scan_work in (2). If err_work was scheduled to
    run in (1) but is cancelled by .stop_ctrl() before it executes, we can
    prevent forward progress of (2), because ns scanning blocks on I/O
    that will now never be cancelled.
    
    The race is:
    1. transport layer error observed -> err_work is scheduled
    2. scan_work executes, discovers an ns, generates I/O to it
    3. nvme_stop_ctrl() -> .stop_ctrl() -> cancel_work_sync(err_work)
       - err_work never executed
    4. nvme_remove_namespaces() -> flush_work(scan_work)
    --> deadlock, because scan_work is blocked on I/O that was supposed
    to be cancelled by err_work, but err_work was cancelled before it
    could execute (see stack trace [1]).
    
    Fix this by flushing err_work instead of cancelling it, to force it
    to execute and cancel all inflight I/O.
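
    In the nvme-tcp .stop_ctrl() callback this amounts to roughly the
    following (a sketch of the intended change, assuming the driver's
    err_work and connect_work fields; not necessarily the exact hunk):

    	static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
    	{
    		/* was: cancel_work_sync(&to_tcp_ctrl(ctrl)->err_work);
    		 * flushing instead forces a pending err_work to run and
    		 * cancel inflight I/O before namespace removal proceeds */
    		flush_work(&to_tcp_ctrl(ctrl)->err_work);
    		cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
    	}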
    
    [1]:
    --
    Call Trace:
     <TASK>
     __schedule+0x390/0x910
     ? scan_shadow_nodes+0x40/0x40
     schedule+0x55/0xe0
     io_schedule+0x16/0x40
     do_read_cache_page+0x55d/0x850
     ? __page_cache_alloc+0x90/0x90
     read_cache_page+0x12/0x20
     read_part_sector+0x3f/0x110
     amiga_partition+0x3d/0x3e0
     ? osf_partition+0x33/0x220
     ? put_partition+0x90/0x90
     bdev_disk_changed+0x1fe/0x4d0
     blkdev_get_whole+0x7b/0x90
     blkdev_get_by_dev+0xda/0x2d0
     device_add_disk+0x356/0x3b0
     nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
     ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
     nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
     nvme_mpath_add_disk+0x120/0x160 [nvme_core]
     nvme_alloc_ns+0x594/0xa00 [nvme_core]
     nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
     ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
     nvme_scan_work+0x281/0x410 [nvme_core]
     process_one_work+0x1be/0x380
     worker_thread+0x37/0x3b0
     ? process_one_work+0x380/0x380
     kthread+0x12d/0x150
     ? set_kthread_struct+0x50/0x50
     ret_from_fork+0x1f/0x30
     </TASK>
    INFO: task nvme:6725 blocked for more than 491 seconds.
          Not tainted 5.15.65-f0.el7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:nvme            state:D
     stack:    0 pid: 6725 ppid:  1761 flags:0x00004000
    Call Trace:
     <TASK>
     __schedule+0x390/0x910
     ? sched_clock+0x9/0x10
     schedule+0x55/0xe0
     schedule_timeout+0x24b/0x2e0
     ? try_to_wake_up+0x358/0x510
     ? finish_task_switch+0x88/0x2c0
     wait_for_completion+0xa5/0x110
     __flush_work+0x144/0x210
     ? worker_attach_to_pool+0xc0/0xc0
     flush_work+0x10/0x20
     nvme_remove_namespaces+0x41/0xf0 [nvme_core]
     nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
     nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
     dev_attr_store+0x14/0x30
     sysfs_kf_write+0x38/0x50
     kernfs_fop_write_iter+0x146/0x1d0
     new_sync_write+0x114/0x1b0
     ? intel_pmu_handle_irq+0xe0/0x420
     vfs_write+0x18d/0x270
     ksys_write+0x61/0xe0
     __x64_sys_write+0x1a/0x20
     do_syscall_64+0x37/0x90
     entry_SYSCALL_64_after_hwframe+0x61/0xcb
    --
    
    Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
    Reported-by: Jonathan Nicklin <jnicklin@blockbridge.com>
    Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
    Tested-by: Jonathan Nicklin <jnicklin@blockbridge.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>