• Asias He's avatar
    block: Mitigate lock unbalance caused by lock switching · 5e5cfac0
    Asias He authored
    Commit 777eb1bf disconnects externally
    supplied queue_lock before blk_drain_queue(). Switching the lock would
    introduce lock unbalance because theads which have taken the external
    lock might unlock the internal lock in the during the queue drain. This
    patch mitigate this by disconnecting the lock after the queue draining
    since queue draining makes a lot of request_queue users go away.
    
    However, please note, this patch only makes the problem less likely to
    happen. Anyone who still holds a ref might try to issue a new request on
    a dead queue after the blk_cleanup_queue() finishes draining, the lock
    unbalance might still happen in this case.
    
     =====================================
     [ BUG: bad unlock balance detected! ]
     3.4.0+ #288 Not tainted
     -------------------------------------
     fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
     [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
     but there are no more locks to release!
    
     other info that might help us debug this:
     1 lock held by fio/17706:
      #0:  (&(&vblk->lock)->rlock){......}, at: [<ffffffff81327f1a>]
     get_request_wait+0x19a/0x250
    
     stack backtrace:
     Pid: 17706, comm: fio Not tainted 3.4.0+ #288
     Call Trace:
      [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
      [<ffffffff810dea49>] print_unlock_inbalance_bug+0xf9/0x100
      [<ffffffff810dfe4f>] lock_release_non_nested+0x1df/0x330
      [<ffffffff811dae24>] ? dio_bio_end_aio+0x34/0xc0
      [<ffffffff811d6935>] ? bio_check_pages_dirty+0x85/0xe0
      [<ffffffff811daea1>] ? dio_bio_end_aio+0xb1/0xc0
      [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
      [<ffffffff81329372>] ? blk_queue_bio+0x2a2/0x380
      [<ffffffff810e0079>] lock_release+0xd9/0x250
      [<ffffffff81a74553>] _raw_spin_unlock_irq+0x23/0x40
      [<ffffffff81329372>] blk_queue_bio+0x2a2/0x380
      [<ffffffff81328faa>] generic_make_request+0xca/0x100
      [<ffffffff81329056>] submit_bio+0x76/0xf0
      [<ffffffff8115470c>] ? set_page_dirty_lock+0x3c/0x60
      [<ffffffff811d69e1>] ? bio_set_pages_dirty+0x51/0x70
      [<ffffffff811dd1a8>] do_blockdev_direct_IO+0xbf8/0xee0
      [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
      [<ffffffff811dd4e5>] __blockdev_direct_IO+0x55/0x60
      [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
      [<ffffffff811d92e7>] blkdev_direct_IO+0x57/0x60
      [<ffffffff811d8620>] ? blkdev_get_block+0x80/0x80
      [<ffffffff8114c6ae>] generic_file_aio_read+0x70e/0x760
      [<ffffffff810df7c5>] ? __lock_acquire+0x215/0x5a0
      [<ffffffff811e9924>] ? aio_run_iocb+0x54/0x1a0
      [<ffffffff8114bfa0>] ? grab_cache_page_nowait+0xc0/0xc0
      [<ffffffff811e82cc>] aio_rw_vect_retry+0x7c/0x1e0
      [<ffffffff811e8250>] ? aio_fsync+0x30/0x30
      [<ffffffff811e9936>] aio_run_iocb+0x66/0x1a0
      [<ffffffff811ea9b0>] do_io_submit+0x6f0/0xb80
      [<ffffffff8134de2e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [<ffffffff811eae50>] sys_io_submit+0x10/0x20
      [<ffffffff81a7c9e9>] system_call_fastpath+0x16/0x1b
    
    Changes since v2: Update commit log to explain how the code is still
                      broken even if we delay the lock switching after the drain.
    Changes since v1: Update commit log as Tejun suggested.
    Acked-by: default avatarTejun Heo <tj@kernel.org>
    Signed-off-by: default avatarAsias He <asias@redhat.com>
    Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
    5e5cfac0
blk-core.c 80 KB