• Roman Pen's avatar
    workqueue: fix ghost PENDING flag while doing MQ IO · ce75b13e
    Roman Pen authored
    commit 346c09f8 upstream.
    
    The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
    with the following backtrace:
    
    [  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
    [  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
    [  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
    [  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
    [  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
    [  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
    [  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
    [  601.350965] Call Trace:
    [  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
    [  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
    [  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
    [  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
    [  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
    [  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
    [  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
    [  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
    [  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
    [  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
    [  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
    [  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
    [  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
    [  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
    [  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
    [  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
    [  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
    [  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
    [  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
    [  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
    [  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50
    
    The underlying device is a null_blk, with default parameters:
    
      queue_mode    = MQ
      submit_queues = 1
    
    Verification that nullb0 has something inflight:
    
    root@pserver8:~# cat /sys/block/nullb0/inflight
           0        1
    root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
    ...
    /sys/block/nullb0/mq/0/cpu2/rq_list
    CTX pending:
            ffff8838038e2400
    ...
    
    During debug it became clear that stalled request is always inserted in
    the rq_list from the following path:
    
       save_stack_trace_tsk + 34
       blk_mq_insert_requests + 231
       blk_mq_flush_plug_list + 281
       blk_flush_plug_list + 199
       wait_on_page_bit + 192
       __filemap_fdatawait_range + 228
       filemap_fdatawait_range + 20
       filemap_write_and_wait_range + 63
       blkdev_fsync + 27
       vfs_fsync_range + 73
       blkdev_write_iter + 202
       __vfs_write + 170
       vfs_write + 169
       kernel_write + 56
    
    So blk_flush_plug_list() was called with from_schedule == true.
    
    If from_schedule is true, that means that finally blk_mq_insert_requests()
    offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
    i.e. it calls kblockd_schedule_delayed_work_on().
    
    That means, that we race with another CPU, which is about to execute
    __blk_mq_run_hw_queue() work.
    
    Further debugging shows the following traces from different CPUs:
    
      CPU#0                                  CPU#1
      ----------------------------------     -------------------------------
      reqeust A inserted
      STORE hctx->ctx_map[0] bit marked
      kblockd_schedule...() returns 1
      <schedule to kblockd workqueue>
                                             request B inserted
                                             STORE hctx->ctx_map[1] bit marked
                                             kblockd_schedule...() returns 0
      *** WORK PENDING bit is cleared ***
      flush_busy_ctxs() is executed, but
      bit 1, set by CPU#1, is not observed
    
    As a result request B pended forever.
    
    This behaviour can be explained by speculative LOAD of hctx->ctx_map on
    CPU#0, which is reordered with clear of PENDING bit and executed _before_
    actual STORE of bit 1 on CPU#1.
    
    The proper fix is an explicit full barrier <mfence>, which guarantees
    that clear of PENDING bit is to be executed before all possible
    speculative LOADS or STORES inside actual work function.
    Signed-off-by: default avatarRoman Pen <roman.penyaev@profitbricks.com>
    Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
    Cc: Michael Wang <yun.wang@profitbricks.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: default avatarTejun Heo <tj@kernel.org>
    Signed-off-by: default avatarKamal Mostafa <kamal@canonical.com>
    ce75b13e
workqueue.c 147 KB