  1. 20 Dec, 2018 15 commits
    • drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire") · f31e583a
      Lars Ellenberg authored
      And also re-enable partial-zero-out + discard aligned.
      
      With the introduction of REQ_OP_WRITE_ZEROES,
      we started to use that for both WRITE_ZEROES and DISCARDS,
      hoping that WRITE_ZEROES would "do what we want",
      UNMAP if possible, zero-out the rest.
      
      The example scenario is some LVM "thin" backend.
      
      While an un-allocated block on dm-thin reads as zeroes, on a dm-thin
      with "skip_block_zeroing=true", once a partial block write has
      allocated that block, that same block may well map "undefined old
      garbage" from the backend on LBAs that have not yet been written to.
      
      If we cannot distinguish between zero-out and discard on the receiving
      side, then to avoid "undefined old garbage" popping up randomly at
      later times on supposedly zero-initialized blocks, we would need to
      map all discards to zero-out on the receiving side.  But that would
      potentially do a full alloc on thinly provisioned backends, even when
      the expectation was to unmap/trim/discard/de-allocate.
      
      We need to distinguish on the protocol level, whether we need to guarantee
      zeroes (and thus use zero-out, potentially doing the mentioned full-alloc),
      or if we want to put the emphasis on discard, and only do a "best effort
      zeroing" (by "discarding" blocks aligned to discard-granularity, and zeroing
      only potential unaligned head and tail clippings to at least *try* to
      avoid "false positives" in an online-verify later), hoping that someone
      set skip_block_zeroing=false.
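
      The head/tail clipping idea, as a sketch in C (zero_out()/discard()
      and the function itself are illustrative stand-ins, not the actual
      DRBD code):

        /* stand-ins for the real submission paths */
        extern void zero_out(sector_t start, sector_t nr);
        extern void discard(sector_t start, sector_t nr);

        /* Best-effort zeroing of [start, start+nr): explicitly zero the
         * unaligned head and tail clippings, discard the aligned middle. */
        static void best_effort_zero_out(sector_t start, sector_t nr,
                                         unsigned int granularity)
        {
                sector_t head = roundup(start, granularity) - start;
                sector_t tail;

                if (nr <= head) {        /* smaller than one discard chunk */
                        zero_out(start, nr);
                        return;
                }
                if (head) {              /* unaligned head clipping */
                        zero_out(start, head);
                        start += head;
                        nr -= head;
                }
                tail = nr % granularity; /* unaligned tail clipping */
                if (tail)
                        zero_out(start + nr - tail, tail);
                discard(start, nr - tail); /* aligned middle: "best effort" */
        }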
      
      For some discussion regarding this on dm-devel, see also
      https://www.mail-archive.com/dm-devel%40redhat.com/msg07965.html
      https://www.redhat.com/archives/dm-devel/2018-January/msg00271.html
      
      For backward compatibility, P_TRIM means zero-out, unless the
      DRBD_FF_WZEROES feature flag is agreed upon during handshake.
      
      To have upper layers even try to submit WRITE ZEROES requests,
      we need to announce "efficient zeroout" independently.
      
      We need to fixup max_write_zeroes_sectors after blk_queue_stack_limits():
      if we can handle "zeroes" efficiently on the protocol,
      we want to do that, even if our backend does not announce
      max_write_zeroes_sectors itself.
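
      Roughly, the fixup amounts to this (the predicate and the cap
      constant are illustrative stand-ins):

        /* After blk_queue_stack_limits() copied the backend's limits:
         * if the peer protocol can express write-zeroes, announce it,
         * even though the local backing device reported 0. */
        if (connection_can_do_write_zeroes(connection))         /* stand-in */
                q->limits.max_write_zeroes_sectors = DRBD_MAX_BBIO_SECTORS;
        else
                q->limits.max_write_zeroes_sectors = 0;
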
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f31e583a
    • drbd: skip spurious timeout (ping-timeo) when failing promote · 9848b6dd
      Lars Ellenberg authored
      If you try to promote a Secondary while connected to a Primary
      and allow-two-primaries is NOT set, we will wait for "ping-timeout"
      to give this node a chance to detect a dead primary,
      in case the cluster manager noticed faster than we did.
      
      But if we then are *still* connected to a Primary,
      we fail (after an additional ping-timeout).
      
      This change skips the spurious second timeout.
      
      Most people won't really notice,
      since "ping-timeout" by default is half a second.
      
      But in some installations, ping-timeout may be 10 or 20 seconds or more,
      and spuriously delaying the error return becomes annoying.
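
      The shape of the change, sketched (helper names illustrative):

        /* Wait up to ping-timeout once for dead-primary detection;
         * if the peer still claims Primary afterwards, fail right away
         * instead of running into a second ping-timeout inside the
         * state change. */
        wait_event_interruptible_timeout(state_wait,
                                         !peer_is_primary(connection),
                                         ping_timeout_jiffies);
        if (peer_is_primary(connection))
                return SS_TWO_PRIMARIES;    /* fail promote immediately */
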
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9848b6dd
    • drbd: don't retry connection if peers do not agree on "authentication" settings · 9049ccd4
      Lars Ellenberg authored
      emma: "Unexpected data packet AuthChallenge (0x0010)"
       ava: "expected AuthChallenge packet, received: ReportProtocol (0x000b)"
            "Authentication of peer failed, trying again."
      
      Pattern repeats.
      
      There is no point in retrying the handshake,
      if we expect to receive an AuthChallenge,
      but the peer is not even configured to expect or use a shared secret.
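
      Sketched, the new bail-out looks roughly like this (the helper
      names and the exact packet check are illustrative):

        /* During the handshake: we are configured with a shared secret,
         * but the peer answers with ReportProtocol instead of
         * AuthChallenge, so it has no secret configured at all. */
        if (have_shared_secret && pi.cmd != P_AUTH_CHALLENGE) {
                drbd_err(connection,
                         "peer is not configured for authentication, giving up\n");
                return -1;      /* abort; retrying cannot succeed */
        }
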
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9049ccd4
    • drbd: fix print_st_err()'s prototype to match the definition · 2c38f035
      Luc Van Oostenryck authored
      print_st_err() is defined with its 4th argument taking an
      'enum drbd_state_rv', but its prototype uses an int for it.
      
      Fix this by using 'enum drbd_state_rv' in the prototype too.
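
      For illustration (only the 4th parameter is the point; the rest of
      the signature is abbreviated from the driver):

        /* before: prototype disagreed with the definition's 4th argument */
        extern void print_st_err(struct drbd_device *, union drbd_state,
                                 union drbd_state, int);
        /* after: prototype matches the definition */
        extern void print_st_err(struct drbd_device *, union drbd_state,
                                 union drbd_state, enum drbd_state_rv);
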
      Signed-off-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
      Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2c38f035
    • drbd: avoid spurious self-outdating with concurrent disconnect / down · be80ff88
      Lars Ellenberg authored
      If peers are "simultaneously" told to disconnect from each other,
      either explicitly, or implicitly by taking down the resource,
      with bad timing, one side may see its disconnect "fail" with
      a result of "state change failed by peer", and interpret this as
      "please outdate yourself".
      
      Try to catch this by checking for current connection status,
      and possibly retry as local-only state change instead.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      be80ff88
    • drbd: do not block when adjusting "disk-options" while IO is frozen · f708bd08
      Lars Ellenberg authored
      "suspending" IO is overloaded.
      It can mean "do not allow new requests" (obviously),
      but it also may mean "must not complete pending IO",
      for example while the fencing handlers do their arbitration.
      
      When adjusting disk options, we suspend io (disallow new requests), then
      wait for the activity-log to become unused (drain all IO completions),
      and possibly replace it with a new activity log of different size.
      
      If the other "suspend IO" aspect is active, pending IO completions won't
      happen, and we would block forever (unkillable drbdsetup process).
      
      Fix this by skipping the activity log adjustment if the "al-extents"
      setting did not change. Also, in case it did change, fail early without
      blocking if it looks like we would block forever.
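
      Sketched (the predicate name and error return are illustrative
      stand-ins):

        /* adjusting disk options: */
        if (new_disk_conf->al_extents != old_disk_conf->al_extents) {
                if (io_completions_suspended(device))    /* stand-in */
                        return -EBUSY;   /* would block forever: fail early */
                /* safe: completions can happen, the AL can actually drain */
                drbd_al_shrink(device);
        }
        /* al-extents unchanged: skip the AL drain entirely */
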
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f708bd08
    • drbd: fix comment typos · a2823ea9
      Lars Ellenberg authored
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a2823ea9
    • drbd: reject attach of unsuitable uuids even if connected · fe43ed97
      Lars Ellenberg authored
      Multiple failure scenario:
      a) all good
         Connected Primary/Secondary UpToDate/UpToDate
      b) lose disk on Primary,
         Connected Primary/Secondary Diskless/UpToDate
      c) continue to write to the device,
         changes only make it to the Secondary storage.
      d) lose disk on Secondary,
         Connected Primary/Secondary Diskless/Diskless
      e) now try to re-attach on Primary
      
      This would have succeeded before, even though that is clearly the
      wrong data set to attach to (missing the modifications from c).
      Because we only compared our "effective" and the "to-be-attached"
      data generation uuid tags if (device->state.conn < C_CONNECTED).
      
      Fix: change that constraint to (device->state.pdsk != D_UP_TO_DATE),
      compare the uuids, and reject the attach.
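
      As a before/after sketch (the uuid-comparison helper is an
      illustrative stand-in):

        /* before: only compared while not connected */
        if (device->state.conn < C_CONNECTED)
                check_attach_uuids(device);
        /* after: compare whenever the peer disk is not known UpToDate */
        if (device->state.pdsk != D_UP_TO_DATE)
                check_attach_uuids(device);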
      
      This patch also tries to improve the reverse scenario:
      first lose Secondary, then Primary disk,
      then try to attach the disk on Secondary.
      
      Before this patch, the attach on the Secondary succeeds, but since commit
      drbd: disconnect, if the wrong UUIDs are attached on a connected peer
      the Primary will notice unsuitable data, and drop the connection hard.
      
      Unfortunately, that happens at a point during the handshake where
      we cannot easily abort the attach on the peer without more
      refactoring of the handshake.
      
      We now reject any attach to "unsuitable" uuids,
      as long as we can see a Primary role,
      unless we already have access to "good" data.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fe43ed97
    • drbd: attach on connected diskless peer must not shrink a consistent device · ad6e8979
      Lars Ellenberg authored
      If we would reject a new handshake because the peer had attached
      first and then connected, we should also force a disconnect if the
      peer first connects, and only then attaches.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ad6e8979
    • drbd: fix confusing error message during attach · 4ef2a4f4
      Lars Ellenberg authored
      If we attach a (consistent) backing device,
      which knows about a last-agreed effective size,
      and that effective size is *larger* than the currently requested size,
      we refused to attach with ERR_DISK_TOO_SMALL
        Failure: (111) Low.dev. smaller than requested DRBD-dev. size.
      which is confusing to say the least.
      
      This patch changes the error code in that case to ERR_IMPLICIT_SHRINK
        Failure: (170) Implicit device shrinking not allowed. See kernel log.
        additional info from kernel:
        To-be-attached device has last effective > current size, and is consistent
        (9999 > 7777 sectors). Refusing to attach.
      
      It also allows attaching with an explicit size.
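
      A sketch of the changed check (variable names illustrative):

        if (!explicit_size_requested &&
            last_agreed_sectors > my_current_sectors &&
            backing_disk_is_consistent) {
                drbd_err(device, "To-be-attached device has last effective > current size, and is consistent\n");
                retcode = ERR_IMPLICIT_SHRINK;   /* was: ERR_DISK_TOO_SMALL */
                goto fail;
        }
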
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4ef2a4f4
    • drbd: disconnect, if the wrong UUIDs are attached on a connected peer · b17b5960
      Lars Ellenberg authored
      With "on-no-data-accessible suspend-io", DRBD requires the next attach
      or connect to be to the very same data generation uuid tag it lost last.
      
      If we first lost connection to the peer,
      then later lost connection to our own disk,
      we would usually refuse to re-connect to the peer,
      because it presents the wrong data set.
      
      However, if the peer first connected without a disk,
      and then attached its disk, we accepted that same wrong data set,
      which would be "unexpected" by any user of that DRBD
      and cause "undefined results" (read: very likely data corruption).
      
      The fix is to forcefully disconnect as soon as we notice that the peer
      attached to the "wrong" dataset.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b17b5960
    • drbd: ignore "all zero" peer volume sizes in handshake · 94c43a13
      Lars Ellenberg authored
      During handshake, if we are diskless ourselves, we used to accept any size
      presented by the peer.
      
      Which could be zero if that peer was just brought up and connected
      to us without having a disk attached first, in which case both
      peers would just "flip" their volume sizes.
      
      Now, even a diskless node will ignore "zero" sizes
      presented by a diskless peer.
      
      Also, a currently Diskless Primary will refuse to shrink during handshake:
      it may be frozen, and waiting for a "suitable" local disk or peer to
      re-appear (on-no-data-accessible suspend-io). If the peer is smaller
      than what we used to be, it is not suitable.
      
      The logic for a diskless node during handshake is now supposed to be:
      believe the peer, if
       - I don't have a current size myself
       - we agree on the size anyways
       - I do have a current size, am Secondary, and he has the only disk
       - I do have a current size, am Primary, and he has the only disk,
         which is larger than my current size
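
      Those rules, sketched as C (helper names illustrative):

        static bool believe_peer_size(struct drbd_device *device,
                                      sector_t my_size, sector_t peer_size)
        {
                if (peer_size == 0)             /* ignore "all zero" sizes */
                        return false;
                if (my_size == 0)               /* no current size myself */
                        return true;
                if (peer_size == my_size)       /* we agree anyway */
                        return true;
                if (!peer_has_the_only_disk(device))
                        return false;
                if (device->state.role == R_SECONDARY)
                        return true;            /* Secondary: follow the disk */
                return peer_size > my_size;     /* Primary: never shrink */
        }
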
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      94c43a13
    • drbd: centralize printk reporting of new size into drbd_set_my_capacity() · d5412e8d
      Lars Ellenberg authored
      Previously, some implicit resizes that happened during the handshake
      were not reported as prominently as an explicit resize.
      Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d5412e8d
    • drbd: narrow rcu_read_lock in drbd_sync_handshake · d29e89e3
      Roland Kammerer authored
      So far there was the possibility that we called
      genlmsg_new(GFP_NOIO)/mutex_lock() while holding an rcu_read_lock().
      
      This included cases like:
      
      drbd_sync_handshake (acquire the RCU lock)
        drbd_asb_recover_1p
          drbd_khelper
            drbd_bcast_event
              genlmsg_new(GFP_NOIO) --> may sleep
      
      drbd_sync_handshake (acquire the RCU lock)
        drbd_asb_recover_1p
          drbd_khelper
            notify_helper
              genlmsg_new(GFP_NOIO) --> may sleep
      
      drbd_sync_handshake (acquire the RCU lock)
        drbd_asb_recover_1p
          drbd_khelper
            notify_helper
              mutex_lock --> may sleep
      
      While using GFP_ATOMIC would have been possible in the first two cases,
      the real fix is to narrow the rcu_read_lock.
      Reported-by: Jia-Ju Bai <baijiaju1990@163.com>
      Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d29e89e3
  2. 19 Dec, 2018 4 commits
    • block: save irq state in blkg_lookup_create() · 3a762de5
      Ming Lei authored
      blkg_lookup_create() may be called from pool_map(), in which irq
      state is saved, so we have to save and restore irq state in
      blkg_lookup_create() as well.
      
      Otherwise, the following lockdep warning can be triggered:
      
      [  104.258537] ================================
      [  104.259129] WARNING: inconsistent lock state
      [  104.259725] 4.20.0-rc6+ #545 Not tainted
      [  104.260268] --------------------------------
      [  104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
      [  104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
      [  104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.263747] {SOFTIRQ-ON-W} state was registered at:
      [  104.264417]   _raw_spin_unlock_irq+0x29/0x4c
      [  104.265014]   blkg_lookup_create+0xdc/0xe6
      [  104.265609]   bio_associate_blkg_from_css+0xd3/0x13f
      [  104.266312]   bio_associate_blkg+0x15a/0x1bb
      [  104.266913]   pool_map+0xe8/0x103 [dm_thin_pool]
      [  104.267572]   __map_bio+0x98/0x29c [dm_mod]
      [  104.268162]   __split_and_process_non_flush+0x29e/0x306 [dm_mod]
      [  104.269003]   __split_and_process_bio+0x16a/0x25b [dm_mod]
      [  104.269971]   __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
      [  104.270973]   generic_make_request+0x3f5/0x68b
      [  104.271676]   process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
      [  104.272531]   schedule_zero+0x239/0x273 [dm_thin_pool]
      [  104.273245]   process_cell+0x60c/0x6f1 [dm_thin_pool]
      [  104.273967]   do_worker+0x60c/0xca8 [dm_thin_pool]
      [  104.274635]   process_one_work+0x4eb/0x834
      [  104.275203]   worker_thread+0x318/0x484
      [  104.275740]   kthread+0x1d1/0x1e1
      [  104.276203]   ret_from_fork+0x3a/0x50
      [  104.276714] irq event stamp: 170003
      [  104.277201] hardirqs last  enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b
      [  104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55
      [  104.280273] softirqs last  enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73
      [  104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d
      [  104.282744]
      [  104.282744] other info that might help us debug this:
      [  104.283640]  Possible unsafe locking scenario:
      [  104.283640]
      [  104.284452]        CPU0
      [  104.284803]        ----
      [  104.285150]   lock(&(&pool->lock)->rlock#3);
      [  104.285762]   <Interrupt>
      [  104.286130]     lock(&(&pool->lock)->rlock#3);
      [  104.286750]
      [  104.286750]  *** DEADLOCK ***
      [  104.286750]
      [  104.287564] no locks held by swapper/49/0.
      [  104.288129]
      [  104.288129] stack backtrace:
      [  104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
      [  104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
      [  104.290858] Call Trace:
      [  104.291204]  <IRQ>
      [  104.291502]  dump_stack+0x9a/0xe6
      [  104.291968]  mark_lock+0x56c/0x7a6
      [  104.292442]  ? check_usage_backwards+0x209/0x209
      [  104.293086]  __lock_acquire+0x400/0x15bf
      [  104.293662]  ? check_chain_key+0x150/0x1aa
      [  104.294236]  lock_acquire+0x1a6/0x1e3
      [  104.294768]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.295444]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
      [  104.296143]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
      [  104.297031]  _raw_spin_lock_irqsave+0x46/0x55
      [  104.297659]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.298335]  thin_endio+0xcf/0x2a3 [dm_thin_pool]
      [  104.298997]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
      [  104.299886]  ? check_flags+0x20a/0x20a
      [  104.300408]  ? lock_acquire+0x1a6/0x1e3
      [  104.300954]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
      [  104.301865]  clone_endio+0x1bb/0x22d [dm_mod]
      [  104.302491]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
      [  104.303200]  ? bio_disassociate_blkg+0xc6/0x15f
      [  104.303836]  ? bio_endio+0x2b2/0x2da
      [  104.304349]  clone_endio+0x1f3/0x22d [dm_mod]
      [  104.304978]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
      [  104.305709]  ? bio_disassociate_blkg+0xc6/0x15f
      [  104.306333]  ? bio_endio+0x2b2/0x2da
      [  104.306853]  clone_endio+0x1f3/0x22d [dm_mod]
      [  104.307476]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
      [  104.308185]  ? bio_disassociate_blkg+0xc6/0x15f
      [  104.308817]  ? bio_endio+0x2b2/0x2da
      [  104.309319]  blk_update_request+0x2de/0x4cc
      [  104.309927]  blk_mq_end_request+0x2a/0x183
      [  104.310498]  blk_done_softirq+0x16a/0x1a6
      [  104.311051]  ? blk_softirq_cpu_dead+0xe2/0xe2
      [  104.311653]  ? __lock_is_held+0x2a/0x87
      [  104.312186]  __do_softirq+0x250/0x4e8
      [  104.312705]  irq_exit+0x7e/0x11d
      [  104.313157]  call_function_single_interrupt+0xf/0x20
      [  104.313860]  </IRQ>
      [  104.314163] RIP: 0010:native_safe_halt+0x2/0x3
      [  104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
      [  104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
      [  104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
      [  104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
      [  104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
      [  104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
      [  104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
      [  104.323307]  ? lockdep_hardirqs_on+0x26b/0x278
      [  104.323927]  default_idle+0xd9/0x1a8
      [  104.324427]  do_idle+0x162/0x2b2
      [  104.324891]  ? arch_cpu_idle_exit+0x28/0x28
      [  104.325467]  ? mark_held_locks+0x28/0x7f
      [  104.326031]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
      [  104.326719]  cpu_startup_entry+0x1d/0x1f
      [  104.327261]  start_secondary+0x2cb/0x308
      [  104.327806]  ? set_cpu_sibling_map+0x8a3/0x8a3
      [  104.328421]  secondary_startup_64+0xa4/0xb0
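
      The fix boils down to using the irq-saving lock variants in
      blkg_lookup_create(), roughly:

        /* before: assumed process context with irqs enabled */
        spin_lock_irq(&q->queue_lock);
        blkg = __blkg_lookup_create(blkcg, q);
        spin_unlock_irq(&q->queue_lock);

        /* after: safe even when the caller (pool_map() here) already
         * runs with irqs disabled or a lock held with irqs saved */
        unsigned long flags;

        spin_lock_irqsave(&q->queue_lock, flags);
        blkg = __blkg_lookup_create(blkcg, q);
        spin_unlock_irqrestore(&q->queue_lock, flags);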
      
      Fixes: b978962a ("blkcg: update blkg_lookup_create() to do locking")
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3a762de5
    • dm: don't reuse bio for flushes · dbe3ece1
      Jens Axboe authored
      DM currently has a statically allocated bio that it uses to issue empty
      flushes. It doesn't submit this bio, it just uses it for maintaining
      state while setting up clones. Multiple users can access this bio at the
      same time. This wasn't previously an issue, even if it was a bit iffy,
      but with the blkg associations it can become one.
      
      We set up the blkg association, then clone bios and submit, then remove
      the blkg association again. But since we can have multiple tasks doing
      this at the same time, against multiple blkgs, we can either lose
      references to a blkg, or put it twice. The latter causes complaints on
      the percpu ref being <= 0 when released, and can cause use-after-free
      as well. Ming reports that xfstests generic/475 triggers this:
      
      ------------[ cut here ]------------
      percpu ref (blkg_release) <= 0 (0) after switching to atomic
      WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0
      
      Switch to just using an on-stack bio for this, and get rid of the
      embedded bio.
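
      Roughly, the on-stack replacement looks like this (sketch following
      the description above; ci is DM's internal clone setup state):

        /* on-stack bio, used only to set up clones for an empty flush;
         * each task now gets its own, nothing is shared */
        struct bio flush_bio;

        bio_init(&flush_bio, NULL, 0);          /* no biovecs needed */
        flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;
        ci.bio = &flush_bio;
        ci.sector_count = 0;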
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Reported-by: Ming Lei <ming.lei@redhat.com>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dbe3ece1
    • Merge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block · 499aeb45
      Jens Axboe authored
      Pull last batch of NVMe updates for 4.21 from Christoph:
      
      "This contains a series from Sagi to restore poll support for nvme-rdma,
       a new tracepoint from yupeng and various fixes."
      
      * 'nvme-4.21' of git://git.infradead.org/nvme:
        nvme-pci: trace SQ status on completions
        nvme-rdma: implement polling queue map
        nvme-fabrics: allow user to pass in nr_poll_queues
        nvme-fabrics: allow nvmf_connect_io_queue to poll
        nvme-core: optionally poll sync commands
        block: make request_to_qc_t public
        nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
        nvme-tcp: fix endianess annotations
        nvmet-tcp: fix endianess annotations
        nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
        nvme-pci: only set nr_maps to 2 if poll queues are supported
        nvmet: use a macro for default error location
        nvmet: fix comparison of a u16 with -1
      499aeb45
    • nvme-pci: trace SQ status on completions · 604c01d5
      yupeng authored
      Export the disk name, queue id, sq_head, sq_tail to a trace event in
      completion handling.
      
      Usage example:
      
      cd /sys/kernel/debug/tracing/events/nvme/nvme_sq
      
      echo 'disk=="nvme1n1"' > filter
      
      echo 1 > enable
      
      cat /sys/kernel/debug/tracing/trace_pipe
      Signed-off-by: yupeng <yupeng0921@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      [hch: slight formatting tweaks, use standard nvme tracepoint
       conventions]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      
      604c01d5
  3. 18 Dec, 2018 14 commits
  4. 17 Dec, 2018 7 commits
    • blk-mq: skip zero-queue maps in blk_mq_map_swqueue · e5edd5f2
      Ming Lei authored
      Since commit 7e849dd9 ("nvme-pci: don't share queue maps"), the
      mapping table won't actually be initialized if map->nr_queues is
      zero, so we can no longer use blk_mq_map_queue_type() to retrieve
      the hctx.
      
      This can still lead to a broken mapping; fix it by skipping
      zero-queue maps in blk_mq_map_swqueue().
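
      Sketched (loop body abbreviated):

        /* in blk_mq_map_swqueue(): skip map types whose mapping table
         * was never initialized */
        for (j = 0; j < set->nr_maps; j++) {
                if (!set->map[j].nr_queues)
                        continue;     /* zero-queue map: table not valid */
                hctx = blk_mq_map_queue_type(q, j, i);
                /* ... associate ctx with hctx for this type ... */
        }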
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e5edd5f2
    • block: fix blk-iolatency accounting underflow · 13369816
      Dennis Zhou authored
      The blk-iolatency controller measures the time from rq_qos_throttle() to
      rq_qos_done_bio() and attributes this time to the first bio that needs
      to create the request. This means if a bio is plug-mergeable or
      bio-mergeable, it gets to bypass the blk-iolatency controller.
      
      The recent series [1] to tag all bios with blkgs undermined how iolatency
      was determining which bios it was charging and should process in
      rq_qos_done_bio(). Because all bios are being tagged, this caused the
      atomic_t for the struct rq_wait inflight count to underflow and result
      in a stall.
      
      This patch adds a new flag BIO_TRACKED to let controllers know that a
      bio is going through the rq_qos path. blk-iolatency now checks if this
      flag is set to see if it should process the bio in rq_qos_done_bio().
      
      Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing.
      BIO_THROTTLED was another candidate, but the flag is set for all bios
      that have gone through blk-throttle code. Overloading a flag comes with
      the burden of making sure that when either implementation changes, a
      change in setting rules for one doesn't cause a bug in the other. So
      here, we unfortunately opt for adding a new flag.
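
      Sketched, the flag usage looks roughly like this:

        static inline void rq_qos_throttle(struct request_queue *q,
                                           struct bio *bio)
        {
                if (q->rq_qos) {
                        /* BIO_TRACKED lets rq_qos_done_bio() know this
                         * bio went through the rq_qos path */
                        bio_set_flag(bio, BIO_TRACKED);
                        __rq_qos_throttle(q->rq_qos, bio);
                }
        }

        /* blk-iolatency completion side: */
        if (!bio_flagged(bio, BIO_TRACKED))
                return;         /* merged bio, never throttled: skip */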
      
      [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      13369816
    • blk-mq: fix dispatch from sw queue · c16d6b5a
      Ming Lei authored
      When a request is added to the rq list of a sw queue (ctx), the
      request may belong to a different hctx type, especially after
      multiple queue mapping was introduced.
      
      So when dispatching requests from a sw queue via
      blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), a request
      belonging to another hctx type can be dispatched to the current
      hctx when read or poll queues are enabled.
      
      This patch fixes the issue by introducing per-queue-type lists.
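
      Sketched, the per-type lists look roughly like this (see also the
      note below on cacheline placement):

        struct blk_mq_ctx {
                struct {
                        spinlock_t        lock;
                        struct list_head  rq_lists[HCTX_MAX_TYPES];
                } ____cacheline_aligned_in_smp;
                /* ... */
        };

        /* insertion and dispatch now pick the list for the hctx type: */
        list_add_tail(&rq->queuelist, &ctx->rq_lists[hctx->type]);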
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
      Changed by me to not use separately cacheline aligned lists, just
      place them all in the same cacheline where we had just the one list
      and lock before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c16d6b5a
    • block: mq-deadline: Fix write completion handling · 7211aef8
      Damien Le Moal authored
      For a zoned block device using mq-deadline, if a write request for a
      zone is received while another write was already dispatched for the same
      zone, dd_dispatch_request() will return NULL and the newly inserted
      write request is kept in the scheduler queue waiting for the ongoing
      zone write to complete. With this behavior, when no other request has
      been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
      and blk_mq_sched_mark_restart_hctx() is not called. This in turn
      prevents the __blk_mq_free_request() call of blk_mq_sched_restart()
      from running the queue when the already dispatched write request
      completes. The newly inserted request stays stuck in the scheduler
      queue until eventually another request is submitted.
      
      This problem does not affect SCSI disks, as the SCSI stack handles
      queue restarts on request completion. However, the problem can be
      triggered with the null_blk driver with zoned mode enabled.
      
      Fix this by always requesting a queue restart in dd_dispatch_request()
      if no request was dispatched while WRITE requests are queued.
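
      Sketched (simplified; locking context abbreviated):

        /* in dd_dispatch_request(): */
        spin_lock(&dd->lock);
        rq = __dd_dispatch_request(dd);
        if (!rq && blk_queue_is_zoned(hctx->queue) &&
            !list_empty(&dd->fifo_list[WRITE]))
                blk_mq_sched_mark_restart_hctx(hctx); /* re-run on completion */
        spin_unlock(&dd->lock);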
      
      Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      
      Add missing export of blk_mq_sched_restart()
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7211aef8
    • nvme-pci: don't share queue maps · 7e849dd9
      Christoph Hellwig authored
      Now that the block layer checks if a queue map has any queues inside
      it, there is no more reason to duplicate the maps for the non-default
      types.
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7e849dd9
    • blk-mq: only dispatch to non-default queue maps if they have queues · 5aceaeb2
      Christoph Hellwig authored
      We should check if a given queue map actually has queues enabled before
      dispatching to it.  This allows drivers to not initialize optional but
      not used map types, which subsequently will allow fixing problems with
      queue map rebuilds for that case.
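
      The idea, sketched (the function name is illustrative; the real
      change is in the blk-mq queue-mapping helper):

        static struct blk_mq_hw_ctx *map_queue_sketch(struct request_queue *q,
                                                      unsigned int flags,
                                                      unsigned int cpu)
        {
                int type = HCTX_TYPE_DEFAULT;

                /* only pick a non-default map type if it actually has
                 * queues, otherwise fall back to the default map */
                if ((flags & REQ_HIPRI) &&
                    q->tag_set->map[HCTX_TYPE_POLL].nr_queues)
                        type = HCTX_TYPE_POLL;
                else if (((flags & REQ_OP_MASK) == REQ_OP_READ) &&
                         q->tag_set->map[HCTX_TYPE_READ].nr_queues)
                        type = HCTX_TYPE_READ;

                return blk_mq_map_queue_type(q, type, cpu);
        }
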
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5aceaeb2
    • blk-mq: export hctx->type in debugfs instead of sysfs · 346fc108
      Ming Lei authored
      Currently we only export hctx->type via sysfs, and there is no such
      info in the hctx entry under debugfs. We often use debugfs to
      diagnose queue mapping issues, so add that support to debugfs.
      
      Queue mapping has become a bit more complicated after multiple queue
      mapping was introduced; we may write a blktests case to verify that
      a queue mapping is valid based on blk-mq-debugfs.
      
      Given there is no need to export hctx->type twice, remove the export
      from sysfs.
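
      Sketched, the debugfs attribute looks roughly like this:

        static int hctx_type_show(void *data, struct seq_file *m)
        {
                static const char *const hctx_types[] = {
                        [HCTX_TYPE_DEFAULT] = "default",
                        [HCTX_TYPE_READ]    = "read",
                        [HCTX_TYPE_POLL]    = "poll",
                };
                struct blk_mq_hw_ctx *hctx = data;

                BUILD_BUG_ON(ARRAY_SIZE(hctx_types) != HCTX_MAX_TYPES);
                seq_printf(m, "%s\n", hctx_types[hctx->type]);
                return 0;
        }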
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      346fc108