Commits · fe43ed97bba3b11521abd934b83ed93143470e4f · Kirill Smelkov / linux

20 Dec, 2018 8 commits

drbd: reject attach of unsuitable uuids even if connected · fe43ed97

Lars Ellenberg authored Dec 20, 2018

Multiple failure scenario:
a) all good
   Connected Primary/Secondary UpToDate/UpToDate
b) lose disk on Primary,
   Connected Primary/Secondary Diskless/UpToDate
c) continue to write to the device,
   changes only make it to the Secondary storage.
d) lose disk on Secondary,
   Connected Primary/Secondary Diskless/Diskless
e) now try to re-attach on Primary

This would have succeeded before, even though that is clearly the
wrong data set to attach to (missing the modifications from c).
Because we only compared our "effective" and the "to-be-attached"
data generation uuid tags if (device->state.conn < C_CONNECTED).

Fix: change that constraint to (device->state.pdsk != D_UP_TO_DATE)
compare the uuids, and reject the attach.

This patch also tries to improve the reverse scenario:
first lose Secondary, then Primary disk,
then try to attach the disk on Secondary.

Before this patch, the attach on the Secondary succeeds, but since commit
drbd: disconnect, if the wrong UUIDs are attached on a connected peer
the Primary will notice unsuitable data, and drop the connection hard.

Though unfortunately at a point in time during the handshake where
we cannot easily abort the attach on the peer without more
refactoring of the handshake.

We now reject any attach to "unsuitable" uuids,
as long as we can see a Primary role,
unless we already have access to "good" data.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fe43ed97

drbd: attach on connected diskless peer must not shrink a consistent device · ad6e8979

Lars Ellenberg authored Dec 20, 2018

If we would reject a new handshake, if the peer had attached first,
and then connected, we should force disconnect if the peer first connects,
and only then attaches.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ad6e8979

drbd: fix confusing error message during attach · 4ef2a4f4

Lars Ellenberg authored Dec 20, 2018

If we attach a (consistent) backing device,
which knows about a last-agreed effective size,
and that effective size is *larger* than the currently requested size,
we refused to attach with ERR_DISK_TOO_SMALL
  Failure: (111) Low.dev. smaller than requested DRBD-dev. size.
which is confusing to say the least.

This patch changes the error code in that case to ERR_IMPLICIT_SHRINK
  Failure: (170) Implicit device shrinking not allowed. See kernel log.
  additional info from kernel:
  To-be-attached device has last effective > current size, and is consistent
  (9999 > 7777 sectors). Refusing to attach.

It also allows to attach with an explicit size.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4ef2a4f4

drbd: disconnect, if the wrong UUIDs are attached on a connected peer · b17b5960

Lars Ellenberg authored Dec 20, 2018

With "on-no-data-accessible suspend-io", DRBD requires the next attach
or connect to be to the very same data generation uuid tag it lost last.

If we first lost connection to the peer,
then later lost connection to our own disk,
we would usually refuse to re-connect to the peer,
because it presents the wrong data set.

However, if the peer first connects without a disk,
and then attached its disk, we accepted that same wrong data set,
which would be "unexpected" by any user of that DRBD
and cause "undefined results" (read: very likely data corruption).

The fix is to forcefully disconnect as soon as we notice that the peer
attached to the "wrong" dataset.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b17b5960

drbd: ignore "all zero" peer volume sizes in handshake · 94c43a13

Lars Ellenberg authored Dec 20, 2018

During handshake, if we are diskless ourselves, we used to accept any size
presented by the peer.

Which could be zero if that peer was just brought up and connected
to us without having a disk attached first, in which case both
peers would just "flip" their volume sizes.

Now, even a diskless node will ignore "zero" sizes
presented by a diskless peer.

Also a currently Diskless Primary will refuse to shrink during handshake:
it may be frozen, and waiting for a "suitable" local disk or peer to
re-appear (on-no-data-accessible suspend-io). If the peer is smaller
than what we used to be, it is not suitable.

The logic for a diskless node during handshake is now supposed to be:
believe the peer, if
 - I don't have a current size myself
 - we agree on the size anyways
 - I do have a current size, am Secondary, and he has the only disk
 - I do have a current size, am Primary, and he has the only disk,
   which is larger than my current size
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

94c43a13

drbd: centralize printk reporting of new size into drbd_set_my_capacity() · d5412e8d

Lars Ellenberg authored Dec 20, 2018

Previously, some implicit resizes that happend during handshake
have not been reported as prominently as explicit resize.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d5412e8d

drbd: must not use connection after kref_put(&connection->kref) · 792c3fdd
Lars Ellenberg authored Dec 20, 2018
```
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
```
792c3fdd

drbd: narrow rcu_read_lock in drbd_sync_handshake · d29e89e3

Roland Kammerer authored Dec 20, 2018

So far there was the possibility that we called
genlmsg_new(GFP_NOIO)/mutex_lock() while holding an rcu_read_lock().

This included cases like:

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      drbd_bcast_event
        genlmsg_new(GFP_NOIO) --> may sleep

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      notify_helper
        genlmsg_new(GFP_NOIO) --> may sleep

drbd_sync_handshake (acquire the RCU lock)
  drbd_asb_recover_1p
    drbd_khelper
      notify_helper
        mutex_lock --> may sleep

While using GFP_ATOMIC whould have been possible in the first two cases,
the real fix is to narrow the rcu_read_lock.
Reported-by: Jia-Ju Bai <baijiaju1990@163.com>
Reviewed-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Roland Kammerer <roland.kammerer@linbit.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d29e89e3

19 Dec, 2018 4 commits

block: save irq state in blkg_lookup_create() · 3a762de5

Ming Lei authored Dec 20, 2018

blkg_lookup_create() may be called from pool_map() in which
irq state is saved, so we have to do that in blkg_lookup_create().

Otherwise, the following lockdep warning can be triggered:

[  104.258537] ================================
[  104.259129] WARNING: inconsistent lock state
[  104.259725] 4.20.0-rc6+ #545 Not tainted
[  104.260268] --------------------------------
[  104.260865] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  104.261727] swapper/49/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
[  104.262444] 00000000db365b5d (&(&pool->lock)->rlock#3){+.?.}, at: thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.263747] {SOFTIRQ-ON-W} state was registered at:
[  104.264417]   _raw_spin_unlock_irq+0x29/0x4c
[  104.265014]   blkg_lookup_create+0xdc/0xe6
[  104.265609]   bio_associate_blkg_from_css+0xd3/0x13f
[  104.266312]   bio_associate_blkg+0x15a/0x1bb
[  104.266913]   pool_map+0xe8/0x103 [dm_thin_pool]
[  104.267572]   __map_bio+0x98/0x29c [dm_mod]
[  104.268162]   __split_and_process_non_flush+0x29e/0x306 [dm_mod]
[  104.269003]   __split_and_process_bio+0x16a/0x25b [dm_mod]
[  104.269971]   __dm_make_request.isra.14+0xdc/0x124 [dm_mod]
[  104.270973]   generic_make_request+0x3f5/0x68b
[  104.271676]   process_prepared_mapping+0x166/0x1ef [dm_thin_pool]
[  104.272531]   schedule_zero+0x239/0x273 [dm_thin_pool]
[  104.273245]   process_cell+0x60c/0x6f1 [dm_thin_pool]
[  104.273967]   do_worker+0x60c/0xca8 [dm_thin_pool]
[  104.274635]   process_one_work+0x4eb/0x834
[  104.275203]   worker_thread+0x318/0x484
[  104.275740]   kthread+0x1d1/0x1e1
[  104.276203]   ret_from_fork+0x3a/0x50
[  104.276714] irq event stamp: 170003
[  104.277201] hardirqs last  enabled at (170002): [<ffffffff81bcc33e>] _raw_spin_unlock_irqrestore+0x44/0x6b
[  104.278535] hardirqs last disabled at (170003): [<ffffffff81bcc1ad>] _raw_spin_lock_irqsave+0x20/0x55
[  104.280273] softirqs last  enabled at (169978): [<ffffffff810d13d4>] irq_enter+0x4c/0x73
[  104.281617] softirqs last disabled at (169979): [<ffffffff810d1479>] irq_exit+0x7e/0x11d
[  104.282744]
[  104.282744] other info that might help us debug this:
[  104.283640]  Possible unsafe locking scenario:
[  104.283640]
[  104.284452]        CPU0
[  104.284803]        ----
[  104.285150]   lock(&(&pool->lock)->rlock#3);
[  104.285762]   <Interrupt>
[  104.286130]     lock(&(&pool->lock)->rlock#3);
[  104.286750]
[  104.286750]  *** DEADLOCK ***
[  104.286750]
[  104.287564] no locks held by swapper/49/0.
[  104.288129]
[  104.288129] stack backtrace:
[  104.288738] CPU: 49 PID: 0 Comm: swapper/49 Not tainted 4.20.0-rc6+ #545
[  104.289700] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014
[  104.290858] Call Trace:
[  104.291204]  <IRQ>
[  104.291502]  dump_stack+0x9a/0xe6
[  104.291968]  mark_lock+0x56c/0x7a6
[  104.292442]  ? check_usage_backwards+0x209/0x209
[  104.293086]  __lock_acquire+0x400/0x15bf
[  104.293662]  ? check_chain_key+0x150/0x1aa
[  104.294236]  lock_acquire+0x1a6/0x1e3
[  104.294768]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.295444]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
[  104.296143]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[  104.297031]  _raw_spin_lock_irqsave+0x46/0x55
[  104.297659]  ? thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.298335]  thin_endio+0xcf/0x2a3 [dm_thin_pool]
[  104.298997]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[  104.299886]  ? check_flags+0x20a/0x20a
[  104.300408]  ? lock_acquire+0x1a6/0x1e3
[  104.300954]  ? process_prepared_discard_fail+0x36/0x36 [dm_thin_pool]
[  104.301865]  clone_endio+0x1bb/0x22d [dm_mod]
[  104.302491]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
[  104.303200]  ? bio_disassociate_blkg+0xc6/0x15f
[  104.303836]  ? bio_endio+0x2b2/0x2da
[  104.304349]  clone_endio+0x1f3/0x22d [dm_mod]
[  104.304978]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
[  104.305709]  ? bio_disassociate_blkg+0xc6/0x15f
[  104.306333]  ? bio_endio+0x2b2/0x2da
[  104.306853]  clone_endio+0x1f3/0x22d [dm_mod]
[  104.307476]  ? disable_write_zeroes+0x20/0x20 [dm_mod]
[  104.308185]  ? bio_disassociate_blkg+0xc6/0x15f
[  104.308817]  ? bio_endio+0x2b2/0x2da
[  104.309319]  blk_update_request+0x2de/0x4cc
[  104.309927]  blk_mq_end_request+0x2a/0x183
[  104.310498]  blk_done_softirq+0x16a/0x1a6
[  104.311051]  ? blk_softirq_cpu_dead+0xe2/0xe2
[  104.311653]  ? __lock_is_held+0x2a/0x87
[  104.312186]  __do_softirq+0x250/0x4e8
[  104.312705]  irq_exit+0x7e/0x11d
[  104.313157]  call_function_single_interrupt+0xf/0x20
[  104.313860]  </IRQ>
[  104.314163] RIP: 0010:native_safe_halt+0x2/0x3
[  104.314792] Code: 63 02 df f0 83 44 24 fc 00 48 89 df e8 cc 3f 7a ff 48 8b 03 a8 08 74 0b 65 81 25 9d 31 45 7e ff ff ff 7f 5b 5d 41 5c c3 fb f4 <c3> f4 c3 0f 1f 44 00 00 41 56 41 55 41 54 55 53 e8 a2 0d 5c ff e8
[  104.317339] RSP: 0018:ffff888106c9fdc0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
[  104.318390] RAX: 1ffff11020d92100 RBX: 0000000000000000 RCX: ffffffff81159ac7
[  104.319366] RDX: 1ffffffff05d5e69 RSI: 0000000000000007 RDI: ffff888106c90d1c
[  104.320339] RBP: 0000000000000000 R08: dffffc0000000000 R09: 0000000000000001
[  104.321313] R10: ffffed1025d57ba0 R11: ffffed1025d57b9f R12: 1ffff11020d93fbf
[  104.322328] R13: 0000000000000031 R14: ffff888106c90040 R15: 0000000000000000
[  104.323307]  ? lockdep_hardirqs_on+0x26b/0x278
[  104.323927]  default_idle+0xd9/0x1a8
[  104.324427]  do_idle+0x162/0x2b2
[  104.324891]  ? arch_cpu_idle_exit+0x28/0x28
[  104.325467]  ? mark_held_locks+0x28/0x7f
[  104.326031]  ? _raw_spin_unlock_irqrestore+0x44/0x6b
[  104.326719]  cpu_startup_entry+0x1d/0x1f
[  104.327261]  start_secondary+0x2cb/0x308
[  104.327806]  ? set_cpu_sibling_map+0x8a3/0x8a3
[  104.328421]  secondary_startup_64+0xa4/0xb0

Fixes: b978962a ("blkcg: update blkg_lookup_create() to do locking")
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3a762de5

dm: don't reuse bio for flushes · dbe3ece1

Jens Axboe authored Dec 19, 2018

DM currently has a statically allocated bio that it uses to issue empty
flushes. It doesn't submit this bio, it just uses it for maintaining
state while setting up clones. Multiple users can access this bio at the
same time. This wasn't previously an issue, even if it was a bit iffy,
but with the blkg associations it can become one.

We setup the blkg association, then clone bio's and submit, then remove
the blkg assocation again. But since we can have multiple tasks doing
this at the same time, against multiple blkg's, then we can either lose
references to a blkg, or put it twice. The latter causes complaints on
the percpu ref being <= 0 when released, and can cause use-after-free as
well. Ming reports that xfstest generic/475 triggers this:

------------[ cut here ]------------
percpu ref (blkg_release) <= 0 (0) after switching to atomic
WARNING: CPU: 13 PID: 0 at lib/percpu-refcount.c:155 percpu_ref_switch_to_atomic_rcu+0x2c9/0x4a0

Switch to just using an on-stack bio for this, and get rid of the
embedded bio.

Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
Reported-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

dbe3ece1

Merge branch 'nvme-4.21' of git://git.infradead.org/nvme into for-4.21/block · 499aeb45

Jens Axboe authored Dec 19, 2018

Pull last batch of NVMe updates for 4.21 from Christoph:

"This contains a series from Sagi to restore poll support for nvme-rdma,
 a new tracepoint from yupeng and various fixes."

* 'nvme-4.21' of git://git.infradead.org/nvme:
  nvme-pci: trace SQ status on completions
  nvme-rdma: implement polling queue map
  nvme-fabrics: allow user to pass in nr_poll_queues
  nvme-fabrics: allow nvmf_connect_io_queue to poll
  nvme-core: optionally poll sync commands
  block: make request_to_qc_t public
  nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
  nvme-tcp: fix endianess annotations
  nvmet-tcp: fix endianess annotations
  nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
  nvme-pci: only set nr_maps to 2 if poll queues are supported
  nvmet: use a macro for default error location
  nvmet: fix comparison of a u16 with -1

499aeb45

nvme-pci: trace SQ status on completions · 604c01d5

yupeng authored Dec 18, 2018

Export the disk name, queue id, sq_head, sq_tail to a trace event in
completion handling.

Usage example:

cd /sys/kernel/debug/tracing/events/nvme/nvme_sq

echo 'disk=="nvme1n1"' > filter

echo 1 > enable

cat /sys/kernel/debug/tracing/trace_pipe
Signed-off-by: yupeng <yupeng0921@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <keith.busch@intel.com>
[hch: slight formatting tweaks, use standard nvme tracepoint
 conventions]
Signed-off-by: Christoph Hellwig <hch@lst.de>

wip

604c01d5

18 Dec, 2018 14 commits

nvme-rdma: implement polling queue map · ff8519f9

Sagi Grimberg authored Dec 14, 2018

When passed with nr_poll_queues setup additional queues with cq polling
context IB_POLL_DIRECT (no interrupts) and make sure to set
QUEUE_FLAG_POLL on the connect_q. In addition add the third queue
mapping for polling queues.

nvmf connect on this queue is polled for like all other requests so make
nvmf_connect_io_queue poll for polling queues.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

ff8519f9

nvme-fabrics: allow user to pass in nr_poll_queues · 89d43802

Sagi Grimberg authored Dec 14, 2018

This argument will specify how many polling I/O queues to connect when
creating the controller. These I/O queues will host I/O that is set with
REQ_HIPRI.
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

89d43802

nvme-fabrics: allow nvmf_connect_io_queue to poll · 26c68227

Sagi Grimberg authored Dec 14, 2018

Preparation for polling support for fabrics. Polling support
means that our completion queues are not generating any interrupts
which means we need to poll for the nvmf io queue connect as well.

Reviewed by Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

26c68227

nvme-core: optionally poll sync commands · 6287b51c

Sagi Grimberg authored Dec 14, 2018

Pass poll bool to indicate that we need it to poll. This prepares us for
polling support in nvmf since connect is an I/O that will be queued
and has to be polled in order to complete. If poll is passed,
we call nvme_execute_rq_polled which sends the requests and polls
for its completion.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

6287b51c

block: make request_to_qc_t public · 7b7ab780

Sagi Grimberg authored Dec 14, 2018

block consumers will need it for polling requests that
are sent with blk_execute_rq_nowait. Also, get rid of
blk_tag_to_qc_t and open-code it instead.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

7b7ab780

nvme-tcp: fix spelling mistake "attepmpt" -> "attempt" · 56a77d26

Colin Ian King authored Dec 14, 2018

There is a spelling mistake in a dev_info message, fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>

56a77d26

nvme-tcp: fix endianess annotations · a7273d40

Christoph Hellwig authored Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

a7273d40

nvmet-tcp: fix endianess annotations · f4d10b5c

Christoph Hellwig authored Dec 13, 2018

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>

f4d10b5c

nvme-pci: refactor nvme_poll_irqdisable to make sparse happy · 91a509f8

Christoph Hellwig authored Dec 13, 2018

By duplicating the nvme_process_cq in both branches we keep the
sparse lock context checking happy, so do it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

91a509f8

nvme-pci: only set nr_maps to 2 if poll queues are supported · ed92ad37

Christoph Hellwig authored Dec 14, 2018

The block layer now enables polling support on a queue if nr_maps
includes the poll map, so we should only set that if we actually
support poll queues.

Fixes:  6544d229 ("block: enable polling by default if a poll map is initalized")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>

ed92ad37

nvmet: use a macro for default error location · 5698b805

Chaitanya Kulkarni authored Dec 17, 2018

This patch defines a new macro NVMET_NO_ERROR_LOC to represent the
default error location value in the nvme-error-log-page.
This is a pure cleanup patch and it does not change any functionality.
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

5698b805

nvmet: fix comparison of a u16 with -1 · 66c6afbd

Colin Ian King authored Dec 14, 2018

Currently the u16 req->error_loc is being compared to -1 which
will always be false.  Fix this by casting -1 to u16 to fix this.

Detected by clang:
  warning: result of comparison of constant -1 with expression of
  type 'u16' (aka 'unsigned short') is always false
  [-Wtautological-constant-out-of-range-compare]

Fixes: 76574f37 ("nvmet: add interface to update error-log page")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

66c6afbd

blk-mq: enable IO poll if .nr_queues of type poll > 0 · cd19181b

Ming Lei authored Dec 18, 2018

The queue mapping of type poll only exists when set->map[HCTX_TYPE_POLL].nr_queues
is bigger than zero, so enhance the constraint by checking .nr_queues of type poll
before enabling IO poll.

Otherwise IO race & timeout can be observed when running block/007.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cd19181b

blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight() · 3c94d83c

Jens Axboe authored Dec 17, 2018

There's a single user of this function, dm, and dm just wants
to check if IO is inflight, not that it's just allocated.

This fixes a hang with srp/002 in blktests with dm, where it tries
to suspend but waits for inflight IO to finish first. As it checks
for just allocated requests, this fails.
Tested-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3c94d83c

17 Dec, 2018 11 commits

blk-mq: skip zero-queue maps in blk_mq_map_swqueue · e5edd5f2

Ming Lei authored Dec 18, 2018

From 7e849dd9 ("nvme-pci: don't share queue maps"), the mapping
table won't be initialized actually if map->nr_queues is zero, so
we can't use blk_mq_map_queue_type() to retrieve hctx any more.

This way still may cause broken mapping, fix it by skipping zero-queues
maps in blk_mq_map_swqueue().

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e5edd5f2

block: fix blk-iolatency accounting underflow · 13369816

Dennis Zhou authored Dec 17, 2018

The blk-iolatency controller measures the time from rq_qos_throttle() to
rq_qos_done_bio() and attributes this time to the first bio that needs
to create the request. This means if a bio is plug-mergeable or
bio-mergeable, it gets to bypass the blk-iolatency controller.

The recent series [1], to tag all bios w/ blkgs undermined how iolatency
was determining which bios it was charging and should process in
rq_qos_done_bio(). Because all bios are being tagged, this caused the
atomic_t for the struct rq_wait inflight count to underflow and result
in a stall.

This patch adds a new flag BIO_TRACKED to let controllers know that a
bio is going through the rq_qos path. blk-iolatency now checks if this
flag is set to see if it should process the bio in rq_qos_done_bio().

Overloading BLK_QUEUE_ENTERED works, but makes the flag rules confusing.
BIO_THROTTLED was another candidate, but the flag is set for all bios
that have gone through blk-throttle code. Overloading a flag comes with
the burden of making sure that when either implementation changes, a
change in setting rules for one doesn't cause a bug in the other. So
here, we unfortunately opt for adding a new flag.

[1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/

Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

13369816

blk-mq: fix dispatch from sw queue · c16d6b5a

Ming Lei authored Dec 17, 2018

When a request is added to rq list of sw queue(ctx), the rq may be from
a different type of hctx, especially after multi queue mapping is
introduced.

So when dispach request from sw queue via blk_mq_flush_busy_ctxs() or
blk_mq_dequeue_from_ctx(), one request belonging to other queue type of
hctx can be dispatched to current hctx in case that read queue or poll
queue is enabled.

This patch fixes this issue by introducing per-queue-type list.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Changed by me to not use separately cacheline aligned lists, just
place them all in the same cacheline where we had just the one list
and lock before.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c16d6b5a

block: mq-deadline: Fix write completion handling · 7211aef8

Damien Le Moal authored Dec 17, 2018

For a zoned block device using mq-deadline, if a write request for a
zone is received while another write was already dispatched for the same
zone, dd_dispatch_request() will return NULL and the newly inserted
write request is kept in the scheduler queue waiting for the ongoing
zone write to complete. With this behavior, when no other request has
been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
and blk_mq_sched_mark_restart_hctx() not called. This in turn leads to
__blk_mq_free_request() call of blk_mq_sched_restart() to not run the
queue when the already dispatched write request completes. The newly
dispatched request stays stuck in the scheduler queue until eventually
another request is submitted.

This problem does not affect SCSI disk as the SCSI stack handles queue
restart on request completion. However, this problem is can be triggered
the nullblk driver with zoned mode enabled.

Fix this by always requesting a queue restart in dd_dispatch_request()
if no request was dispatched while WRITE requests are queued.

Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
Cc: <stable@vger.kernel.org>
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>

Add missing export of blk_mq_sched_restart()
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7211aef8

nvme-pci: don't share queue maps · 7e849dd9

Christoph Hellwig authored Dec 17, 2018

Now that the block layer checks if a queue map has any queues inside
it there is no more reason to duplicate the maps for the non-default
types.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7e849dd9

blk-mq: only dispatch to non-defauly queue maps if they have queues · 5aceaeb2

Christoph Hellwig authored Dec 17, 2018

We should check if a given queue map actually has queues enabled before
dispatching to it.  This allows drivers to not initialize optional but
not used map types, which subsequently will allow fixing problems with
queue map rebuilds for that case.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5aceaeb2

blk-mq: export hctx->type in debugfs instead of sysfs · 346fc108

Ming Lei authored Dec 17, 2018

Now we only export hctx->type via sysfs, and there isn't such info
in hctx entry under debugfs. We often use debugfs only to diagnose
queue mapping issue, so add the support in debugfs.

Queue mapping becomes a bit more complicated after multiple queue
mapping is supported, we may write blktest to verify if queue mapping
is valid based on blk-mq-debugfs.

Given not necessary to export hctx->type twice, so remove the export
from sysfs.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

346fc108

blk-mq: fix allocation for queue mapping table · 07b35eb5

Ming Lei authored Dec 17, 2018

Type of each element in queue mapping table is 'unsigned int,
intead of 'struct blk_mq_queue_map)', so fix it.

Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

07b35eb5

blk-wbt: export internal state via debugfs · d19afebc

Ming Lei authored Dec 17, 2018

This information is helpful to either investigate issues, or understand
wbt's internal behaviour.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d19afebc

blk-mq-debugfs: support rq_qos · cc56694f

Ming Lei authored Dec 17, 2018

blk-mq-debugfs has been proved as very helpful for debug some
tough issues, such as IO hang.

We have seen blk-wbt related IO hang several times, even inside
Red Hat BZ, there is such report not sovled yet, so this patch
adds support debugfs on rq_qos.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

cc56694f

block: update sysfs documentation · f9824952

Damien Le Moal authored Nov 30, 2018

Add the description of the zoned, nr_zones and chunk_sectors sysfs queue
attributes to Documentation/block/queue-sysfs.txt. The description of
the zoned and chunk_sector attributes are mostly copied from
ABI/testing/sysfs-block (added a typo fix). While at it, also fix a
typo in the description of the io_poll_delay attribute.

nr_zones description is also added to ABI/testing/sysfs-block and
contact email address updated for the zoned attribute.
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f9824952

16 Dec, 2018 3 commits

block: loop: check error using IS_ERR instead of IS_ERR_OR_NULL in loop_add() · 38a3499f

Chengguang Xu authored Dec 16, 2018

blk_mq_init_queue() will not return NULL pointer to its caller,
so it's better to replace IS_ERR_OR_NULL using IS_ERR in loop_add().

If in the future things change to check NULL pointer inside loop_add(),
we should return -ENOMEM as return code instead of PTR_ERR(NULL).
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

38a3499f

aoe: add __exit annotation · e7cc005f

Chengguang Xu authored Dec 16, 2018

Add __exit annotation to cleanup helper which
is only called once in the module.
Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e7cc005f

block: clear REQ_HIPRI if polling is not supported · d04c406f

Christoph Hellwig authored Dec 14, 2018

This prevents a HIPRI bio from being submitted through a stacking
driver that does not support polling and thus won't poll for I/O
completion.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d04c406f