- 21 Dec, 2021 4 commits
-
-
Tetsuo Handa authored
Since lo_simple_ioctl(LOOP_SET_BLOCK_SIZE) and ioctl(NBD_SET_BLKSIZE) pass user-controlled "unsigned long arg" to blk_validate_block_size(), "unsigned long" should be used for validation. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/9ecbf057-4375-c2db-ab53-e4cc0dff953d@i-love.sakura.ne.jp Signed-off-by: Jens Axboe <axboe@kernel.dk>
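A minimal sketch of the point being made, assuming the usual 512-byte-to-PAGE_SIZE power-of-two constraint: with a narrower parameter type, a user-controlled unsigned long such as 0x100000200 would be truncated and wrongly pass as 512.

```c
#include <linux/log2.h>

/* sketch only: validate the full unsigned long instead of a truncated int */
static inline int blk_validate_block_size(unsigned long bsize)
{
	if (bsize < 512 || bsize > PAGE_SIZE || !is_power_of_2(bsize))
		return -EINVAL;

	return 0;
}
```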
-
Christoph Hellwig authored
Once device_add is called, disk->ev will be freed by disk_release, so it must not be freed twice. Fix this by allocating disk->ev after device_add so that the extra local unwinding can be removed entirely. Based on an earlier patch from Tetsuo Handa. Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com> Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com> Fixes: 83cbce95 ("block: add error handling for device_add_disk / add_disk") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211221161851.788424-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
blk_stat_disable_accounting() is added in commit 68497092 ("block: make queue stat accounting a reference") and is called in kyber_exit_sched(), so we have to free q->stats after the elevator is unloaded from blk_exit_queue() in blk_release_queue(); otherwise a kernel panic is caused. Fixes: 68497092 ("block: make queue stat accounting a reference") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211221040436.1333880-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Don't combine the task exiting and "already have io_context" case, we need to just abort if the task is marked as dead. Return -ESRCH, which is the documented value for ioprio_set() if the specified task could not be found. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reported-by: syzbot+8836466a79f4175961b0@syzkaller.appspotmail.com Fixes: 5fc11eeb ("block: open code create_task_io_context in set_task_ioprio") Signed-off-by: Jens Axboe <axboe@kernel.dk>
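A rough sketch of the separated checks described above; the helper and cache names (alloc_io_context, iocontext_cachep) follow blk-ioc.c conventions, but treat this as an approximation rather than the exact patch.

```c
int set_task_ioprio(struct task_struct *task, int ioprio)
{
	struct io_context *ioc = alloc_io_context(GFP_ATOMIC, NUMA_NO_NODE);

	if (!ioc)
		return -ENOMEM;

	task_lock(task);
	if (task->flags & PF_EXITING) {
		/* dead task: drop the just-allocated context and bail out */
		task_unlock(task);
		kmem_cache_free(iocontext_cachep, ioc);
		return -ESRCH;
	}
	if (task->io_context)
		kmem_cache_free(iocontext_cachep, ioc);	/* already has one, reuse it */
	else
		task->io_context = ioc;
	task->io_context->ioprio = ioprio;
	task_unlock(task);
	return 0;
}
```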
-
- 20 Dec, 2021 2 commits
-
-
Keith Busch authored
The low level drivers don't expect to see new requests after a successful quiesce completes. Check the queue quiesce state within the rcu protected area prior to calling the driver's queue_rqs(). Fixes: 3c67d44d ("block: add mq_ops->queue_rqs hook") Signed-off-by: Keith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/r/20211220205919.180191-1-kbusch@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
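The ordering is the key point: the quiesce check must happen inside the same RCU read-side section that covers the driver call, otherwise a quiesce that synchronized between the check and the call could still observe new requests. A simplified sketch of the shape, not the literal patch:

```c
	rcu_read_lock();
	if (!blk_queue_quiesced(q))
		q->mq_ops->queue_rqs(&plug->mq_list);
	rcu_read_unlock();

	/*
	 * Anything still on plug->mq_list (queue quiesced, or requests the
	 * driver declined) falls back to the request-by-request path.
	 */
```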
-
Wander Lairson Costa authored
The running_trace_lock protects running_trace_list and is acquired within the tracepoint which implies disabled preemption. The spinlock_t typed lock can not be acquired with disabled preemption on PREEMPT_RT because it becomes a sleeping lock. The runtime of the tracepoint depends on the number of entries in running_trace_list and has no limit. The blk-tracer is considered debug code and higher latencies here are okay. Make running_trace_lock a raw_spinlock_t. Signed-off-by: Wander Lairson Costa <wander@redhat.com> Link: https://lore.kernel.org/r/20211220192827.38297-1-wander@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
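A sketch of the change, with the blktrace loop body abbreviated: raw_spinlock_t keeps spinning even on PREEMPT_RT, so taking it from a preemption-disabled tracepoint remains legal.

```c
static LIST_HEAD(running_trace_list);
static DEFINE_RAW_SPINLOCK(running_trace_lock);	/* was a spinlock_t */

static void trace_note_tsk(struct task_struct *tsk)
{
	unsigned long flags;
	struct blk_trace *bt;

	raw_spin_lock_irqsave(&running_trace_lock, flags);
	list_for_each_entry(bt, &running_trace_list, running_list) {
		/* note the pid/comm of the submitting task for each running trace */
	}
	raw_spin_unlock_irqrestore(&running_trace_lock, flags);
}
```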
-
- 16 Dec, 2021 17 commits
-
-
Christoph Hellwig authored
Only bfq needs the code to track icqs, so make it conditional. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-12-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Fold create_task_io_context into the only remaining caller. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-11-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
The flow in set_task_ioprio can be simplified by simply open coding create_task_io_context, which removes a refcount roundtrip on the I/O context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-10-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Fold get_task_io_context into its only caller, and simplify the code as no reference to the I/O context is required to just set the ioprio field. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Keep set_task_ioprio with the other low-level code that accesses the io_context structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-8-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Fold __ioc_clear_queue into ioc_clear_queue and switch to always use plain _irq locking instead of the more expensive _irqsave that is not needed here. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Move the code to delay freeing the icqs into a separate helper. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
No caller passes in a NULL pointer, so remove the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Factor out an ioc_exit_icqs helper to tear down the icqs, and fold the rest of put_iocontext_active into exit_io_context. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Don't hold a reference to ->refcount for each active reference, but just one for all active references. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
Nothing ever looks at ->nr_tasks, so remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20211209063131.18537-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
This enables the block layer to send us a full plug list of requests that need submitting. The block layer guarantees that they all belong to the same queue, but we do have to check the hardware queue mapping for each request. If errors are encountered, leave them in the passed in list. Then the block layer will handle them individually. This is good for about a 4% improvement in peak performance, taking us from 9.6M to 10M IOPS/core. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Add an nvme_prep_rq() helper to set up a command, and adapt nvme_queue_rq() to use this helper. Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We'll need it for batched submit as well. Since we now have a copy helper, get rid of the nvme_submit_cmd() wrapper. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If we have a list of requests in our plug list, send it to the driver in one go, if possible. The driver must set mq_ops->queue_rqs() to support this, if not the usual one-by-one path is used. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
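Roughly, the plug flush tries the batched hook first and only falls back per request for whatever is left; the fallback helper name below is made up for illustration.

```c
	struct request_queue *q = rq_list_peek(&plug->mq_list)->q;

	if (q->mq_ops->queue_rqs && !blk_queue_quiesced(q)) {
		blk_mq_run_dispatch_ops(q,
				q->mq_ops->queue_rqs(&plug->mq_list));
		if (rq_list_empty(plug->mq_list))
			return;		/* driver took everything in one go */
	}

	/* leftovers (no ->queue_rqs, quiesced queue, or driver declines) */
	blk_mq_plug_issue_one_by_one(plug);	/* hypothetical fallback helper */
```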
-
Jens Axboe authored
Pointless to maintain a head/tail for the list, as we never need to access the tail. Entries are always LIFO for cache hotness reasons. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
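A head-only singly linked list is enough when entries are only ever pushed and popped at the front; a generic sketch of the two operations (field names assumed):

```c
/* push: the newest request goes to the front and stays cache-hot */
static inline void rq_list_add(struct request **listptr, struct request *rq)
{
	rq->rq_next = *listptr;
	*listptr = rq;
}

/* pop: take the most recently added request off the front (LIFO) */
static inline struct request *rq_list_pop(struct request **listptr)
{
	struct request *rq = *listptr;

	if (rq)
		*listptr = rq->rq_next;
	return rq;
}
```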
-
Jens Axboe authored
The batched completions only deal with non-partial requests anyway, and they don't deal with any requests that have errors. Add a completion handler that assumes it's a full request and that it's all being ended successfully. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 15 Dec, 2021 1 commit
-
-
Jens Axboe authored
kyber turns on IO statistics when it is loaded on a queue, which means that even if kyber is later unloaded, we're still stuck with stats enabled on the queue. Change the accounting-enabled state from a bool to an int, and pair the enable call with an equivalent disable call. This ensures that stats get turned off again appropriately. Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
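A sketch of what the paired calls can look like with an int acting as a reference count on the queue stats flag, assuming the existing q->stats->lock and QUEUE_FLAG_STATS names:

```c
void blk_stat_enable_accounting(struct request_queue *q)
{
	spin_lock_irq(&q->stats->lock);
	if (!q->stats->accounting++)		/* first user turns stats on */
		blk_queue_flag_set(QUEUE_FLAG_STATS, q);
	spin_unlock_irq(&q->stats->lock);
}

void blk_stat_disable_accounting(struct request_queue *q)
{
	spin_lock_irq(&q->stats->lock);
	if (!--q->stats->accounting)		/* last user turns stats off */
		blk_queue_flag_clear(QUEUE_FLAG_STATS, q);
	spin_unlock_irq(&q->stats->lock);
}
```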
-
- 13 Dec, 2021 1 commit
-
-
Matthew Wilcox (Oracle) authored
Add a Context section and rewrite the rest to be clearer. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Link: https://lore.kernel.org/r/20211213171113.3097631-1-willy@infradead.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 12 Dec, 2021 1 commit
-
-
Christoph Hellwig authored
mtdblock / mtdblock_ro set part_bits to 0 and thus never scanned partitions. Restore that behavior by setting the GENHD_FL_NO_PART flag. Fixes: 1ebe2e5f ("block: remove GENHD_FL_EXT_DEVT") Reported-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Link: https://lore.kernel.org/r/20211206070409.2836165-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 06 Dec, 2021 5 commits
-
-
John Garry authored
Kashyap reports high CPU usage in blk_mq_queue_tag_busy_iter() and callees using a megaraid SAS RAID card since moving to shared tags [0]. Previously, when shared tags was a shared sbitmap, this function was less than optimal since we would iterate through all tags for all hctx's, yet only ever match up to tagset-depth number of rqs. Since the change to shared tags, things are even less efficient if we have parallel callers of blk_mq_queue_tag_busy_iter(). This is because in bt_iter() -> blk_mq_find_and_get_req() there would be more contention on accessing each request ref and tags->lock since they are now shared among all HW queues. Optimise by having separate calls to bt_for_each() for when we're using shared tags. In this case no longer pass a hctx, as it is no longer relevant, and teach bt_iter() about this. Ming suggested something along the lines of this change, apart from a different implementation. [0] https://lore.kernel.org/linux-block/e4e92abbe9d52bcba6b8cc6c91c442cc@mail.gmail.com/ Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reported-and-tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Fixes: e155b0c2 ("blk-mq: Use shared tags for shared sbitmap support") Link: https://lore.kernel.org/r/1638794990-137490-4-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
John Garry authored
Typedefs busy_iter_fn and busy_tag_iter_fn are now identical, so delete busy_iter_fn to reduce duplication. It would be nicer to delete busy_tag_iter_fn, as the name busy_iter_fn is less specific. However busy_tag_iter_fn is used in many different parts of the tree, unlike busy_iter_fn which is just used in block/, so take the straightforward path now and rename treewide later. Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1638794990-137490-3-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
John Garry authored
The only user of the busy_iter_fn blk_mq_hw_ctx argument is blk_mq_rq_inflight(). Function blk_mq_rq_inflight() uses the hctx to find the associated request queue to match against the request. However this same check is already done in the caller bt_iter(), so drop this check. With that change there are no more users of the busy_iter_fn blk_mq_hw_ctx argument, so drop the argument. Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.garry@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Kashyap Desai <kashyap.desai@broadcom.com> Link: https://lore.kernel.org/r/1638794990-137490-2-git-send-email-john.garry@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
blk_mq_run_dispatch_ops() is defined as a macro, and plug->mq_list will be changed while running 'dispatch_ops', so add a local variable to hold the request queue. Reported-and-tested-by: Yi Zhang <yi.zhang@redhat.com> Fixes: 4cafe86c ("blk-mq: run dispatch lock once in case of issuing from list") Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
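The fix amounts to reading the queue pointer once before the macro body runs, since the body consumes plug->mq_list; a simplified sketch with the issuing helper's name only approximate:

```c
	struct request *rq = rq_list_peek(&plug->mq_list);
	struct request_queue *q = rq->q;	/* cache it: the list shrinks below */

	blk_mq_run_dispatch_ops(q,
			blk_mq_plug_issue_direct(plug, false));	/* name approximate */
```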
-
Ming Lei authored
The operation protected via blk_mq_run_dispatch_ops() in blk_mq_run_hw_queue won't sleep, so don't run might_sleep() for it. Reported-and-tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 03 Dec, 2021 8 commits
-
-
Ming Lei authored
It isn't necessary to call blk_mq_run_dispatch_ops() again for each single request issued directly; it is enough to do it once when issuing from the whole list. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
We have switched to allocating the srcu structure inside the request queue, so it is fine to pass the request queue to blk_mq_run_dispatch_ops(). Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
In case of BLK_MQ_F_BLOCKING, per-hctx srcu is used to protect the dispatch critical area. However, this srcu instance sits at the end of the hctx and often occupies a standalone, usually cold cacheline. Inside srcu_read_lock() and srcu_read_unlock() the write is always done on an indirect percpu variable that is allocated from the heap instead of being embedded, and srcu->srcu_idx is only read in srcu_read_lock(), so it doesn't matter whether the srcu structure lives in the hctx or the request queue. So switch to a per-request-queue srcu for protecting dispatch; this simplifies quiesce a lot, not to mention that quiesce is always done on the request queue level anyway. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Ming Lei authored
Remove hctx_lock and hctx_unlock, and add one helper, blk_mq_run_dispatch_ops(), to run a code block defined in dispatch_ops with the rcu/srcu read lock held. Compared with hctx_lock()/hctx_unlock(): 1) the two branches are reduced to one, so we only need to check (hctx->flags & BLK_MQ_F_BLOCKING) once per dispatch_ops run; 2) srcu_idx doesn't need to be touched in the non-blocking case; 3) might_sleep_if() can be moved to the blocking branch. Also put the added blk_mq_run_dispatch_ops() in a private header, so that the following patch can use it outside of blk-mq.c. Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20211203131534.3668411-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
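A sketch of what such a helper macro can look like, shown in the request-queue form it ends up with after the follow-on patches in this series (blk_queue_has_srcu() and the per-queue srcu are assumptions from those later patches):

```c
#define blk_mq_run_dispatch_ops(q, dispatch_ops)			\
do {									\
	if (!blk_queue_has_srcu(q)) {					\
		/* non-blocking: plain RCU, no srcu_idx to maintain */	\
		rcu_read_lock();					\
		(dispatch_ops);						\
		rcu_read_unlock();					\
	} else {							\
		int srcu_idx;						\
									\
		might_sleep();	/* only the blocking branch may sleep */\
		srcu_idx = srcu_read_lock((q)->srcu);			\
		(dispatch_ops);						\
		srcu_read_unlock((q)->srcu, srcu_idx);			\
	}								\
} while (0)
```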
-
Jens Axboe authored
refcount_t is not as expensive as it used to be, but it's still more expensive than the io_uring method of using atomic_t and just checking for potential over/underflow. This borrows that same implementation, which in turn is based on the mm implementation from Linus. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
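A sketch of that pattern, assuming the request's reference field is a plain atomic_t named ref: sanity-check for zero/near-overflow with a WARN instead of paying for refcount_t's saturation arithmetic on every operation.

```c
#define req_ref_zero_or_close_to_overflow(req)	\
	((unsigned int)atomic_read(&(req)->ref) + 127u <= 127u)

static inline bool req_ref_inc_not_zero(struct request *req)
{
	return atomic_inc_not_zero(&req->ref);
}

static inline bool req_ref_put_and_test(struct request *req)
{
	WARN_ON_ONCE(req_ref_zero_or_close_to_overflow(req));
	return atomic_dec_and_test(&req->ref);
}

static inline void req_ref_set(struct request *req, int value)
{
	atomic_set(&req->ref, value);
}
```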
-
Jens Axboe authored
Don't call into generic_file_read_iter() if we know it's O_DIRECT, just set it up ourselves and call our own handler. This avoids an indirect call for O_DIRECT. Fall back to filemap_read() if we fail. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
No functional changes in this patch, just in preparation for efficiently calling this light function from the block O_DIRECT handling. Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
When we attempt to merge off the cached request path, we return NULL if successful. This makes the caller believe that it should allocate a new request, and hence we end up with the bio both merged and associated with a new request. This, predictably, leads to all sorts of crashes. Pass in a pointer to the bio pointer, and clear it for the merge case. Then the caller knows that the bio is already queued, and no new requests need to get allocated. Fixes: 5b13bc8a ("blk-mq: cleanup request allocation") Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
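The caller-side contract described above, in a simplified shape (helper names and signatures are approximate):

```c
	struct request *rq;

	rq = blk_mq_get_cached_request(q, plug, &bio, nr_segs);
	if (!rq) {
		if (!bio)	/* helper cleared it: bio was merged, nothing to allocate */
			return;
		rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
		if (unlikely(!rq))
			return;
	}
```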
-
- 02 Dec, 2021 1 commit
-
-
Jens Axboe authored
The expected case is returning a request; just check for success and return the request rather than having an error label. Signed-off-by: Jens Axboe <axboe@kernel.dk>
-