Commits · fde776afdd8467a09395a7aebdb2499f86315945 · Kirill Smelkov / linux

02 Nov, 2022 10 commits

nvme: remove the NVME_NS_DEAD check in nvme_validate_ns · fde776af

Christoph Hellwig authored Nov 01, 2022

At the point where namespaces are marked dead, the controller is in a
non-live state and we won't get past the identify commands.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme: remove the NVME_NS_DEAD check in nvme_remove_invalid_namespaces · 4f17344e

Christoph Hellwig authored Nov 01, 2022

The NVME_NS_DEAD check only made sense when we revalidated namespaces
in nvme_passthru_end for commands that affected the namespace inventory.
These days NVME_NS_DEAD is only set during reset or when tearing down
namespaces, and we always remove all namespaces right after that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme: don't remove namespaces in nvme_passthru_end · 23a90864

Christoph Hellwig authored Nov 01, 2022

The call to nvme_remove_invalid_namespaces made sense when
nvme_passthru_end revalidated all namespaces and had to remove those that
didn't exist any more.  Since we don't revalidate from nvme_passthru_end
now, this call is entirely spurious.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-pci: refactor the tagset handling in nvme_reset_work · 0ffc7e98

Christoph Hellwig authored Nov 01, 2022

The code to create, update or delete a tagset and namespaces in
nvme_reset_work is a bit convoluted.  Refactor it with two high-level
conditionals for first probe vs reset and I/O queues vs no I/O queues
to make the code flow more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-3-hch@lst.de
[axboe: fix whitespace issue]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: set the disk capacity to 0 in blk_mark_disk_dead · 71b26083

Christoph Hellwig authored Nov 01, 2022

nvme and xen-blkfront are already doing this to stop buffered writes from
creating dirty pages that can't be written out later.  Move it to the
common code.

This also removes the comment about the ordering from nvme, as bd_mutex
not only is gone entirely, but also hasn't been used for locking updates
to the disk size long before that, and thus the ordering requirement
documented there doesn't apply any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20221101150050.3510-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: don't declare 'bfqd' as type 'void *' in bfq_group · aa625117

Yu Kuai authored Nov 02, 2022

Prevent unnecessary format conversion for bfqg->bfqd in multiple
places.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: remove dead code for updating 'rq_in_driver' · 918fdea3

Yu Kuai authored Nov 02, 2022

Such code is not even compiled since it is inside an "#if 0" block.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: cleanup bfq_activate_requeue_entity() · f6fd119b

Yu Kuai authored Nov 02, 2022

Just make the code a little cleaner by removing the unnecessary
variable 'sd'.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: factor out code to update 'active_entities' · e5c63eb4

Yu Kuai authored Nov 02, 2022

Current code is a bit ugly and hard to read.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@unimore.it>
Link: https://lore.kernel.org/r/20221102022542.3621219-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: remove set but not used variable in __bfq_entity_update_weight_prio · 060d9217

Yu Kuai authored Nov 02, 2022

After commit afdba146 ("block, bfq: cleanup bfq_weights_tree add/remove apis"),
the local variable 'bfqd' is not used anymore, thus remove it.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221102022542.3621219-2-yukuai1@huaweicloud.com
Fixes: afdba146 ("block, bfq: cleanup bfq_weights_tree add/remove apis")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

01 Nov, 2022 15 commits

block: Replace struct rq_depth with unsigned int in struct iolatency_grp · dc572f41

Kemeng Shi authored Oct 18, 2022

We only need a max queue depth for each iolatency_grp to limit the number
of in-flight I/Os. Replace struct rq_depth with unsigned int to simplify
"struct iolatency_grp" and save memory.
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20221018111240.22612-4-shikemeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
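
The memory saving described in this message can be illustrated with a small user-space sketch. The field list of `rq_depth_like` is an approximation of the kernel's `struct rq_depth` reproduced from memory, and `iolat_before`, `iolat_after` and `bytes_saved_per_group` are hypothetical names used only for illustration; the real `struct iolatency_grp` has many more members.

```c
#include <stdbool.h>
#include <stddef.h>

/* Approximation of struct rq_depth; the exact field list is an assumption. */
struct rq_depth_like {
	unsigned int max_depth;
	int scale_step;
	bool scaled_max;
	unsigned int queue_depth;
	unsigned int default_depth;
};

/* Hypothetical "before" layout: the whole depth structure is embedded. */
struct iolat_before {
	struct rq_depth_like rq_depth;
};

/* Hypothetical "after" layout: only the maximum depth is kept. */
struct iolat_after {
	unsigned int max_depth;
};

/* Bytes saved per group in this toy model. */
static size_t bytes_saved_per_group(void)
{
	return sizeof(struct iolat_before) - sizeof(struct iolat_after);
}
```

The saving is per group, so it scales with the number of cgroups that have an iolatency group attached.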

block: Correct comment for scale_cookie_change · 6891f968

Kemeng Shi authored Oct 18, 2022

The default queue depth of iolatency_grp is unlimited, so we scale down
quickly (halving once) in scale_cookie_change. Remove the "subtract
1/16th" part, which is inaccurate, and describe the actual way we
scale down.
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Link: https://lore.kernel.org/r/20221018111240.22612-3-shikemeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Remove redundant parent blkcg_gp check in check_scale_change · db5896e9

Kemeng Shi authored Oct 18, 2022

Function blkcg_iolatency_throttle makes sure blkg->parent is not
NULL before calling check_scale_change, and check_scale_change
is only called from blkcg_iolatency_throttle.
Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20221018111240.22612-2-shikemeng@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: split elevator_switch · 64b36075

Christoph Hellwig authored Oct 30, 2022

Split an elevator_disable helper from elevator_switch for the case where
we want to switch to no scheduler at all. This includes removing the
pointless elevator_switch_mq helper and removing the switch-to-no-scheduler
logic from blk_mq_init_sched.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't check for required features in elevator_match · ffb86425

Christoph Hellwig authored Oct 30, 2022

Checking for the required features in the callers simplifies the code
quite a bit, so do that.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-7-hch@lst.de
[axboe: adjust for dropping patch 1, use __elevator_find()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: simplify the check for the current elevator in elv_iosched_show · 2eef17a2

Christoph Hellwig authored Oct 30, 2022

Just compare the pointers instead of using the string based
elevator_match.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: cleanup the variable naming in elv_iosched_store · 16095af2

Christoph Hellwig authored Oct 30, 2022

Use eq for the elevator_queue as done elsewhere. This frees e to be used
for the loop iterator instead of the odd __ prefix. In addition rename
elv to cur to make it more clear it is the currently selected elevator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: exit elv_iosched_show early when I/O schedulers are not supported · aae2a643

Christoph Hellwig authored Oct 30, 2022

If the tag_set has BLK_MQ_F_NO_SCHED flag set we will never show any
scheduler, so exit early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: cleanup elevator_get · 81eaca44

Christoph Hellwig authored Oct 30, 2022

Do the request_module and repeated lookup in the only caller that cares,
pick a saner name that explains that we are actually doing a lookup, and
use a sane calling convention that passes the queue first.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030100714.876891-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: cleanup __bfq_weights_tree_remove() · eb5bca73

Yu Kuai authored Sep 16, 2022

It's the same as bfq_weights_tree_remove() now.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-7-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: cleanup bfq_weights_tree add/remove apis · afdba146

Yu Kuai authored Sep 16, 2022

The 'bfq_data' and 'rb_root_cached' can both be accessed through
'bfq_queue', thus only pass 'bfq_queue' as parameter.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-6-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: do not idle if only one group is activated · eed3ecc9

Yu Kuai authored Sep 16, 2022

Now that root group is counted into 'num_groups_with_pending_reqs',
'num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario(). Thus change the condition to '> 1'.

On the other hand, this change can enable concurrent sync io if only
one group is activated.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-5-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: refactor the counting of 'num_groups_with_pending_reqs' · 71f8ca77

Yu Kuai authored Sep 16, 2022

Currently, bfq can't handle sync io concurrently as long as they
are not issued from root group. This is because
'bfqd->num_groups_with_pending_reqs > 0' is always true in
bfq_asymmetric_scenario().

The way that bfqg is counted into 'num_groups_with_pending_reqs':

Before this patch:
 1) root group will never be counted.
 2) Count if bfqg or its child bfqgs have pending requests.
 3) Don't count if bfqg and its child bfqgs complete all the requests.

After this patch:
 1) root group is counted.
 2) Count if bfqg has pending requests.
 3) Don't count if bfqg completes all the requests.

With this change, the case where only one group is activated can be
detected, and the next patch will support concurrent sync io in that
case.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
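
The "after this patch" counting rule can be sketched as a toy user-space model. All `toy_*` names are hypothetical and stand in for the BFQ structures; the `> 1` threshold is the one described in the "do not idle if only one group is activated" patch in this series. This is a simplified illustration under those assumptions, not the kernel code.

```c
#include <stdbool.h>

/* Toy stand-in for the per-device scheduler data. */
struct toy_bfqd {
	unsigned int num_groups_with_pending_reqs;
};

/* Toy stand-in for a group; the root group is treated like any other. */
struct toy_group {
	struct toy_bfqd *bfqd;
	unsigned int num_queues_with_pending_reqs;
};

/* A queue in group g starts to have pending requests. */
static void toy_queue_add_pending(struct toy_group *g)
{
	/* Count the group only on its own 0 -> 1 transition. */
	if (g->num_queues_with_pending_reqs++ == 0)
		g->bfqd->num_groups_with_pending_reqs++;
}

/* A queue in group g has completed all of its requests. */
static void toy_queue_done(struct toy_group *g)
{
	/* Stop counting the group on its own 1 -> 0 transition. */
	if (--g->num_queues_with_pending_reqs == 0)
		g->bfqd->num_groups_with_pending_reqs--;
}

/* True only when more than one group is active. */
static bool toy_multiple_groups_active(const struct toy_bfqd *d)
{
	return d->num_groups_with_pending_reqs > 1;
}
```

Because child groups no longer contribute to their parent's count, a single active group yields a count of exactly one, which is what makes the single-group case detectable.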

block, bfq: record how many queues have pending requests · 60a6e10c

Yu Kuai authored Sep 16, 2022

Prepare to refactor the counting of 'num_groups_with_pending_reqs'.

Add a counter in bfq_group, update it while tracking if bfqq has pending
requests and when bfq_bfqq_move() is called.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block, bfq: support to track if bfqq has pending requests · 3d89bd12

Yu Kuai authored Sep 16, 2022

If entity belongs to bfqq, then entity->in_groups_with_pending_reqs
is not used currently. This patch uses it to track if bfqq has pending
requests through callers of weights_tree insertion and removal.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Link: https://lore.kernel.org/r/20220916071942.214222-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

31 Oct, 2022 4 commits

blk-mq: remove redundant call to blk_freeze_queue_start in blk_mq_destroy_queue · 56c1ee92

Jinlong Chen authored Oct 30, 2022

The calling relationship in blk_mq_destroy_queue() is as follows:

blk_mq_destroy_queue()
    ...
    -> blk_queue_start_drain()
        -> blk_freeze_queue_start()  <- called
        ...
    -> blk_freeze_queue()
        -> blk_freeze_queue_start()  <- called again
        -> blk_mq_freeze_queue_wait()
    ...

So there is a redundant call to blk_freeze_queue_start().

Replace blk_freeze_queue() with blk_mq_freeze_queue_wait() to avoid the
redundant call.
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030083212.1251255-1-nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: move queue_is_mq out of blk_mq_cancel_work_sync · 219cf43c

Jinlong Chen authored Oct 30, 2022

The only caller that needs queue_is_mq check is del_gendisk, so move the
check into it.
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221030094730.1275463-1-nickyc975@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: simplify blksize_bits() implementation · adff2158

Dawei Li authored Oct 30, 2022

Convert the current loop-based implementation into a bit operation,
which brings two improvements:

1) bitops are more efficient thanks to their arch-level optimization.

2) Given that blksize_bits() is inline, _if_ @size is compile-time
constant, it's possible that order_base_2() _may_ make output
compile-time evaluated, depending on code context and compiler behavior.
Signed-off-by: Dawei Li <set_pte_at@outlook.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/TYCP286MB23238842958D7C083D6B67CECA349@TYCP286MB2323.JPNP286.PROD.OUTLOOK.COM
Signed-off-by: Jens Axboe <axboe@kernel.dk>
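
The conversion described above can be sketched in user space as follows. The loop form and the `order_base_2()`-based expression are reproduced from memory and should be treated as assumptions, `order_base_2_sketch` is a hypothetical stand-in for the kernel macro, and `__builtin_clz` assumes GCC or Clang with a 32-bit `unsigned int`.

```c
#define SECTOR_SHIFT 9

/* Loop-based form: shift until the size drops to 256 or below. */
static unsigned int blksize_bits_loop(unsigned int size)
{
	unsigned int bits = 8;

	do {
		bits++;
		size >>= 1;
	} while (size > 256);
	return bits;
}

/* Stand-in for order_base_2(): ceil(log2(n)), and 0 for n <= 1. */
static unsigned int order_base_2_sketch(unsigned int n)
{
	return n > 1 ? 32u - (unsigned int)__builtin_clz(n - 1) : 0;
}

/* Bit-operation form: one count-leading-zeros instead of a loop. */
static unsigned int blksize_bits_bitops(unsigned int size)
{
	return order_base_2_sketch(size >> SECTOR_SHIFT) + SECTOR_SHIFT;
}
```

For power-of-two block sizes from 512 bytes upward, both forms should return the same value, for example 12 for a 4096-byte block.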

blk-mq: avoid double ->queue_rq() because of early timeout · 82c22947

David Jeffery authored Oct 26, 2022

David Jeffery found a double ->queue_rq() issue. So far it can be
triggered in VM use cases because of long vmexit latency, preemption
latency of the vCPU pthread, or a long page fault in the vCPU pthread:
a block I/O request can time out before it is queued to hardware but
after blk_mq_start_request() has been called during ->queue_rq(). The
timeout handler may then handle it by requeueing, which causes a double
->queue_rq() and a kernel panic.

So far, it has been the driver's responsibility to cover the race between
timeout and completion, so in theory this should be solved in the driver,
given that the driver has enough knowledge.

But this is really a common problem: lots of drivers could have a similar
issue, it could be hard to fix all affected drivers, and it isn't even
easy for a driver to handle the race. So David suggests this patch, which
solves the issue by draining in-progress ->queue_rq() calls.

Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: virtualization@lists.linux-foundation.org
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: David Jeffery <djeffery@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221026051957.358818-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

25 Oct, 2022 8 commits

block: Micro-optimize get_max_segment_size() · 95465318

Bart Van Assche authored Oct 25, 2022

This patch removes a conditional jump from get_max_segment_size(). The
x86-64 assembler code for this function without this patch is as follows:

206             return min_not_zero(mask - offset + 1,
   0x0000000000000118 <+72>:    not    %rax
   0x000000000000011b <+75>:    and    0x8(%r10),%rax
   0x000000000000011f <+79>:    add    $0x1,%rax
   0x0000000000000123 <+83>:    je     0x138 <bvec_split_segs+104>
   0x0000000000000125 <+85>:    cmp    %rdx,%rax
   0x0000000000000128 <+88>:    mov    %rdx,%r12
   0x000000000000012b <+91>:    cmovbe %rax,%r12
   0x000000000000012f <+95>:    test   %rdx,%rdx
   0x0000000000000132 <+98>:    mov    %eax,%edx
   0x0000000000000134 <+100>:   cmovne %r12d,%edx

With this patch applied:

206             return min(mask - offset, (unsigned long)lim->max_segment_size - 1) + 1;
   0x000000000000003f <+63>:    mov    0x28(%rdi),%ebp
   0x0000000000000042 <+66>:    not    %rax
   0x0000000000000045 <+69>:    and    0x8(%rdi),%rax
   0x0000000000000049 <+73>:    sub    $0x1,%rbp
   0x000000000000004d <+77>:    cmp    %rbp,%rax
   0x0000000000000050 <+80>:    cmova  %rbp,%rax
   0x0000000000000054 <+84>:    add    $0x1,%eax
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-4-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Constify most queue limits pointers · aa261f20

Bart Van Assche authored Oct 25, 2022

Document which functions do not modify the queue limits.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Keith Busch <kbusch@kernel.org>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-3-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Remove request.write_hint · b179c98f

Bart Van Assche authored Oct 25, 2022

Commit c75e707f ("block: remove the per-bio/request write hint")
removed all code that uses the struct request write_hint member. Hence
also remove 'write_hint' itself.
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20221025191755.1711437-2-bvanassche@acm.org
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove bio_start_io_acct_time · a55b70f1

Christoph Hellwig authored Oct 25, 2022

bio_start_io_acct_time is not actually used anywhere, so remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221025155916.270303-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-apple: remove an extra queue reference · 941f7298

Christoph Hellwig authored Oct 18, 2022

Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the
apple_nvme structure.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

nvme-pci: remove an extra queue reference · 7dcebef9

Christoph Hellwig authored Oct 18, 2022

Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second admin queue reference to be held by the nvme_dev.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

scsi: remove an extra queue reference · dc917c36

Christoph Hellwig authored Oct 18, 2022

Now that blk_mq_destroy_queue does not release the queue reference, there
is no need for a second queue reference to be held by the scsi_device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: move the call to blk_put_queue out of blk_mq_destroy_queue · 2b3f056f

Christoph Hellwig authored Oct 18, 2022

The fact that blk_mq_destroy_queue also drops a queue reference leads
to various places having to grab an extra reference.  Move the call to
blk_put_queue into the callers to allow removing the extra references.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20221018135720.670094-2-hch@lst.de
[axboe: fix fabrics_q vs admin_q conflict in nvme core.c]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

24 Oct, 2022 3 commits

block: fix up elevator_type refcounting · 8ed40ee3

Jinlong Chen authored Oct 20, 2022

The current reference management logic of io scheduler modules contains
refcnt problems. For example, blk_mq_init_sched may fail before or after
the calling of e->ops.init_sched. If it fails before the calling, it does
nothing to the reference to the io scheduler module. But if it fails after
the calling, it releases the reference by calling kobject_put(&eq->kobj).

As the callers of blk_mq_init_sched can't know exactly where the failure
happens, they can't handle the reference to the io scheduler module
properly: releasing the reference on failure results in double-release if
blk_mq_init_sched has released it, and not releasing the reference results
in ghost reference if blk_mq_init_sched did not release it either.

The same problem also exists in io schedulers' init_sched implementations.

We can address the problem by adding releasing statements to the error
handling procedures of blk_mq_init_sched and init_sched implementations.
But that is counterintuitive and requires modifications to existing io
schedulers.

Instead, we make elevator_alloc get the io scheduler module references
that will be released by elevator_release. And then, we match each
elevator_get with an elevator_put. Therefore, each reference to an io
scheduler module explicitly has its own getter and releaser, and we no
longer need to worry about the refcnt problems.

The bugs and the patch can be validated with tools here:
https://github.com/nickyc975/linux_elv_refcnt_bug.git

[hch: split out a few bits into separate patches, use a non-try
module_get in elevator_alloc]
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
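
The pairing rule described in this message (each get has its own put, and alloc takes a reference that release drops) can be sketched as a toy user-space model. All `toy_*` names are hypothetical, a plain integer stands in for the module reference count, and `fail_init` simulates a failure at or after `init_sched`; this is an illustration of the idea, not the kernel implementation.

```c
#include <stddef.h>

struct toy_module { int refcnt; };
struct toy_elevator_type { struct toy_module *owner; };
struct toy_elevator_queue { struct toy_elevator_type *type; };

/* Lookup reference: always paired with toy_elevator_put(). */
static struct toy_elevator_type *toy_elevator_get(struct toy_elevator_type *e)
{
	e->owner->refcnt++;
	return e;
}

static void toy_elevator_put(struct toy_elevator_type *e)
{
	e->owner->refcnt--;
}

/* The queue takes its own reference, dropped only by release. */
static void toy_elevator_alloc(struct toy_elevator_queue *eq,
			       struct toy_elevator_type *e)
{
	e->owner->refcnt++;
	eq->type = e;
}

static void toy_elevator_release(struct toy_elevator_queue *eq)
{
	eq->type->owner->refcnt--;
	eq->type = NULL;
}

/* Switch to scheduler e; returns 0 on success, -1 on simulated failure. */
static int toy_switch(struct toy_elevator_type *e,
		      struct toy_elevator_queue *eq, int fail_init)
{
	struct toy_elevator_type *found = toy_elevator_get(e);
	int ret = 0;

	toy_elevator_alloc(eq, found);
	if (fail_init) {
		toy_elevator_release(eq);	/* undo only what alloc took */
		ret = -1;
	}
	toy_elevator_put(found);	/* paired with the get on every path */
	return ret;
}
```

In this model the caller never needs to know where a failure happened: the lookup reference is dropped unconditionally, and the queue's reference is dropped exactly once by release.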

block: check for an unchanged elevator earlier in __elevator_change · b54c2ad9

Jinlong Chen authored Oct 20, 2022

No need to find the actual elevator_type struct for this comparison,
the name is all that is needed.
Signed-off-by: Jinlong Chen <nickyc975@zju.edu.cn>
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: sanitize the elevator name before passing it to __elevator_change · 58367c8a

Christoph Hellwig authored Oct 20, 2022

The stripped name should also be used for the none check.  To do so
strip it in the caller and pass in the sanitized name.  Drop the pointless
__ prefix in the function name while we're at it.

Based on a patch from Jinlong Chen <nickyc975@zju.edu.cn>.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20221020064819.1469928-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
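
The idea of stripping the name in the caller so that the "none" check also sees the sanitized string can be sketched in user space. `strip_sketch` is a hypothetical stand-in for the kernel's `strstrip()`, and `wants_no_scheduler` is an illustrative helper, not a kernel function; the exact comparison used in the kernel is not reproduced here.

```c
#include <ctype.h>
#include <string.h>

/* Trim leading and trailing whitespace in place; returns the new start. */
static char *strip_sketch(char *s)
{
	size_t len;

	while (isspace((unsigned char)*s))
		s++;
	len = strlen(s);
	while (len > 0 && isspace((unsigned char)s[len - 1]))
		s[--len] = '\0';
	return s;
}

/* Caller-side check: sanitize first, then compare against "none". */
static int wants_no_scheduler(char *buf)
{
	return strcmp(strip_sketch(buf), "none") == 0;
}
```

With the strip done up front, input such as "none" followed by a newline (as written by `echo`) is treated the same as the bare name.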
