Commits · 3304742562d27fb87a6d8291cc48824dd20f6964 · Kirill Smelkov / linux

29 Nov, 2021 40 commits

block: mark put_io_context_active static · 33047425

Christoph Hellwig authored Nov 26, 2021

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Revert "block: Provide blk_mq_sched_get_icq()" · c2a32464

Christoph Hellwig authored Nov 26, 2021

This reverts commit 4896c4e64ba5d5d5acdbcf68c5910dd4f6d8fa62.

The helper is not needed any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: use bfq_bic_lookup in bfq_limit_depth · a0725c22

Christoph Hellwig authored Nov 26, 2021

No need to create a new I/O context if there is none present yet in
->limit_depth.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: simplify bfq_bic_lookup · 836b394b

Christoph Hellwig authored Nov 26, 2021

Remove the unused bfqd argument, and hardcode ioc to current->io_context.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fork: move copy_io to block/blk-ioc.c · 88c9a2ce

Christoph Hellwig authored Nov 26, 2021

Move the copying of the I/O context to the block layer as that is where
we can use the proper low-level interfaces.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

RDMA/qib: rename copy_io to qib_copy_io · e92a559e

Christoph Hellwig authored Nov 26, 2021

Add the proper module prefix to avoid conflicts with a function
in the scheduler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: use bio->bi_opf after bio is checked · 5f480b1a

Ming Lei authored Nov 27, 2021

bio->bi_opf isn't finalized before checking the bio, so use it after
submit_bio_checks() returns.

Fixes: 5b13bc8a ("blk-mq: cleanup request allocation")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: Do not let waker requests skip proper accounting · c65e6fd4

Jan Kara authored Nov 25, 2021

Commit 7cc4ffc5 ("block, bfq: put reqs of waker and woken in
dispatch list") added a condition to bfq_insert_request() which added
waker's requests directly to dispatch list. The rationale was that
completing waker's IO is needed to get more IO for the current queue.
Although this rationale is valid, there is a hole in it. The waker does
not necessarily serve the IO only for the current queue and maybe its
current IO is not needed for the current queue to make progress. Furthermore,
injecting IO like this completely bypasses any service accounting within
bfq and thus we do not properly track how much service the waker's queue
is getting or that the waker is actually doing any IO. Depending on the
conditions this can result in the waker getting too much or too little
service.

Consider for example the following job file:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
prioclass=2
prio=7
fsync=200

[fastwriter]
numjobs=1
prioclass=2
prio=0
fsync=200

Although the processes have very different IO priorities, they get the
same amount of service. The reason is that bfq identifies these processes
as having a waker-wakee relationship and once that happens, IO from
fastwriter gets injected during slowwriter's time slice. As a result bfq
is not aware that fastwriter has any IO to do and constantly schedules
only slowwriter's queue. Thus fastwriter is forced to compete with
slowwriter's IO all the time instead of getting its share of time based
on IO priority.

Drop the special injection condition from bfq_insert_request(). As a
result, requests will be tracked and queued in a normal way and on next
dispatch bfq_select_queue() can decide whether the waker's inserted
requests should be injected during the current queue's timeslice or not.
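
The accounting hole described above can be illustrated with a small user-space model. This is only a sketch, not kernel code: the `Queue` and `Scheduler` classes and the `special_injection` flag are hypothetical names introduced here, and the real bfq_insert_request() is considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    pending: list = field(default_factory=list)  # requests the scheduler accounts for


@dataclass
class Scheduler:
    dispatch: list = field(default_factory=list)  # requests that bypass accounting

    def insert(self, queue, req, in_service, is_waker, special_injection):
        # Old behaviour (modelled by special_injection=True): a waker's request
        # goes straight to the dispatch list, so its queue looks idle.
        if special_injection and is_waker and queue is not in_service:
            self.dispatch.append(req)
            return
        # New behaviour: every request is queued and tracked normally.
        queue.pending.append(req)


slow, fast = Queue("slowwriter"), Queue("fastwriter")
old, new = Scheduler(), Scheduler()

old.insert(fast, "w1", in_service=slow, is_waker=True, special_injection=True)
old_visible = len(fast.pending)  # fastwriter's IO is invisible to the scheduler

new.insert(fast, "w1", in_service=slow, is_waker=True, special_injection=False)
new_visible = len(fast.pending)  # fastwriter's IO is now tracked
```

In the model, the old path leaves fastwriter's queue empty from the scheduler's point of view, which matches the symptom in the job file above: only slowwriter's queue ever appears to have work.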

Fixes: 7cc4ffc5 ("block, bfq: put reqs of waker and woken in dispatch list")
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: Log waker detections · 1eb17f5e

Jan Kara authored Nov 25, 2021

Waker - wakee relationships are important in deciding whether one queue
can preempt the other one. Print information about detected waker-wakee
relationships so that scheduling decisions can be better understood from
block traces.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: Provide helper to generate bfqq name · 582f04e1

Jan Kara authored Nov 25, 2021

Instead of having a helper formatting the bfqq pid, provide a helper to
generate the full bfqq name as used in the traces. It saves some code
duplication and will save more in the coming tracepoints.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: Limit waker detection in time · 1f18b700

Jan Kara authored Nov 25, 2021

Currently, when process A starts issuing requests shortly after process
B has completed some IO three times in a row, we decide that B is a
"waker" of A meaning that completing IO of B is needed for A to make
progress and generally stop separating A's and B's IO much. This logic
is useful to avoid unnecessary idling and thus throughput loss for cases
where workload needs to switch e.g. between the process and the
journaling thread doing IO. However, the detection heuristic tends to
frequently give false positives when A and B are fighting for IO bandwidth
and other processes aren't doing much IO, as we are basically doomed to
eventually accumulate three occurrences of a situation where one process
starts issuing requests after the other has completed some IO. To reduce
these false positives, also cancel the waker detection if we didn't
accumulate three detected wakeups within a given timeout. The rationale is
that if wakeups are really rare, the pointless idling doesn't hurt
throughput that much anyway.
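
A minimal sketch of the time-limited heuristic, assuming a simple per-queue detector. The `WakerDetector` class and the `DETECTION_TIMEOUT` value are hypothetical and chosen for illustration; only the count of three wakeups comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

WAKEUPS_NEEDED = 3        # "three times in a row", per the description above
DETECTION_TIMEOUT = 4.0   # hypothetical detection window, arbitrary time units


@dataclass
class WakerDetector:
    candidate: Optional[str] = None
    count: int = 0
    started_at: float = 0.0
    waker: Optional[str] = None

    def observe(self, last_completer: str, now: float) -> None:
        """Record that this queue began issuing IO right after last_completer finished IO."""
        expired = now - self.started_at > DETECTION_TIMEOUT
        if last_completer != self.candidate or expired:
            # New candidate, or the window ran out: restart detection.
            self.candidate = last_completer
            self.count = 1
            self.started_at = now
            return
        self.count += 1
        if self.count == WAKEUPS_NEEDED:
            self.waker = last_completer


quick = WakerDetector()
for t in (0.0, 1.0, 2.0):      # three wakeups inside the window
    quick.observe("B", t)

sparse = WakerDetector()
for t in (0.0, 10.0, 20.0):    # wakeups too far apart to count
    sparse.observe("B", t)
```

With these inputs `quick` settles on B as a waker while `sparse` never does, which is the false-positive reduction the change aims for.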

This significantly reduces false waker detection for a workload like:

[global]
directory=/mnt/repro/
rw=write
size=8g
time_based
runtime=30
ramp_time=10
blocksize=1m
direct=0
ioengine=sync

[slowwriter]
numjobs=1
fsync=200

[fastwriter]
numjobs=1
fsync=200
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: Limit number of requests consumed by each cgroup · 76f1df88

Jan Kara authored Nov 25, 2021

When cgroup IO scheduling is used with BFQ it does not really provide
service differentiation if the cgroup drives a big IO depth. That for
example happens with writeback which asynchronously submits lots of IO
but it can happen with AIO as well. The problem is that if we have two
cgroups that submit IO with different weights, the cgroup with higher
weight properly gets more IO time and is able to dispatch more IO.
However this causes lower weight cgroup to accumulate more requests
inside BFQ and eventually lower weight cgroup consumes most of IO
scheduler tags. At that point higher weight cgroup stops getting better
service as it is mostly blocked waiting for a scheduler tag while its
queues inside BFQ are empty and thus lower weight cgroup gets served.

Check how many requests the submitting cgroup has allocated in
bfq_limit_depth() and, if it consumes more requests than what would
correspond to its weight, limit the available depth to 1 so that the cgroup
cannot consume many more requests. With this limitation the higher
weight cgroup gets proper service even with writeback.
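
The proportional check can be sketched as follows. This is a simplified, flat model: the function name, its parameters and the integer arithmetic are assumptions made for illustration, and the real bfq_limit_depth() works on the cgroup hierarchy rather than on a single pair of weights.

```python
def limit_depth(allocated: int, weight: int, total_weight: int,
                full_depth: int, default_depth: int) -> int:
    """Return the tag depth a cgroup may use for its next request (illustrative)."""
    # Share of scheduler tags that corresponds to the cgroup's weight.
    fair_share = full_depth * weight // total_weight
    if allocated > fair_share:
        return 1  # over its share: allow almost no further allocation
    return default_depth


# Low-weight cgroup already holding 200 of 256 tags is throttled.
throttled = limit_depth(allocated=200, weight=100, total_weight=1000,
                        full_depth=256, default_depth=64)

# High-weight cgroup well under its share keeps the normal depth.
normal = limit_depth(allocated=10, weight=900, total_weight=1000,
                     full_depth=256, default_depth=64)
```

Limiting the depth to 1 rather than 0 keeps the over-consuming cgroup able to make progress while preventing it from taking many more tags.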
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: Store full bitmap depth in bfq_data · 44dfa279

Jan Kara authored Nov 25, 2021

Store bitmap depth shift inside bfq_data so that we can use it in
bfq_limit_depth() for proportioning when limiting number of available
request tags for a cgroup.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bfq: Track number of allocated requests in bfq_entity · 98f04499

Jan Kara authored Nov 25, 2021

When we want to limit number of requests used by each bfqq and also
cgroup, we need to track also number of requests used by each cgroup.
So track number of allocated requests for each bfq_entity.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: Provide blk_mq_sched_get_icq() · 790cf9c8

Jan Kara authored Nov 25, 2021

Currently we look up the ICQ only after the request is allocated. However,
BFQ will want to decide how many scheduler tags it allows a given bfq queue
(effectively a process) to consume based on cgroup weight. So provide a
function blk_mq_sched_get_icq() so that BFQ can look up the ICQ earlier.
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20211125133645.27483-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mmc: core: Use blk_mq_complete_request_direct(). · 639d3531

Sebastian Andrzej Siewior authored Oct 25, 2021

The completion callback for the sdhci-pci device is invoked from a
kworker.
I couldn't identify the context in which mmc_blk_mq_req_done() is invoked,
but the remaining callers are invoked from preemptible context. Here it
would make sense to complete the request directly instead of scheduling
ksoftirqd for its completion.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
Acked-by: Adrian Hunter <adrian.hunter@intel.com>
Link: https://lore.kernel.org/r/20211025070658.1565848-3-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: Add blk_mq_complete_request_direct() · e8dc17e2

Sebastian Andrzej Siewior authored Oct 25, 2021

Add blk_mq_complete_request_direct() which completes the block request
directly instead of deferring it to softirq for single queue devices.

This is useful for devices which complete the requests in preemptible
context, where raising a softirq means scheduling ksoftirqd.
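
The difference between the two completion paths can be modelled in user space. This is a sketch only: the `softirq_pending` deque and `run_softirq()` are hypothetical stand-ins for the kernel's deferred-completion machinery, and the Python functions merely borrow the kernel names.

```python
from collections import deque

softirq_pending = deque()  # stand-in for work deferred to softirq context


def complete_request(rq, done):
    """Deferred path: queue the completion so it runs later."""
    softirq_pending.append((rq, done))


def complete_request_direct(rq, done):
    """Direct path: run the completion callback in the caller's context."""
    done(rq)


def run_softirq():
    """Drain deferred completions (in the kernel this may mean waking ksoftirqd)."""
    while softirq_pending:
        rq, done = softirq_pending.popleft()
        done(rq)


finished = []
complete_request("rq-a", finished.append)
before_softirq = list(finished)   # nothing has completed yet
run_softirq()
complete_request_direct("rq-b", finished.append)
```

The direct variant avoids the extra hop, which is the saving the change is after when the caller already runs in preemptible context.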
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211025070658.1565848-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-crypto: remove blk_crypto_unregister() · 72cd9df2

Eric Biggers authored Nov 23, 2021

This function is trivial and is only used in one place. Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: cleanup request allocation · 5b13bc8a

Christoph Hellwig authored Nov 24, 2021

Refactor the request allocation so that blk_mq_get_cached_request tries
to find a cached request first, and the entirely separate and now
self-contained blk_mq_get_new_requests allocates one or more requests
if that is not possible.

There is a small change in behavior as submit_bio_checks is called
twice now if a cached request is present but can't be used, but that
is a small price to pay for unwinding this code.
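
The resulting control flow, including the double check, can be sketched as below. This is a simplified model under stated assumptions: `get_request`, `compatible` and `alloc_new` are hypothetical names, and the order of the steps inside the cached path is an approximation of the kernel code.

```python
def get_request(cached, bio, checks, compatible, alloc_new):
    """Return a request for bio, or None if the bio fails its checks."""
    if cached:
        # Cached path: validate the bio, then see whether the cached
        # request can actually be used for it.
        if not checks(bio):
            return None
        if compatible(cached[-1], bio):
            return cached.pop()
    # New-request path: self-contained, so it validates the bio itself.
    if not checks(bio):
        return None
    return alloc_new(bio)


calls = []


def counting_checks(bio):
    calls.append(bio)
    return True


# A cached request exists but cannot be used: the checks run twice.
rq_fallback = get_request(["cached-rq"], "bio0", counting_checks,
                          lambda rq, bio: False, lambda bio: "new-rq")
fallback_calls = len(calls)

# A usable cached request: the checks run once.
calls.clear()
rq_cached = get_request(["cached-rq"], "bio1", counting_checks,
                        lambda rq, bio: True, lambda bio: "new-rq")
cached_calls = len(calls)
```

Only the fallback case pays for the second check, which is why the message calls it a small price.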
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124062856.1444266-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't include <linux/part_stat.h> in blk.h · 82d981d4

Christoph Hellwig authored Nov 23, 2021

Not needed, shift it into the source files that need it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't include <linux/idr.h> in blk.h · ca5b304c

Christoph Hellwig authored Nov 23, 2021

Not needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't include <linux/blk-mq.h> in blk.h · a2ff7781

Christoph Hellwig authored Nov 23, 2021

Not needed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't include blk-mq.h in blk.h · e4a19f72

Christoph Hellwig authored Nov 23, 2021

Not needed, shift a blk-stat.h include into the source file that needs it
instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't include blk-mq-sched.h in blk.h · 2aa7745b

Christoph Hellwig authored Nov 23, 2021

Not needed, shift it into the source files that need it instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove the e argument to elevator_exit · 0c6cb3a2

Christoph Hellwig authored Nov 23, 2021

All callers pass q->elevator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove elevator_exit · f46b81c5

Christoph Hellwig authored Nov 23, 2021

Open code elevator_exit in its only caller, and rename __elevator_exit to
elevator_exit.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: move blk_get_flush_queue to blk-flush.c · 0281ed3c

Christoph Hellwig authored Nov 23, 2021

blk_get_flush_queue is only used in blk-flush.c, so move it there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123185312.1432157-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk_mq: remove repeated includes · 35c90e6e

Guo Zhengkui authored Nov 23, 2021

Remove a repeated "#include<linux/sched/sysctl.h>".
Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Link: https://lore.kernel.org/r/20211123063340.25882-1-guozhengkui@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: move io_context creation into where it's needed · 5a9d041b

Jens Axboe authored Nov 13, 2021

The only user of the io_context for IO is BFQ, yet we put the checking
and logic of it into the normal IO path.

Put the creation into blk_mq_sched_assign_ioc(), and have BFQ use that
helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: only allocate poll_stats if there's a user of them · 48b5c1fb

Jens Axboe authored Nov 13, 2021

This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.

Kill the queue flag for this while at it, since we can just check the
assigned pointer now.
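
The allocate-on-demand pattern, with the pointer itself replacing the flag, might look like this in a user-space model. The `RequestQueue` class, its method names and the bucket count are hypothetical and only illustrate the idea.

```python
class RequestQueue:
    """Toy queue whose poll statistics are allocated lazily (illustrative only)."""

    def __init__(self):
        # Not embedded in the queue any more: absent until someone needs it.
        self.poll_stat = None

    def poll_stats_enabled(self) -> bool:
        # No separate queue flag: the assigned pointer is the indicator.
        return self.poll_stat is not None

    def enable_poll_stats(self, buckets: int = 16) -> None:
        # Allocate on first use; later calls keep the existing statistics.
        if self.poll_stat is None:
            self.poll_stat = [0] * buckets


q = RequestQueue()
enabled_before = q.poll_stats_enabled()
q.enable_poll_stats()
enabled_after = q.poll_stats_enabled()
```

Queues that never use the statistics never pay for the storage, which is where the size reduction comes from.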
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-ioprio: don't set bio priority if not needed · 25c4b5e0

Jens Axboe authored Nov 13, 2021

We don't need to write to the bio if:

1) No ioprio value has ever been assigned to the blkcg
2) We wouldn't anyway, depending on bio and blkcg IO priority
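
The two conditions can be sketched as a pair of early exits. This is a hedged model: the `Bio` class, the `writes` counter, the use of `None` for "never assigned", and the rule that the numerically larger priority value wins are all assumptions made for the example.

```python
class Bio:
    """Toy bio that counts how often its priority field is written."""

    def __init__(self, ioprio: int):
        self.ioprio = ioprio
        self.writes = 0


def apply_blkcg_prio(bio: Bio, blkcg_prio) -> None:
    # 1) No ioprio value has ever been assigned to the blkcg: nothing to do.
    if blkcg_prio is None:
        return
    # 2) The blkcg value would not change the bio's priority: skip the write.
    if blkcg_prio <= bio.ioprio:
        return
    bio.ioprio = blkcg_prio
    bio.writes += 1


untouched = Bio(ioprio=3)
apply_blkcg_prio(untouched, None)   # case 1
apply_blkcg_prio(untouched, 2)      # case 2

raised = Bio(ioprio=1)
apply_blkcg_prio(raised, 2)         # the only case that writes
```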
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: move more plug handling from blk_mq_submit_bio into blk_add_rq_to_plug · 1e9c2303

Christoph Hellwig authored Nov 23, 2021

Keep all the functionality for adding a request to a plug in a single place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

blk-mq: simplify the plug handling in blk_mq_submit_bio · 0c5bcc92

Christoph Hellwig authored Nov 23, 2021

blk_mq_submit_bio has two different plug cases, one that uses full
plugging and a limited plugging one.

The limited plugging case is only used for a corner case that does
not matter in real life:

 - no ->commit_rqs (so not NVMe)
 - no shared tags (so not SCSI)
 - not rotational (so no old disk or floppy driver)
 - must have multiple queues (so no eMMC)

Remove the limited merging case and all the related junk to simplify
blk_mq_submit_bio and the functions called from it.
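
The four conditions listed above combine into a single predicate, which shows how narrow the removed case was. The `QueueTraits` fields and the example devices are hypothetical names used for illustration, not kernel structures.

```python
from dataclasses import dataclass


@dataclass
class QueueTraits:
    has_commit_rqs: bool   # driver implements ->commit_rqs
    shared_tags: bool      # tag set is shared between queues
    rotational: bool       # rotating media
    nr_hw_queues: int      # number of hardware queues


def used_limited_plugging(q: QueueTraits) -> bool:
    """True only for the corner case the limited plugging path served."""
    return (not q.has_commit_rqs
            and not q.shared_tags
            and not q.rotational
            and q.nr_hw_queues > 1)


nvme_like = QueueTraits(has_commit_rqs=True, shared_tags=False,
                        rotational=False, nr_hw_queues=8)
scsi_like = QueueTraits(has_commit_rqs=False, shared_tags=True,
                        rotational=False, nr_hw_queues=4)
emmc_like = QueueTraits(has_commit_rqs=False, shared_tags=False,
                        rotational=False, nr_hw_queues=1)
corner_case = QueueTraits(has_commit_rqs=False, shared_tags=False,
                          rotational=False, nr_hw_queues=4)
```

Each of the common device classes fails at least one condition, so only the last example would have taken the removed path.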
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211123160443.1315598-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

sr: set GENHD_FL_REMOVABLE earlier · a4561f9f

Christoph Hellwig authored Nov 22, 2021

Set up GENHD_FL_REMOVABLE together with the rest of the gendisk fields.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: cleanup the GENHD_FL_* definitions · 430cc5d3

Christoph Hellwig authored Nov 22, 2021

Switch to an enum and tidy up the documentation.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: don't set GENHD_FL_NO_PART for hidden gendisks · 9f18db57

Christoph Hellwig authored Nov 22, 2021

Hidden gendisks can't be opened using blkdev_get_*, so we can't really
reach any of the partition scanning paths or partitioning ioctls except
for the initial partition scan from add_disk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-13-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove GENHD_FL_EXT_DEVT · 1ebe2e5f

Christoph Hellwig authored Nov 22, 2021

All modern drivers can support extra partitions using the extended
dev_t.  In fact except for the ioctl method drivers never even see
partitions in normal operation.

So remove GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicitly disallow them using
GENHD_FL_NO_PART.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: remove GENHD_FL_SUPPRESS_PARTITION_INFO · 3b5149ac

Christoph Hellwig authored Nov 22, 2021

This flag is not set directly anywhere and only inherited from
GENHD_FL_HIDDEN. Just check for GENHD_FL_HIDDEN instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mmc: don't set GENHD_FL_SUPPRESS_PARTITION_INFO · 79b0f79a

Christoph Hellwig authored Nov 22, 2021

This manually reverts 07b652cdbec3 ("mmc: card: Don't show eMMC RPMB and
BOOT areas in /proc/partitions"). Based on the commit description that
change was purely cosmetic. mmc is the last driver that sets this
flag and thus prevents it from being removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

null_blk: don't suppress partitioning information · 94b49c3d

Christoph Hellwig authored Nov 22, 2021

This manually reverts commit 27290b469051 ("null_blk: suppress invalid
partition info"). The message in that commit log can't appear as
the flag is never checked during probing, and there is no good reason
to treat null_blk specially in /proc/partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
