1. 16 Feb, 2023 5 commits
    • block: Fix io statistics for cgroup in throttle path · 0f7c8f0f
      Jinke Han authored
      In the current code, io statistics are missing for a cgroup when its
      bio is throttled by blk-throttle. Fix it by moving the unreached
      accounting code into submit_bio_noacct_nocheck.
      
      Fixes: 3f98c753 ("block: don't check bio in blk_throtl_dispatch_work_fn")
      Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230216032250.74230-1-hanjinke.666@bytedance.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
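The fix above can be modeled with a small sketch: do the accounting in the common submission path before the throttle decision, so a throttled bio is still counted exactly once. All names and the counter below are illustrative stand-ins, not the kernel's blk-cgroup code.

```c
#include <assert.h>
#include <stdbool.h>

static int io_count;            /* stands in for per-cgroup io statistics */

static bool throttled(int bio)  /* pretend odd-numbered "bios" get throttled */
{
    return bio % 2 != 0;
}

/* Illustrative model: account first, then check the throttle, so a bio
 * that blk-throttle defers (and later resubmits internally) is still
 * counted once at its first submission. */
static void submit_bio_noacct_nocheck(int bio)
{
    io_count++;                 /* stats recorded unconditionally */
    if (throttled(bio))
        return;                 /* queued for later dispatch; no recount */
    /* ... issue the bio ... */
}
```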
    • brd: mark as nowait compatible · 67205f80
      Jens Axboe authored
      By default, non-mq drivers do not support nowait. This causes io_uring
      to use a slower path, as the driver cannot be trusted not to block. brd
      can safely set the nowait flag: in the worst case, all it does is a
      NOIO allocation.
      
      For io_uring, this makes a substantial difference. Before:
      
      submitter=0, tid=453, file=/dev/ram0, node=-1
      polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128
      Engine=io_uring, sq_ring=128, cq_ring=128
      IOPS=440.03K, BW=1718MiB/s, IOS/call=32/31
      IOPS=428.96K, BW=1675MiB/s, IOS/call=32/32
      IOPS=442.59K, BW=1728MiB/s, IOS/call=32/31
      IOPS=419.65K, BW=1639MiB/s, IOS/call=32/32
      IOPS=426.82K, BW=1667MiB/s, IOS/call=32/31
      
      and after:
      
      submitter=0, tid=354, file=/dev/ram0, node=-1
      polled=0, fixedbufs=1/0, register_files=1, buffered=0, QD=128
      Engine=io_uring, sq_ring=128, cq_ring=128
      IOPS=3.37M, BW=13.15GiB/s, IOS/call=32/31
      IOPS=3.45M, BW=13.46GiB/s, IOS/call=32/31
      IOPS=3.43M, BW=13.42GiB/s, IOS/call=32/32
      IOPS=3.43M, BW=13.39GiB/s, IOS/call=32/31
      IOPS=3.43M, BW=13.38GiB/s, IOS/call=32/31
      
      or about an 8x difference. Now that brd is prepared to deal with
      REQ_NOWAIT reads/writes, mark it as supporting them.
      
      Cc: stable@vger.kernel.org # 5.10+
      Link: https://lore.kernel.org/linux-block/20230203103005.31290-1-p.raghav@samsung.com/
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • brd: check for REQ_NOWAIT and set correct page allocation mask · 6ded703c
      Jens Axboe authored
      If REQ_NOWAIT is set, then do a non-blocking allocation if the operation
      is a write and we need to insert a new page. Currently REQ_NOWAIT cannot
      be set, as the queue isn't marked as supporting nowait; this change is
      in preparation for allowing that.
      
      radix_tree_preload() warns on attempting to call it with an allocation
      mask that doesn't allow blocking. While that warning could arguably
      be removed, we need to handle radix insertion failures anyway as they
      are more likely if we cannot block to get memory.
      
      Remove legacy BUG_ON()'s and turn them into proper errors instead, one
      for the allocation failure and one for finding a page that doesn't
      match the correct index.
      
      Cc: stable@vger.kernel.org # 5.10+
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
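A minimal sketch of the allocation-mask choice this commit describes; the flag values and the gfp_t type below are made-up stand-ins for the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_t;
#define GFP_NOIO    0x1u        /* may sleep, but no I/O recursion (stand-in) */
#define GFP_NOWAIT  0x2u        /* must not sleep (stand-in)                  */

/* With REQ_NOWAIT the write path must not block to insert a new page:
 * pick a non-blocking mask and let the caller turn an allocation (or
 * radix-tree insertion) failure into an error instead of a BUG_ON(). */
static gfp_t brd_page_gfp(bool nowait)
{
    return nowait ? GFP_NOWAIT : GFP_NOIO;
}
```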
    • brd: return 0/-error from brd_insert_page() · db0ccc44
      Jens Axboe authored
      It currently returns a page, but callers just check for NULL/page to
      gauge success. Clean this up and return the appropriate error directly
      instead.
      
      Cc: stable@vger.kernel.org # 5.10+
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
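The cleanup's shape can be sketched as follows; the helper name alloc_page_maybe and the failure injection are hypothetical, and only the 0/-error convention mirrors the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct page;                         /* opaque in this sketch */

static struct page *alloc_page_maybe(int fail)
{
    return fail ? NULL : malloc(1);  /* hypothetical allocator stand-in */
}

/* Return 0 on success or a negative errno, instead of a page pointer
 * that callers can only test against NULL. */
static int brd_insert_page(int fail)
{
    struct page *page = alloc_page_maybe(fail);

    if (!page)
        return -ENOMEM;              /* error propagated, not just NULL */
    /* ... insert the page into the backing tree ... */
    free(page);                      /* sketch only: nothing retains it */
    return 0;
}
```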
    • block: sync mixed merged request's failfast with 1st bio's · 3ce6a115
      Ming Lei authored
      We support mixed merge for requests/bios with different failfast
      settings. When a request fails, we handle only the portion with the
      same failfast setting at a time: bios with failfast can be failed
      immediately, while bios without failfast can be retried.
      
      The idea is pretty good, but the current implementation has several
      defects:
      
      1) an RA bio initially doesn't set failfast; however, the bio merge
      code doesn't consider this point and just checks the bio's failfast
      setting when deciding if a mixed merge is required. Fix this issue by
      adding the bio_failfast() helper.
      
      2) when merging a bio to the request front, if this request is mixed
      merged, we have to sync the request's failfast setting with the 1st
      bio's failfast. Fix it by calling blk_update_mixed_merge().
      
      3) when merging a bio to the request back, if this request is mixed
      merged, we have to mark the bio as failfast, because blk_update_request()
      simply updates the request's failfast from the 1st bio's failfast. Fix
      it by calling blk_update_mixed_merge().
      
      This fixes a normal EXT4 READ IO failure: the normal READ IO was
      observed to be merged with an RA IO, and the mixed merged request had
      a different failfast setting from its 1st bio's, so the normal READ
      IO was never retried.
      
      Cc: Tejun Heo <tj@kernel.org>
      Fixes: 80a761fd ("block: implement mixed merge of different failfast requests")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20230209125527.667004-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
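Point 1 can be sketched as a tiny helper: a readahead bio should contribute no failfast bits to the merge decision. The flag values here are illustrative, not the kernel's actual REQ_* encoding.

```c
#include <assert.h>

#define REQ_FAILFAST_MASK 0x7u   /* stand-in for the failfast bits */
#define REQ_RAHEAD        0x8u   /* stand-in for the readahead bit */

/* An RA bio doesn't really carry failfast semantics, so treat its
 * failfast bits as clear when deciding whether a mixed merge (or a
 * failfast sync) is needed. */
static unsigned int bio_failfast(unsigned int bi_opf)
{
    if (bi_opf & REQ_RAHEAD)
        return 0;
    return bi_opf & REQ_FAILFAST_MASK;
}
```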
  2. 15 Feb, 2023 1 commit
  3. 14 Feb, 2023 7 commits
  4. 13 Feb, 2023 1 commit
  5. 10 Feb, 2023 3 commits
  6. 09 Feb, 2023 4 commits
  7. 08 Feb, 2023 3 commits
    • Merge branch 'md-next' of... · a872818f
      Jens Axboe authored
      Merge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into for-6.3/block
      
      Pull MD fix from Song:
      
      "This commit fixes a rare crash during the takeover process."
      
      * 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
        md: account io_acct_set usage with active_io
    • md: account io_acct_set usage with active_io · 76fed014
      Xiao Ni authored
      io_acct_set was enabled for raid0/raid5 io accounting. Bios that contain
      md_io_acct are allocated in the I/O path, and there isn't a good way to
      monitor whether these bios have all finished and been freed. In the
      takeover process, io_acct_set (which is used for bios with md_io_acct)
      needs to be freed. However, if some bios finish after io_acct_set is
      freed, it may trigger the following panic:
      
      [ 6973.767999] RIP: 0010:mempool_free+0x52/0x80
      [ 6973.786098] Call Trace:
      [ 6973.786549]  md_end_io_acct+0x31/0x40
      [ 6973.787227]  blk_update_request+0x224/0x380
      [ 6973.787994]  blk_mq_end_request+0x1a/0x130
      [ 6973.788739]  blk_complete_reqs+0x35/0x50
      [ 6973.789456]  __do_softirq+0xd7/0x2c8
      [ 6973.790114]  ? sort_range+0x20/0x20
      [ 6973.790763]  run_ksoftirqd+0x2a/0x40
      [ 6973.791400]  smpboot_thread_fn+0xb5/0x150
      [ 6973.792114]  kthread+0x10b/0x130
      [ 6973.792724]  ? set_kthread_struct+0x50/0x50
      [ 6973.793491]  ret_from_fork+0x1f/0x40
      
      Fix this by increasing and decreasing active_io for each bio with
      md_io_acct so that mddev_suspend() will wait until all bios from
      io_acct_set finish before freeing io_acct_set.

      Reported-by: Fine Fan <ffan@redhat.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
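The accounting idea reduces to a counter that is raised for each accounted bio and must drain to zero before the mempool may be freed. This is a schematic model, not the md driver code; all names are illustrative.

```c
#include <assert.h>

static int active_io;            /* stands in for mddev->active_io */

static void md_io_acct_start(void) { active_io++; }  /* bio with md_io_acct issued */
static void md_end_io_acct(void)   { active_io--; }  /* that bio completed         */

/* mddev_suspend() waits for this to become true before the takeover
 * path frees io_acct_set, so no late completion touches a freed pool. */
static int io_acct_set_idle(void)
{
    return active_io == 0;
}
```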
    • block: ublk: improve handling device deletion · 0abe39de
      Ming Lei authored
      Inside ublk_ctrl_del_dev(), when the device is removed, we wait until
      the device number is freed while holding the global lock
      ublk_ctl_mutex. This isn't friendly from the user's viewpoint:
      
      1) if the device is in use, the current delete command hangs in
      ublk_ctrl_del_dev(), and the user can't break out of the handling
      because wait_event() is used
      
      2) the global lock is held, so no new device can be added and no
      other old device can be removed.
      
      Improve the deletion handling in the following way, as suggested by
      Nadav:
      
      1) wait without holding the global lock
      
      2) replace wait_event() with wait_event_interruptible()

      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Suggested-by: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20230207150700.545530-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
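The reworked ordering can be sketched as: mutate state under the global lock, drop the lock, then wait. Everything below (the names, the busy flag, and the loop standing in for wait_event_interruptible()) is an illustrative model, not the ublk driver.

```c
#include <assert.h>

static int ublk_ctl_mutex_held;   /* stand-in for the global ublk_ctl_mutex */
static int dev_number_in_use;     /* cleared when device teardown completes */

static void ctl_lock(void)   { assert(!ublk_ctl_mutex_held); ublk_ctl_mutex_held = 1; }
static void ctl_unlock(void) { assert(ublk_ctl_mutex_held);  ublk_ctl_mutex_held = 0; }

static int ublk_ctrl_del_dev(void)
{
    ctl_lock();
    /* ... remove the device and schedule its teardown ... */
    ctl_unlock();

    /* Wait *without* the global lock held, so other control commands can
     * proceed; the kernel also replaces wait_event() with
     * wait_event_interruptible(), letting a signal break the wait. */
    while (dev_number_in_use) {
        assert(!ublk_ctl_mutex_held);   /* the key property of the rework */
        dev_number_in_use = 0;          /* sketch: pretend teardown finished */
    }
    return 0;
}
```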
  8. 07 Feb, 2023 5 commits
  9. 06 Feb, 2023 11 commits