Commits · 285207f67c9bcad1d9168993f175d6d88ce310f1 · Kirill Smelkov / linux

15 Apr, 2024 40 commits

io_uring/kbuf: remove dead define · 285207f6

Jens Axboe authored Mar 29, 2024

We no longer use IO_BUFFER_LIST_BUF_PER_PAGE, kill it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: fix warnings on shadow variables · 1da2f311

Jens Axboe authored Mar 29, 2024

There are a few of those:

io_uring/fdinfo.c:170:16: warning: declaration shadows a local variable [-Wshadow]
  170 |                 struct file *f = io_file_from_index(&ctx->file_table, i);
      |                              ^
io_uring/fdinfo.c:53:67: note: previous declaration is here
   53 | __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *f)
      |                                                                   ^
io_uring/cancel.c:187:25: warning: declaration shadows a local variable [-Wshadow]
  187 |                 struct io_uring_task *tctx = node->task->io_uring;
      |                                       ^
io_uring/cancel.c:166:31: note: previous declaration is here
  166 |                              struct io_uring_task *tctx,
      |                                                    ^
io_uring/register.c:371:25: warning: declaration shadows a local variable [-Wshadow]
  371 |                 struct io_uring_task *tctx = node->task->io_uring;
      |                                       ^
io_uring/register.c:312:24: note: previous declaration is here
  312 |         struct io_uring_task *tctx = NULL;
      |                               ^

and a simple cleanup gets rid of them. For the fdinfo case, make a
distinction between the file being passed in (for the ring), and the
registered files we iterate. For the other two cases, just get rid of
the shadowed variable; there's no reason to have a new one.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
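
As an illustration of the fdinfo-style fix described above, here is a
minimal userspace sketch. The struct and function names are hypothetical
stand-ins, not the kernel code; the point is only that the loop variable
gets a name distinct from the parameter, so -Wshadow has nothing to flag.

```c
struct file { int id; };

/* Hypothetical stand-in for looking up a registered file by index. */
static struct file *lookup_registered(struct file *table, int i)
{
	return &table[i];
}

/*
 * 'ring_file' is the file passed in for the ring itself. The files we
 * iterate use a separate name ('reg'), so no declaration shadows the
 * parameter.
 */
static int sum_ids(struct file *ring_file, struct file *table, int nr)
{
	int sum = ring_file->id;

	for (int i = 0; i < nr; i++) {
		struct file *reg = lookup_registered(table, i);

		sum += reg->id;
	}
	return sum;
}
```

Had the loop variable reused the parameter's name, the code would still
compile, but the parameter would be inaccessible inside the loop body.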

io_uring: move mapping/allocation helpers to a separate file · f15ed8b4

Jens Axboe authored Mar 27, 2024

Move the related code from io_uring.c into memmap.c. No functional
changes in this patch, just cleaning it up a bit now that the full
transition is done.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use unpin_user_pages() where appropriate · 18595c0a

Jens Axboe authored Mar 13, 2024

There are a few cases of open-coded loops around unpin_user_page(); use
the generic helper instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
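
A rough before/after sketch of this kind of cleanup, in userspace C. The
two unpin helpers below are mock stand-ins that only decrement a counter;
they borrow the kernel names for readability but are not the kernel
implementations.

```c
struct page { int pinned; };

/* Mock stand-ins for the kernel helpers of the same names. */
static void unpin_user_page(struct page *page)
{
	page->pinned--;
}

static void unpin_user_pages(struct page **pages, unsigned long npages)
{
	for (unsigned long i = 0; i < npages; i++)
		unpin_user_page(pages[i]);
}

/* Before: every call site carries its own loop. */
static void release_open_coded(struct page **pages, unsigned long npages)
{
	for (unsigned long i = 0; i < npages; i++)
		unpin_user_page(pages[i]);
}

/* After: a single call to the batch helper. */
static void release_batched(struct page **pages, unsigned long npages)
{
	unpin_user_pages(pages, npages);
}
```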

io_uring/kbuf: use vm_insert_pages() for mmap'ed pbuf ring · 87585b05

Jens Axboe authored Mar 12, 2024

Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.

This requires a bit of effort on the mmap lookup side, as the ctx
uring_lock isn't held, which otherwise protects buffer_lists from being
torn down. It's not safe to grab it from mmap context, as that would
introduce an ABBA deadlock between the mmap lock and the ctx uring_lock.
Instead, look up the buffer_list under RCU, as the list is RCU freed
already. Use the existing reference count to determine whether it's
possible to safely grab a reference to it (e.g. if it's not zero already),
and drop that reference when done with the mapping. If the mmap
reference is the last one, the buffer_list and the associated memory can
go away, since the vma insertion has references to the inserted pages at
that point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
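
The "grab a reference only if it's not zero already" step can be sketched
with C11 atomics. This is an assumption-laden userspace model: struct
buf_list, get_unless_zero() and put_ref() are hypothetical names, and the
RCU read-side protection that keeps the object's memory valid during the
lookup is not modeled here.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct buf_list { atomic_int refs; };

/* Take a reference only if the object is still live (refs > 0). */
static bool get_unless_zero(struct buf_list *bl)
{
	int old = atomic_load(&bl->refs);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&bl->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool put_ref(struct buf_list *bl)
{
	return atomic_fetch_sub(&bl->refs, 1) == 1;
}
```

A caller that gets false from get_unless_zero() treats the lookup as
failed, since the object is already on its way to being freed.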

io_uring/kbuf: vmap pinned buffer ring · e270bfd2

Jens Axboe authored Mar 12, 2024

This avoids needing to care about HIGHMEM, and it makes the buffer
indexing easier as both ring provided buffer methods are now virtually
mapped in a contiguous fashion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: unify io_pin_pages() · 1943f96b

Jens Axboe authored Mar 13, 2024

Move it into io_uring.c where it belongs, and use it in there as well
rather than have two implementations of this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use vmap() for ring mapping · 09fc75e0

Jens Axboe authored Mar 13, 2024

This is the last holdout which does odd page checking, convert it to
vmap just like what is done for the non-mmap path.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: get rid of remap_pfn_range() for mapping rings/sqes · 3ab1db3c

Jens Axboe authored Mar 13, 2024

Rather than use remap_pfn_range() for this and manually free later,
switch to using vm_insert_pages() and have it Just Work.

If possible, allocate a single compound page that covers the range that
is needed. If that works, then we can just use page_address() on that
page. If we fail to get a compound page, allocate single pages and use
vmap() to map them into the kernel virtual address space.

This just covers the rings/sqes; the other remaining user of
remap_pfn_range() in the mmap path will be converted separately. Once
that is done, we can kill the old alloc/free code.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mm: add nommu variant of vm_insert_pages() · 62346c6c

Jens Axboe authored Mar 16, 2024

An identical one exists for vm_insert_page(), add one for
vm_insert_pages() to avoid needing to check for CONFIG_MMU in code using
it.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: Avoid anonymous enums in io_uring uapi · 0f21a957

Gabriel Krisman Bertazi authored Mar 28, 2024

While valid C, anonymous enums confuse Cython (Python to C translator),
as reported by Ritesh (YoSTEALTH) [1]. Since people rely on it when
building against liburing and we want to keep this header in sync with
the library version, let's name the existing enums in the uapi header.

[1] https://github.com/cython/cython/issues/3240

Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20240328210935.25640-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use the right type for work_llist empty check · 22537c9f

Jens Axboe authored Mar 25, 2024

io_task_work_pending() uses wq_list_empty() on ctx->work_llist, but it's
not an io_wq_work_list, it's a struct llist_head. They both have
->first as head-of-list, and it turns out the checks are identical. But
be proper and use the right helper.

Fixes: dac6a0ea ("io_uring: ensure iopoll runs local task work as well")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
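
To see why the mismatched helper went unnoticed, consider this simplified
userspace sketch. The struct layouts are reduced to what the check needs,
and wq_list_empty() is modeled as a macro that only touches ->first (an
assumption made for illustration); the kernel versions also use
READ_ONCE(), which is omitted here.

```c
#include <stdbool.h>
#include <stddef.h>

struct llist_node { struct llist_node *next; };
struct llist_head { struct llist_node *first; };

struct io_wq_work_node { struct io_wq_work_node *next; };
struct io_wq_work_list {
	struct io_wq_work_node *first;
	struct io_wq_work_node *last;
};

/* Type-checked helper: only accepts a struct llist_head. */
static bool llist_empty(const struct llist_head *head)
{
	return head->first == NULL;
}

/*
 * Macro modeled on the work-list helper: it accepts anything that has
 * a ->first member, so passing a struct llist_head compiles and gives
 * the same answer.
 */
#define wq_list_empty(list) ((list)->first == NULL)
```

Both checks agree on an llist_head, which is why the commit describes the
change as using the proper helper rather than fixing a behavioral bug.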

io_uring: Remove the now superfluous sentinel elements from ctl_table array · a80929d1

Joel Granados authored Mar 28, 2024

This commit comes at the tail end of a greater effort to remove the
empty elements at the end of the ctl_table arrays (sentinels), which will
reduce the overall build-time size of the kernel and run-time memory
bloat by ~64 bytes per sentinel (further information:
https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/).

Remove sentinel element from kernel_io_uring_disabled_table
Signed-off-by: Joel Granados <j.granados@samsung.com>
Link: https://lore.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-6-47c1463b3af2@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
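
The general idea of dropping a sentinel can be sketched in plain C.
struct entry below is a hypothetical, much smaller stand-in for struct
ctl_table, so the byte saving shown is sizeof(struct entry) rather than
the ~64 bytes quoted above.

```c
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical, simplified table entry. */
struct entry {
	const char *name;
	int *data;
};

static int disabled;

/* Old style: an empty element marks the end of the table. */
static const struct entry with_sentinel[] = {
	{ "io_uring_disabled", &disabled },
	{ NULL, NULL },
};

/* New style: the size is known at build time, no terminator needed. */
static const struct entry without_sentinel[] = {
	{ "io_uring_disabled", &disabled },
};

/* Walks the table until the empty terminating element. */
static size_t count_until_sentinel(const struct entry *table)
{
	size_t n = 0;

	while (table[n].name)
		n++;
	return n;
}
```

With the size passed explicitly (for example via ARRAY_SIZE()), each
table shrinks by exactly one element.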

io_uring: Remove unused function · 4e9706c6

Jiapeng Chong authored Mar 28, 2024

The function is defined in the io_uring.c file but not called
anywhere, so delete the unused function.

io_uring/io_uring.c:646:20: warning: unused function '__io_cq_unlock'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8660
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Link: https://lore.kernel.org/r/20240328022324.78029-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: re-arrange Makefile order · 77a1cd5e

Jens Axboe authored Mar 26, 2024

The object list is a bit of a mess, with core and opcode files mixed in.
Re-arrange it so that we have the core bits first, and then opcode
specific files after that.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: refill request cache in memory order · 05eb5fe2

Jens Axboe authored Mar 25, 2024

The allocator will generally return memory in order, but
__io_alloc_req_refill() then adds them to a stack and we'll extract them
in the opposite order. This obviously isn't a huge deal, but:

1) it makes debugging easier when they are in order
2) keeping them in-order is the right thing to do
3) reduces the code for adding them to the stack

Just add them in reverse to the stack.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
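
A small userspace sketch of the ordering effect, assuming a simple
singly linked stack. struct req, push(), pop() and refill_reverse() are
illustrative names, not the kernel's: pushing a batch in reverse means
later pops return the elements in ascending memory order.

```c
#include <stddef.h>

struct req { struct req *next; };

static void push(struct req **stack, struct req *r)
{
	r->next = *stack;
	*stack = r;
}

static struct req *pop(struct req **stack)
{
	struct req *r = *stack;

	if (r)
		*stack = r->next;
	return r;
}

/*
 * 'batch' holds n elements in memory order. Pushing the last one
 * first leaves batch[0] on top of the stack.
 */
static void refill_reverse(struct req **stack, struct req *batch, int n)
{
	while (n--)
		push(stack, &batch[n]);
}
```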

io_uring/poll: shrink alloc cache size to 32 · da22bdf3

Jens Axboe authored Mar 21, 2024

This should be plenty, rather than the default of 128, and matches what
we have on the rsrc and futex side as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/alloc_cache: switch to array based caching · 414d0f45

Jens Axboe authored Mar 20, 2024

Currently lists are being used to manage this, but best practice is
usually to have these in an array instead as that is cheaper to manage.

Outside of that detail, games are also played with KASAN as the list
is inside the cached entry itself.

Finally, all users of this need a struct io_cache_entry embedded in
their struct, which is union'ized with something else in there that
isn't used across the free -> realloc cycle.

Get rid of all of that, and simply have it be an array. This will not
change the memory used, as we're just trading an 8-byte member entry
for the per-elem array size.

This reduces the overhead of the recycled allocations, and it reduces
the amount of code needed to support recycling to about half of
what it currently is.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
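
A minimal sketch of what an array-based recycling cache can look like,
written for userspace. The struct layout and the cache_init(),
cache_put() and cache_get() names are assumptions made for this example
and are not taken from the kernel source. Cached objects need no embedded
list entry; the cache only stores pointers to them.

```c
#include <stdbool.h>
#include <stdlib.h>

struct alloc_cache {
	void **entries;          /* array of recycled object pointers */
	unsigned int nr_cached;  /* number of valid slots */
	unsigned int max_cached; /* capacity of 'entries' */
};

static bool cache_init(struct alloc_cache *c, unsigned int max)
{
	c->entries = calloc(max, sizeof(void *));
	c->nr_cached = 0;
	c->max_cached = max;
	return c->entries != NULL;
}

/* Returns false when full; the caller then frees the object itself. */
static bool cache_put(struct alloc_cache *c, void *obj)
{
	if (c->nr_cached >= c->max_cached)
		return false;
	c->entries[c->nr_cached++] = obj;
	return true;
}

/* Returns NULL when empty; the caller then allocates a fresh object. */
static void *cache_get(struct alloc_cache *c)
{
	return c->nr_cached ? c->entries[--c->nr_cached] : NULL;
}
```

Get and put are each a bounds check plus one array access, and the most
recently recycled object is handed out first.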

io_uring: drop ->prep_async() · e10677a8

Jens Axboe authored Mar 18, 2024

It's now unused, drop the code related to it. This includes the
io_issue_defs->manual_alloc field.

While in there, and since ->async_size is now being used a bit more
frequently and in the issue path, move it to io_issue_defs[].
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/uring_cmd: defer SQE copying until it's needed · 5eff57fa

Jens Axboe authored Mar 20, 2024

The previous commit turned on async data for uring_cmd, and did the
basic conversion of setting everything up on the prep side. However, for
a lot of use cases, -EIOCBQUEUED will get returned on issue, as the
operation got successfully queued. For that case, a persistent SQE isn't
needed, as it's just used for issue.

Unless execution goes async immediately, defer copying the double SQE
until it's necessary.

This greatly reduces the overhead of such commands, as evidenced by
a perf diff from before and after this change:

    10.60%     -8.58%  [kernel.vmlinux]  [k] io_uring_cmd_prep

where the prep side drops from 10.60% to ~2%, which is more expected.
Performance also rises from ~113M IOPS to ~122M IOPS, bringing us back
to where it was before the async command prep.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
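
The deferred-copy idea can be sketched as follows. This is a userspace
model under stated assumptions: struct sqe is taken to be a 128-byte
blob, and struct cmd, cmd_prep() and cmd_stabilize() are hypothetical
names rather than the kernel's functions. Prep only records a pointer to
the submission entry; the copy into persistent storage happens once, and
only if the command has to outlive the issue attempt.

```c
#include <string.h>

/* Assumed size of a double SQE for this sketch. */
struct sqe { unsigned char data[128]; };

struct cmd {
	const struct sqe *sqe; /* ring memory until stabilized */
	struct sqe copy;       /* persistent storage */
};

/* Copy the SQE into persistent storage, at most once. */
static void cmd_stabilize(struct cmd *c)
{
	if (c->sqe == &c->copy)
		return;
	memcpy(&c->copy, c->sqe, sizeof(c->copy));
	c->sqe = &c->copy;
}

/* Prep: copy right away only when execution is forced async. */
static void cmd_prep(struct cmd *c, const struct sqe *ring_sqe,
		     int force_async)
{
	c->sqe = ring_sqe;
	if (force_async)
		cmd_stabilize(c);
}
```

In the common case where issue queues the operation successfully,
cmd_stabilize() is never called and the memcpy() is avoided entirely.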

io_uring/uring_cmd: switch to always allocating async data · d10f19df

Jens Axboe authored Mar 18, 2024

Basic conversion ensuring async_data is allocated off the prep path. Adds
a basic alloc cache as well, as passthrough IO can be quite high in rate.
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: move connect to always using async data · e2ea5a70

Jens Axboe authored Mar 18, 2024

While doing that, get rid of io_async_connect and just use the generic
io_async_msghdr. Both of them have a struct sockaddr_storage in there,
and while io_async_msghdr is bigger, if the same type can be used then
the netmsg_cache can get reused for connect as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rw: add iovec recycling · d6f911a6

Jens Axboe authored Mar 18, 2024

Let the io_async_rw hold on to the iovec and reuse it, rather than always
allocate and free them.

Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.

While doing so, shrink io_async_rw by getting rid of the bigger embedded
fast iovec. Since iovecs are being recycled now, shrink it from 8 to 1.
This reduces the io_async_rw size from 264 to 160 bytes, a 40% reduction.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rw: cleanup retry path · cca65713

Jens Axboe authored Mar 22, 2024

We no longer need to gate a potential retry on whether or not the
context matches our original task, as all read/write operations have
been fully prepared upfront. This means there's never any re-import
needed, and hence we can always retry requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: get rid of struct io_rw_state · 0d10bd77

Jens Axboe authored Mar 18, 2024

A separate state struct is not needed anymore, just fold it in with
io_async_rw.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/rw: always setup io_async_rw for read/write requests · a9165b83

Jens Axboe authored Mar 18, 2024

read/write requests try to put everything on the stack, and then alloc
and copy if a retry is needed. This necessitates a bunch of nasty code
that deals with intermediate state.

Get rid of this, and have the prep side setup everything that is needed
upfront, which greatly simplifies the opcode handlers.

This includes adding an alloc cache for io_async_rw, to make it cheap
to handle.

In terms of cost, this should be basically free and transparent. For
the worst case of {READ,WRITE}_FIXED which didn't need it before,
performance is unaffected in the normal peak workload that is being
used to test that. Still runs at 122M IOPS.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: drop 'kmsg' parameter from io_req_msg_cleanup() · d80f9407

Jens Axboe authored Mar 18, 2024

Now that iovec recycling is being done, the iovec is no longer being
freed in there. Hence the kmsg parameter is now useless.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: add iovec recycling · 75191341

Jens Axboe authored Mar 16, 2024

Right now the io_async_msghdr is recycled to avoid the overhead of
allocating+freeing it for every request. But the iovec is not included,
hence that will be allocated and freed for each transfer regardless.
This commit enables recycling of the iovec between io_async_msghdr
recycles. This avoids alloc+free for each one if an iovec is used, and
on top of that, it extends the cache hot nature of msg to the iovec as
well.

Also enables KASAN for the iovec entries, so that reuse can be detected
even while they are in the cache.

The io_async_msghdr also shrinks from 376 -> 288 bytes, an 88 byte
saving (or ~23% smaller), as the fast_iovec entry is dropped from 8
entries to a single entry. There's no point keeping a big fast iovec
entry, if iovecs aren't being allocated and freed continually.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: remove (now) dead code in io_netmsg_recycle() · 9f8539fe

Jens Axboe authored Mar 20, 2024

All net commands have async data at this point, there's no reason to
check if this is the case or not.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: kill io_msg_alloc_async_prep() · 6498c5c9

Jens Axboe authored Mar 18, 2024

We now ONLY call io_msg_alloc_async() from inside prep handling, which
is always locked. No need for this helper anymore, or the check in
io_msg_alloc_async() on whether the ring is locked or not.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: get rid of ->prep_async() for send side · 50220d6a

Jens Axboe authored Mar 18, 2024

Move the io_async_msghdr out of the issue path and into prep handling,
since it's now done unconditionally and hence does not need to be part
of the issue path. This removes any usage of io_sendrecv_prep_async()
and io_sendmsg_prep_async(), and hence the forced async setup path is
now unified with the normal prep setup.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: get rid of ->prep_async() for receive side · c6f32c7d

Jens Axboe authored Mar 18, 2024

Move the io_async_msghdr out of the issue path and into prep handling,
since it's now done unconditionally and hence does not need to be part
of the issue path. This reduces the footprint of the multishot fast
path of multiple invocations of ->issue() per prep, and also means that
using ->prep_async() can be dropped for recvmsg as this is now done via
setup on the prep side.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: always set kmsg->msg.msg_control_user before issue · 3ba8345a

Jens Axboe authored Apr 12, 2024

We currently set this separately for async/sync entry, but let's just
move it to a generic pre-issue spot and eliminate the difference
between the two.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: always setup an io_async_msghdr · 790b68b3

Jens Axboe authored Mar 16, 2024

Rather than use an on-stack one and then need to allocate and copy if
async execution is required, always grab one upfront. This should be
very cheap, and potentially even have cache hotness benefits for
back-to-back send/recv requests.

For any recv type of request, this is probably a good choice in general,
as it's expected that no data is available initially. For send this is
not necessarily the case, as space in the socket buffer is expected to
be available. However, getting a cached io_async_msghdr is very cheap,
and as it should be cache hot, probably the difference here is negligible,
if any.

A nice side benefit is that io_setup_async_msg can get killed
completely, which has some nasty iovec manipulation code.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: unify cleanup handling · f5b00ab2

Jens Axboe authored Mar 12, 2024

Now that recv/recvmsg both do the same cleanup, put it in the retry and
finish handlers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: switch io_recv() to using io_async_msghdr · 4a3223f7

Jens Axboe authored Mar 05, 2024

No functional changes in this patch, just in preparation for carrying
more state than what is available now, if necessary.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/net: switch io_send() and io_send_zc() to using io_async_msghdr · 54cdcca0

Jens Axboe authored Mar 05, 2024

No functional changes in this patch, just in preparation for carrying
more state than what is being done now, if necessary. While unifying
some of this code, add a generic send setup prep handler that they can
both use.

This gets rid of some manual msghdr and sockaddr on the stack, and makes
it look a bit more like the sendmsg/recvmsg variants. Going forward, more
can get unified on top.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring/alloc_cache: shrink default max entries from 512 to 128 · 0ae9b9a1

Jens Axboe authored Mar 16, 2024

In practice, we just need to recycle a few elements for (by far) most
use cases. Shrink the total size down from 512 to 128, which should be
more than plenty.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove timeout/poll specific cancelations · 29f858a7

Jens Axboe authored Mar 16, 2024

For historical reasons these were special cased, as they were the only
ones that needed cancelation. But now we handle cancelations generally,
and hence there's no need to check for these in
io_ring_ctx_wait_and_kill() when io_uring_try_cancel_requests() handles
both these and the rest as well.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: flush delayed fallback task_work in cancelation · 25417623

Jens Axboe authored Mar 18, 2024

Just like we run the inline task_work, ensure we also factor in and
run the fallback task_work.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
