  1. 20 Sep, 2024 2 commits
    • io_uring: check if we need to reschedule during overflow flush · eac2ca2d
      Jens Axboe authored
      In terms of normal application usage, this list will always be empty.
      And if an application does overflow a bit, it'll have a few entries.
      However, nothing obviously prevents syzbot from running a test case
      that generates a ton of overflow entries, and then flushing them can
      take quite a while.
      
      Check for needing to reschedule while flushing, and drop our locks and
      do so if necessary. There's no state to maintain here as overflows
      always prune from head-of-list, hence it's fine to drop and reacquire
      the locks at the end of the loop.
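
      The shape of the loop is roughly the following (a sketch only; the
      lock, list, and helper names are illustrative rather than the verbatim
      patch):

        while (!list_empty(&ctx->cq_overflow_list)) {
                struct io_overflow_cqe *ocqe;

                /* pop the oldest overflow entry and post it to the CQ ring */
                ocqe = list_first_entry(&ctx->cq_overflow_list,
                                        struct io_overflow_cqe, list);
                post_overflow_cqe(ctx, ocqe);   /* illustrative helper */
                list_del(&ocqe->list);
                kfree(ocqe);

                if (need_resched()) {
                        /* safe to drop the lock: we always prune from the head */
                        spin_unlock(&ctx->completion_lock);
                        cond_resched();
                        spin_lock(&ctx->completion_lock);
                }
        }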
      
      Link: https://lore.kernel.org/io-uring/66ed061d.050a0220.29194.0053.GAE@google.com/
      Reported-by: syzbot+5fca234bd7eb378ff78e@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eac2ca2d
    • io_uring: improve request linking trace · eed138d6
      Jens Axboe authored
      Right now any link trace is listed as being linked after the head
      request in the chain, but it's more useful to note explicitly which
      request a given new request is chained to. Change the link trace to dump
      the tail request so that chains are immediately apparent when looking at
      traces.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eed138d6
  2. 19 Sep, 2024 1 commit
    • io_uring: check for presence of task_work rather than TIF_NOTIFY_SIGNAL · 04beb6e0
      Jens Axboe authored
      If some part of the kernel adds task_work that needs executing, in terms
      of signaling it'll generally use TWA_SIGNAL or TWA_RESUME. Those two
      directly translate to TIF_NOTIFY_SIGNAL or TIF_NOTIFY_RESUME, and can
      be used for a variety of use cases outside of task_work.
      
      However, io_cqring_wait_schedule() only tests explicitly for
      TIF_NOTIFY_SIGNAL. This means it can miss cases where task_work got added
      for the task but used a different kind of signaling mechanism (or none at
      all). Normally this doesn't matter as any task_work will be run once
      the task exits to userspace, except if:
      
      1) The ring is set up with DEFER_TASKRUN
      2) The local work item may generate normal task_work
      
      For condition 2, this can happen when closing a file and it's the final
      put of that file, for example. This can cause stalls where a task is
      waiting to make progress inside io_cqring_wait(), but there's nothing else
      that will wake it up. Hence change the "should we schedule or loop around"
      check to check for the presence of task_work explicitly, rather than just
      TIF_NOTIFY_SIGNAL as the mechanism. While in there, also change which
      type of task_work is run first, both to make it consistent with other
      task_work runs in io_uring and to better handle the case of deferred
      task_work generating normal task_work, like in the above example.
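
      Conceptually, the wait-side test changes along these lines (a
      simplified sketch, not the exact diff):

        /* before (sketch): only catches TWA_SIGNAL-style notification */
        if (test_thread_flag(TIF_NOTIFY_SIGNAL))
                return 1;       /* don't schedule, run task_work first */

        /* after (sketch): any pending task_work, however it was signaled */
        if (task_work_pending(current))
                return 1;
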
      Reported-by: Jan Hendrik Farr <kernel@jfarr.cc>
      Link: https://github.com/axboe/liburing/issues/1235
      Cc: stable@vger.kernel.org
      Fixes: 846072f1 ("io_uring: mimimise io_cqring_wait_schedule")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      04beb6e0
  3. 17 Sep, 2024 1 commit
  4. 16 Sep, 2024 3 commits
    • io_uring: clean up a type in io_uring_register_get_file() · 2f6a55e4
      Dan Carpenter authored
      Originally "fd" was unsigned int but it was changed to int when we pulled
      this code into a separate function in commit 0b6d253e
      ("io_uring/register: provide helper to get io_ring_ctx from 'fd'").  This
      doesn't really cause a runtime problem because the call to
      array_index_nospec() will clamp negative fds to 0 and nothing else uses
      the negative values.
      Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
      Link: https://lore.kernel.org/r/6f6cb630-079f-4fdf-bf95-1082e0a3fc6e@stanley.mountain
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2f6a55e4
    • io_uring/sqpoll: do not put cpumask on stack · 7f44bead
      Felix Moessbauer authored
      Putting a cpumask on the stack has been deprecated for a long time
      (since 2d3854a3), as these masks can be big. Given that, change the
      on-stack allowed_mask to be dynamically allocated instead.
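
      The conversion follows the usual cpumask_var_t pattern, roughly (a
      sketch; error handling abbreviated):

        /* before (sketch): on-stack mask, can be large with a big NR_CPUS */
        cpumask_t allowed_mask;

        /* after (sketch): dynamically allocated */
        cpumask_var_t allowed_mask;

        if (!alloc_cpumask_var(&allowed_mask, GFP_KERNEL))
                return -ENOMEM;
        /* ... fill and test the mask ... */
        free_cpumask_var(allowed_mask);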
      
      Fixes: f011c9cf ("io_uring/sqpoll: do not allow pinning outside of cpuset")
      Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
      Link: https://lore.kernel.org/r/20240916111150.1266191-1-felix.moessbauer@siemens.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7f44bead
    • io_uring/sqpoll: retain test for whether the CPU is valid · a09c1724
      Jens Axboe authored
      A recent commit ensured that SQPOLL cannot be set up with a CPU that
      isn't in the current task's cpuset, but it also dropped testing whether
      the CPU is valid in the first place. Without that, if a task passes in
      a CPU value that is too high, the following KASAN splat can get
      triggered:
      
      BUG: KASAN: stack-out-of-bounds in io_sq_offload_create+0x858/0xaa4
      Read of size 8 at addr ffff800089bc7b90 by task wq-aff.t/1391
      
      CPU: 4 UID: 1000 PID: 1391 Comm: wq-aff.t Not tainted 6.11.0-rc7-00227-g371c468f4db6 #7080
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace.part.0+0xcc/0xe0
       show_stack+0x14/0x1c
       dump_stack_lvl+0x58/0x74
       print_report+0x16c/0x4c8
       kasan_report+0x9c/0xe4
       __asan_report_load8_noabort+0x1c/0x24
       io_sq_offload_create+0x858/0xaa4
       io_uring_setup+0x1394/0x17c4
       __arm64_sys_io_uring_setup+0x6c/0x180
       invoke_syscall+0x6c/0x260
       el0_svc_common.constprop.0+0x158/0x224
       do_el0_svc+0x3c/0x5c
       el0_svc+0x34/0x70
       el0t_64_sync_handler+0x118/0x124
       el0t_64_sync+0x168/0x16c
      
      The buggy address belongs to stack of task wq-aff.t/1391
       and is located at offset 48 in frame:
       io_sq_offload_create+0x0/0xaa4
      
      This frame has 1 object:
       [32, 40) 'allowed_mask'
      
      The buggy address belongs to the virtual mapping at
       [ffff800089bc0000, ffff800089bc9000) created by:
       kernel_clone+0x124/0x7e0
      
      The buggy address belongs to the physical page:
      page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff0000d740af80 pfn:0x11740a
      memcg:ffff0000c2706f02
      flags: 0xbffe00000000000(node=0|zone=2|lastcpupid=0x1fff)
      raw: 0bffe00000000000 0000000000000000 dead000000000122 0000000000000000
      raw: ffff0000d740af80 0000000000000000 00000001ffffffff ffff0000c2706f02
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff800089bc7a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       ffff800089bc7b00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
      >ffff800089bc7b80: 00 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00
                               ^
       ffff800089bc7c00: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
       ffff800089bc7c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f3
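
      The retained check is conceptually just a range/online test on the
      requested CPU before it is looked up in the cpuset mask (a sketch; the
      exact placement in io_sq_offload_create() may differ):

        int cpu = p->sq_thread_cpu;

        if (cpu >= nr_cpu_ids || !cpu_online(cpu))
                return -EINVAL;
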
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Closes: https://lore.kernel.org/oe-lkp/202409161632.cbeeca0d-lkp@intel.com
      Fixes: f011c9cf ("io_uring/sqpoll: do not allow pinning outside of cpuset")
      Tested-by: Felix Moessbauer <felix.moessbauer@siemens.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a09c1724
  5. 15 Sep, 2024 2 commits
  6. 14 Sep, 2024 1 commit
    • io_uring: rename "copy buffers" to "clone buffers" · 636119af
      Jens Axboe authored
      A recent commit added support for copying registered buffers from one
      ring to another. But that term is a bit confusing, as no copying of
      buffer data is done here. What is being done is simply cloning the
      buffer registrations from one ring to another.
      
      Rename it while we still can, so that it's more descriptive. No
      functional changes in this patch.
      
      Fixes: 7cc2a6ea ("io_uring: add IORING_REGISTER_COPY_BUFFERS method")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      636119af
  7. 12 Sep, 2024 2 commits
    • io_uring: add IORING_REGISTER_COPY_BUFFERS method · 7cc2a6ea
      Jens Axboe authored
      Buffers can get registered with io_uring, which allows skipping the
      repeated pin_pages and unpin/unref of pages for each O_DIRECT operation. This
      reduces the overhead of O_DIRECT IO.
      
      However, registering buffers can take some time. Normally this isn't an
      issue as it's done at initialization time (and hence less critical), but
      for cases where rings can be created and destroyed as part of an IO
      thread pool, registering the same buffers for multiple rings becomes a
      more time-sensitive proposition. As an example, let's say an application
      has an IO memory pool of 500G. Initial registration takes:
      
      Got 500 huge pages (each 1024MB)
      Registered 500 pages in 409 msec
      
      or about 0.4 seconds. If we go higher to 900 1GB huge pages being
      registered:
      
      Registered 900 pages in 738 msec
      
      which is, as expected, a fully linear scaling.
      
      Rather than have each ring pin/map/register the same buffer pool,
      provide an io_uring_register(2) opcode to simply duplicate the buffers
      that are registered with another ring. Adding the same 900GB of
      registered buffers to the target ring can then be accomplished in:
      
      Copied 900 pages in 17 usec
      
      While timing differs a bit, this provides around a 25,000-40,000x
      speedup for this use case.
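
      From userspace, the new opcode is a single registration call against
      the destination ring, along these lines (a sketch; it assumes uapi
      headers new enough to carry the struct and opcode added here, which a
      later commit renames to "clone"):

        #include <linux/io_uring.h>
        #include <string.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* duplicate src_fd's registered buffers into dst_fd (sketch) */
        static int copy_registered_buffers(int dst_fd, int src_fd)
        {
                struct io_uring_copy_buffers cb;

                memset(&cb, 0, sizeof(cb));
                cb.src_fd = src_fd;
                return syscall(__NR_io_uring_register, dst_fd,
                               IORING_REGISTER_COPY_BUFFERS, &cb, 1);
        }
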
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7cc2a6ea
    • io_uring/register: provide helper to get io_ring_ctx from 'fd' · 0b6d253e
      Jens Axboe authored
      Can be done in one of two ways:
      
      1) Regular file descriptor, just fget()
      2) Registered ring, index our own table for that
      
      In preparation for adding another register use case that needs to get a
      ctx from a file descriptor, abstract out this helper and use it in the main
      register syscall as well.
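
      The helper's shape is roughly the following (a sketch of the two
      paths; details and error handling abbreviated):

        /* resolve 'fd' to the ring's struct file (sketch) */
        if (use_registered) {
                /* registered ring fd: index the task-private ring table */
                if (!tctx || fd >= IO_RINGFD_REG_MAX)
                        return ERR_PTR(-EINVAL);
                fd = array_index_nospec(fd, IO_RINGFD_REG_MAX);
                file = tctx->registered_rings[fd];
        } else {
                /* plain file descriptor: regular lookup */
                file = fget(fd);
        }
        if (!file)
                return ERR_PTR(-EBADF);
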
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      0b6d253e
  8. 11 Sep, 2024 4 commits
  9. 10 Sep, 2024 2 commits
  10. 09 Sep, 2024 1 commit
    • io_uring/sqpoll: do not allow pinning outside of cpuset · f011c9cf
      Felix Moessbauer authored
      The submit queue polling threads are userland threads that just never
      exit to userland. When creating the thread with IORING_SETUP_SQ_AFF,
      the affinity of the poller thread is set to the cpu specified in
      sq_thread_cpu. However, this CPU can be outside of the cpuset defined
      by the cgroup cpuset controller. This violates the rules defined by the
      cpuset controller and is a potential issue for realtime applications.
      
      In b7ed6d8ffd6 we fixed the default affinity of the poller thread for
      the case where no explicit pinning is requested, by inheriting the
      affinity of the creating task. In the case of explicit pinning, the
      check is more complicated, as a CPU outside of the parent's cpumask is
      also allowed. We implement this by using cpuset_cpus_allowed() (which
      has support for cgroup cpusets) and testing whether the requested CPU
      is in that set.
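
      The check itself boils down to the following (a sketch; allowed_mask
      handling is as adjusted by the follow-up fixes listed above):

        cpuset_cpus_allowed(current, allowed_mask);
        if (!cpumask_test_cpu(p->sq_thread_cpu, allowed_mask))
                return -EINVAL;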
      
      Fixes: 37d1e2e3 ("io_uring: move SQPOLL thread io-wq forked worker")
      Cc: stable@vger.kernel.org # 6.1+
      Signed-off-by: Felix Moessbauer <felix.moessbauer@siemens.com>
      Link: https://lore.kernel.org/r/20240909150036.55921-1-felix.moessbauer@siemens.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f011c9cf
  11. 08 Sep, 2024 1 commit
  12. 02 Sep, 2024 2 commits
  13. 30 Aug, 2024 1 commit
  14. 29 Aug, 2024 5 commits
    • io_uring/kbuf: add support for incremental buffer consumption · ae98dbf4
      Jens Axboe authored
      By default, any recv/read operation that uses provided buffers will
      consume at least 1 buffer fully (and maybe more, in case of bundles).
      This adds support for incremental consumption, meaning that an
      application may add large buffers, and each read/recv will just consume
      the part of the buffer that it needs.
      
      For example, let's say an application registers 1MB buffers in a
      provided buffer ring, for streaming receives. If it gets a short recv,
      then the full 1MB buffer will be consumed and passed back to the
      application. With incremental consumption, only the part that was
      actually used is consumed, and the buffer remains the current one.
      
      This means that both the application and the kernel need to keep track
      of what the current receive point is. Each recv will still pass back a
      buffer ID and the size consumed; the only difference is that before,
      the next receive would always be from the next buffer in the ring. Now
      the same buffer ID may be returned for multiple receives, each at an
      offset into that buffer from where the previous receive left off. Example:
      
      Application registers a provided buffer ring, and adds two 32K buffers
      to the ring.
      
      Buffer1 address: 0x1000000 (buffer ID 0)
      Buffer2 address: 0x2000000 (buffer ID 1)
      
      A recv completion is received with the following values:
      
      cqe->res	0x1000	(4k bytes received)
      cqe->flags	0x11	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
      
      and the application now knows that 4096b of data is available at
      0x1000000, the start of that buffer, and that more data from this buffer
      will be coming. Now the next receive comes in:
      
      cqe->res	0x2000	(8k bytes received)
      cqe->flags	0x11	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 0)
      
      which tells the application that 8k is available where the last
      completion left off, at 0x1001000. Next completion is:
      
      cqe->res	0x5000	(20k bytes received)
      cqe->flags	0x1	(CQE_F_BUFFER set, buffer ID 0)
      
      and the application now knows that 20k of data is available at
      0x1003000, which is where the previous receive ended. CQE_F_BUF_MORE
      isn't set, as no more data is available in this buffer ID. The next
      completion is then:
      
      cqe->res	0x1000	(4k bytes received)
      cqe->flags	0x10011	(CQE_F_BUFFER|CQE_F_BUF_MORE set, buffer ID 1)
      
      which tells the application that buffer ID 1 is now the current one,
      hence there's 4k of valid data at 0x2000000. 0x2001000 will be the next
      receive point for this buffer ID.
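
      In application terms, the bookkeeping amounts to keeping a per-buffer
      offset and advancing it while IORING_CQE_F_BUF_MORE is set, roughly (a
      sketch; buf_base[], buf_offset[], and consume() are hypothetical
      application state and helpers, not liburing API):

        unsigned buf_id = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
        char *data = buf_base[buf_id] + buf_offset[buf_id];

        consume(data, cqe->res);                /* hypothetical handler */

        if (cqe->flags & IORING_CQE_F_BUF_MORE)
                buf_offset[buf_id] += cqe->res; /* buffer stays current */
        else
                buf_offset[buf_id] = 0;         /* buffer fully consumed */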
      
      When a buffer will be reused by future CQE completions,
      IORING_CQE_F_BUF_MORE will be set in cqe->flags. This tells the
      application that the kernel isn't done with the buffer yet, and that it
      should expect more completions for this buffer ID. It will only be set
      by provided buffer rings set up with IOU_PBUF_RING_INC, as that's the
      only type of buffer that will see multiple consecutive completions for
      the same buffer ID.
      For any other provided buffer type, any completion that passes back
      a buffer to the application is final.
      
      Once a buffer has been fully consumed, the buffer ring head is
      incremented and the next receive will indicate the next buffer ID in the
      CQE cflags.
      
      On the send side, the application can manage how much data is sent from
      an existing buffer by setting sqe->len to the desired send length.
      
      An application can request incremental consumption by setting
      IOU_PBUF_RING_INC in the provided buffer ring registration. Outside of
      that, provided buffer ring setup and buffer additions are done like
      before, with no changes there. The only change is in how an application may
      see multiple completions for the same buffer ID, hence needing to know
      where the next receive will happen.
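
      With liburing, opting in would look something like this (a sketch; it
      assumes headers new enough to define IOU_PBUF_RING_INC, and 'br' is a
      buffer ring the application has already mapped and will fill):

        struct io_uring_buf_reg reg = {
                .ring_addr      = (unsigned long) br,
                .ring_entries   = 2,
                .bgid           = 0,
                .flags          = IOU_PBUF_RING_INC,    /* incremental mode */
        };

        ret = io_uring_register_buf_ring(&ring, &reg, 0);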
      
      Note that like existing provided buffer rings, this should not be used
      with IOSQE_ASYNC, as both really require the ring to remain locked over
      the duration of the buffer selection and the operation completion.
      Otherwise a buffer will be consumed in full regardless of the size of
      the IO done.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ae98dbf4
    • io_uring/kbuf: pass in 'len' argument for buffer commit · 6733e678
      Jens Axboe authored
      In preparation for needing the consumed length, pass in the length being
      completed. Unused right now, but will be used when it is possible to
      partially consume a buffer.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6733e678
    • Revert "io_uring: Require zeroed sqe->len on provided-buffers send" · 641a6816
      Jens Axboe authored
      This reverts commit 79996b45.
      
      Revert the change that required sqe->len to be zero for a send with a
      provided buffer, which forced such a send to always consume the whole
      buffer. Allowing a non-zero length is strictly needed for partial
      consumption, as the send may very well cover only a subset of the
      current buffer. In fact, that's the intended use case.
      
      For non-incremental provided buffer rings, an application should set
      sqe->len carefully to avoid the potential issue described in the
      reverted commit. It is recommended that '0' still be set for len in
      that case, if the application insists on maintaining more than one send
      inflight for the same socket, which is somewhat of a nonsensical thing
      to do.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      641a6816
    • io_uring/kbuf: move io_ring_head_to_buf() to kbuf.h · 2c8fa70b
      Jens Axboe authored
      In preparation for using this helper in kbuf.h as well, move it there and
      turn it into a macro.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2c8fa70b
    • io_uring/kbuf: add io_kbuf_commit() helper · ecd5c9b2
      Jens Axboe authored
      Committing the selected ring buffer is currently done in three different
      spots. Combine it into a helper and just call that.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ecd5c9b2
  15. 25 Aug, 2024 12 commits