1. 09 Mar, 2024 1 commit
    • io_uring: Fix sqpoll utilization check racing with dying sqpoll · 606559dc
      Gabriel Krisman Bertazi authored
      Commit 3fcb9d17 ("io_uring/sqpoll: statistics of the true
      utilization of sq threads"), currently in Jens' for-next branch, peeks at
      io_sq_data->thread to report utilization statistics. But if
      io_uring_show_fdinfo races with sqpoll terminating, sqd->thread might be
      NULL even though we hold the ctx lock, and we hit the Oops below.
      
      Note that we could technically just protect the getrusage() call and the
      sq total/work time calculations. But showing some sq information
      (pid/cpu) while omitting other information (utilization) is more
      confusing than not reporting anything at all, IMO. So let's hide it all
      if we happen to race with a dying sqpoll.
      
      This can be triggered consistently in my vm setup running
      sqpoll-cancel-hang.t in a loop.
      
      BUG: kernel NULL pointer dereference, address: 00000000000007b0
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17-dirty #69
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
      RIP: 0010:getrusage+0x21/0x3e0
      Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
      RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
      RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
      RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
      R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
      R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
      FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body+0x1a/0x60
       ? page_fault_oops+0x154/0x440
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? do_user_addr_fault+0x174/0x7c0
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? exc_page_fault+0x63/0x140
       ? asm_exc_page_fault+0x22/0x30
       ? getrusage+0x21/0x3e0
       ? seq_printf+0x4e/0x70
       io_uring_show_fdinfo+0x9db/0xa10
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? vsnprintf+0x101/0x4d0
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? seq_vprintf+0x34/0x50
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? seq_printf+0x4e/0x70
       ? seq_show+0x16b/0x1d0
       ? __pfx_io_uring_show_fdinfo+0x10/0x10
       seq_show+0x16b/0x1d0
       seq_read_iter+0xd7/0x440
       seq_read+0x102/0x140
       vfs_read+0xae/0x320
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? __do_sys_newfstat+0x35/0x60
       ksys_read+0xa5/0xe0
       do_syscall_64+0x50/0x110
       entry_SYSCALL_64_after_hwframe+0x6e/0x76
      RIP: 0033:0x7f786ec1db4d
      Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
      RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d
      RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006
      RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60
      R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628
       </TASK>
      Modules linked in:
      CR2: 00000000000007b0
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:getrusage+0x21/0x3e0
      Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
      RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
      RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
      RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
      R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
      R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
      FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
      PKRU: 55555554
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      
      Fixes: 3fcb9d17 ("io_uring/sqpoll: statistics of the true utilization of sq threads")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      606559dc
  2. 08 Mar, 2024 8 commits
  3. 07 Mar, 2024 3 commits
  4. 04 Mar, 2024 2 commits
  5. 01 Mar, 2024 2 commits
    • io_uring/sqpoll: statistics of the true utilization of sq threads · 3fcb9d17
      Xiaobing Li authored
      Count the running time and actual IO processing time of the sqpoll
      thread, and output the statistical data to fdinfo.
      
      Variable description:
      "work_time" in the code represents the sum of the jiffies of the sq
      thread actually processing IO, that is, how many milliseconds it
      actually takes to process IO. "total_time" represents the total time
      that the sq thread has elapsed from the beginning of the loop to the
      current time point, that is, how many milliseconds it has spent in
      total.
      
      The test tool is fio, and its parameters are as follows:
      [global]
      ioengine=io_uring
      direct=1
      group_reporting
      bs=128k
      norandommap=1
      randrepeat=0
      refill_buffers
      ramp_time=30s
      time_based
      runtime=1m
      clocksource=clock_gettime
      overwrite=1
      log_avg_msec=1000
      numjobs=1
      
      [disk0]
      filename=/dev/nvme0n1
      rw=read
      iodepth=16
      hipri
      sqthread_poll=1
      
      The test results are as follows:
      Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
      SqMask: 0x3
      SqHead: 3197153
      SqTail: 3197153
      CachedSqHead:   3197153
      SqThread:       9231
      SqThreadCpu:    11
      SqTotalTime:    18099614
      SqWorkTime:     16748316
      
      The test results corresponding to different iodepths are as follows:
      |-----------|-------|-------|-------|------|-------|
      |   iodepth |   1   |   4   |   8   |  16  |  64   |
      |-----------|-------|-------|-------|------|-------|
      |utilization| 2.9%  | 8.8%  | 10.9% | 92.9%| 84.4% |
      |-----------|-------|-------|-------|------|-------|
      |    idle   | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% |
      |-----------|-------|-------|-------|------|-------|
      Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com>
      Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3fcb9d17
    • io_uring/net: move recv/recvmsg flags out of retry loop · eb18c29d
      Jens Axboe authored
      The flags don't change, so just initialize them once rather than on
      every loop iteration for multishot.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb18c29d
  6. 27 Feb, 2024 4 commits
    • io_uring/kbuf: flag request if buffer pool is empty after buffer pick · c3f9109d
      Jens Axboe authored
      Normally we do an extra roundtrip for retries even if the buffer pool has
      depleted, as we don't check that upfront. Rather than add this check, have
      the buffer selection methods mark the request with REQ_F_BL_EMPTY if the
      used buffer group is out of buffers after this selection. This is very
      cheap to do once we're all the way inside there anyway, and it gives the
      caller a chance to make better decisions on how to proceed.
      
      For example, recv/recvmsg multishot could check this flag when it
      decides whether to keep receiving or not.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c3f9109d
    • io_uring/net: improve the usercopy for sendmsg/recvmsg · 792060de
      Jens Axboe authored
      We're spending a considerable amount of the sendmsg/recvmsg time just
      copying in the message header, and, for provided buffers, the known
      single-entry iovec.
      
      Be a bit smarter about it and enable/disable user access around our
      copying. In a test case that does both sendmsg and recvmsg, the
      runtimes before and after this change (averaged over multiple runs,
      with very stable times) were:
      
      Kernel		Time		Diff
      ====================================
      -git		4720 usec
      -git+commit	4311 usec	-8.7%
      
      and looking at a profile diff, we see the following:
      
      0.25%     +9.33%  [kernel.kallsyms]     [k] _copy_from_user
      4.47%     -3.32%  [kernel.kallsyms]     [k] __io_msg_copy_hdr.constprop.0
      
      where we drop more than 9% of _copy_from_user() time, and consequently
      add time to __io_msg_copy_hdr() where the copies are now attributed to,
      but with a net win of 6%.
      
      In comparison, the same test case with send/recv runs in 3745 usec, which
      is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is
      now only ~13% slower, where it was ~21% slower before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      792060de
    • io_uring/net: move receive multishot out of the generic msghdr path · c5597802
      Jens Axboe authored
      Move the actual user_msghdr / compat_msghdr into the send and receive
      sides, respectively, so we can move the uaddr receive handling into its
      own handler, and ditto the multishot with buffer selection logic.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c5597802
    • io_uring/net: unify how recvmsg and sendmsg copy in the msghdr · 52307ac4
      Jens Axboe authored
      For recvmsg, we roll our own since we support buffer selections. This
      isn't the case for sendmsg right now, but in preparation for doing so,
      make the recvmsg copy helpers generic so we can call them from the
      sendmsg side as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      52307ac4
  7. 15 Feb, 2024 2 commits
    • io_uring/napi: enable even with a timeout of 0 · b4ccc4dd
      Jens Axboe authored
      1 usec is not as short as it used to be, and it makes sense to allow 0
      for a busy poll timeout - this means just doing one loop to check if we
      have anything available. Add a separate ->napi_enabled field to check
      whether napi has been enabled or not.
      
      While at it, move the writing of the ctx napi values after we've copied
      the old values back to userspace. This ensures that if the call fails,
      we'll be in the same state as we were before, rather than some
      indeterminate state.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b4ccc4dd
    • io_uring: kill stale comment for io_cqring_overflow_kill() · 871760eb
      Jens Axboe authored
      This function now deals only with discarding overflow entries on ring
      free and exit, and it no longer returns whether we successfully flushed
      all entries as there's no CQE posting involved anymore. Kill the
      outdated comment.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      871760eb
  8. 14 Feb, 2024 3 commits
    • io_uring/sqpoll: use the correct check for pending task_work · c8d8fc3b
      Jens Axboe authored
      A previous commit moved to using just the private task_work list for
      SQPOLL, but it neglected to update the check for whether we have
      pending task_work. Normally this is fine as we'll attempt to run it
      unconditionally, but if we race with going to sleep AND task_work
      being added, then we certainly need the right check here.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c8d8fc3b
    • io_uring: wake SQPOLL task when task_work is added to an empty queue · 78f9b61b
      Jens Axboe authored
      If there's no current work on the list, we still need to potentially
      wake the SQPOLL task if it is sleeping. This is ordered with the
      wait queue addition in sqpoll, which adds to the wait queue before
      checking for pending work items.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      78f9b61b
    • io_uring/napi: ensure napi polling is aborted when work is available · 428f1382
      Jens Axboe authored
      While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and
      stalls in packet delivery. It turns out that while
      io_napi_busy_loop_should_end() aborts appropriately on regular
      task_work, it does not abort if we have local task_work pending.
      
      Move io_has_work() into the private io_uring.h header, and gate whether
      we should continue polling on that as well. This makes NAPI polling on
      send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well.
      
      Fixes: 8d0c12a8 ("io-uring: add napi busy poll support")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      428f1382
  9. 13 Feb, 2024 1 commit
  10. 09 Feb, 2024 9 commits
  11. 08 Feb, 2024 5 commits