1. 27 Feb, 2024 4 commits
      io_uring/kbuf: flag request if buffer pool is empty after buffer pick · c3f9109d
      Jens Axboe authored
      Normally we do an extra roundtrip for retries even if the buffer pool is
      depleted, as we don't check that upfront. Rather than add this check, have
      the buffer selection methods mark the request with REQ_F_BL_EMPTY if the
      used buffer group is out of buffers after this selection. This is very
      cheap to do once we're all the way inside there anyway, and it gives the
      caller a chance to make better decisions on how to proceed.
      
      For example, recv/recvmsg multishot could check this flag when it
      decides whether to keep receiving or not.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
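      To illustrate the idea, here is a rough, userspace-compilable sketch of
      how a caller like multishot recv might consult the flag. Only the
      REQ_F_BL_EMPTY name comes from this commit; the types and helpers below
      are simplified stand-ins, not the kernel's struct io_kiocb or its buffer
      selection code.

      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified stand-in for the request flag this commit introduces. */
      #define REQ_F_BL_EMPTY  (1U << 0)

      struct fake_req {
              unsigned int flags;
      };

      /* Buffer selection marks the request when it hands out the last buffer
       * in the group, so the caller learns this without a second lookup. */
      static void pick_buffer(struct fake_req *req, unsigned int left_after_pick)
      {
              if (left_after_pick == 0)
                      req->flags |= REQ_F_BL_EMPTY;
      }

      /* A multishot receive can then decide not to re-arm itself when the
       * pool is dry, instead of retrying and only then finding it empty. */
      static bool keep_receiving(const struct fake_req *req)
      {
              return !(req->flags & REQ_F_BL_EMPTY);
      }

      int main(void)
      {
              struct fake_req req = { .flags = 0 };

              pick_buffer(&req, 0);   /* pretend we just took the last buffer */
              printf("keep receiving: %s\n", keep_receiving(&req) ? "yes" : "no");
              return 0;
      }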
      io_uring/net: improve the usercopy for sendmsg/recvmsg · 792060de
      Jens Axboe authored
      We're spending a considerable amount of the sendmsg/recvmsg time just
      copying in the message header and, for provided buffers, the known
      single-entry iovec.
      
      Be a bit smarter about it and enable/disable user access around our
      copying. In a test case that does both sendmsg and recvmsg, the
      runtimes before and after this change (averaged over multiple runs,
      with very stable times) are:
      
      Kernel		Time		Diff
      ====================================
      -git		4720 usec
      -git+commit	4311 usec	-8.7%
      
      and looking at a profile diff, we see the following:
      
      0.25%     +9.33%  [kernel.kallsyms]     [k] _copy_from_user
      4.47%     -3.32%  [kernel.kallsyms]     [k] __io_msg_copy_hdr.constprop.0
      
      where we drop more than 9% of the _copy_from_user() time and consequently
      add time to __io_msg_copy_hdr(), where the copies are now attributed,
      for a net win of 6%.
      
      In comparison, the same test case with send/recv runs in 3745 usec, which
      is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is
      now only ~13% slower, where it was ~21% slower before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
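      The saving comes from batching the user access overhead: open the user
      access window once, copy the individual fields with the cheap unsafe
      variants, then close it, instead of paying the full copy_from_user()
      cost per field. The sketch below is userspace-compilable; the mock_*
      helpers stand in for the kernel's user_access_begin() /
      unsafe_get_user() / user_access_end(), and the struct is an illustrative
      subset rather than the real user_msghdr.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      /* Mocks: in the kernel, user_access_begin() validates the range and
       * opens the access window once; unsafe_get_user() is then a raw load. */
      static bool mock_user_access_begin(const void *uptr, size_t len)
      {
              (void)len;
              return uptr != NULL;
      }
      #define mock_unsafe_get_user(dst, uptr) ((dst) = *(uptr))
      static void mock_user_access_end(void) { }

      struct hdr {
              void *name;
              int namelen;
              size_t iovlen;
      };

      /* Copy all header fields inside a single open/close of the window. */
      static int copy_hdr(struct hdr *dst, const struct hdr *umsg)
      {
              if (!mock_user_access_begin(umsg, sizeof(*umsg)))
                      return -1;      /* the kernel would return -EFAULT */
              mock_unsafe_get_user(dst->name, &umsg->name);
              mock_unsafe_get_user(dst->namelen, &umsg->namelen);
              mock_unsafe_get_user(dst->iovlen, &umsg->iovlen);
              mock_user_access_end();
              return 0;
      }

      int main(void)
      {
              struct hdr user_hdr = { .name = NULL, .namelen = 0, .iovlen = 1 };
              struct hdr kernel_hdr;

              if (copy_hdr(&kernel_hdr, &user_hdr) == 0)
                      printf("copied header, iovlen=%zu\n", kernel_hdr.iovlen);
              return 0;
      }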
      io_uring/net: move receive multishot out of the generic msghdr path · c5597802
      Jens Axboe authored
      Move the actual user_msghdr / compat_msghdr copying into the send and
      receive sides, respectively, so we can move the uaddr receive handling
      into its own handler, and do the same for the multishot with buffer
      selection logic.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      io_uring/net: unify how recvmsg and sendmsg copy in the msghdr · 52307ac4
      Jens Axboe authored
      For recvmsg, we roll our own since we support buffer selection. This
      isn't the case for sendmsg right now, but in preparation for doing so,
      make the recvmsg copy helpers generic so we can call them from the
      sendmsg side as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 15 Feb, 2024 2 commits
      io_uring/napi: enable even with a timeout of 0 · b4ccc4dd
      Jens Axboe authored
      1 usec is not as short as it used to be, and it makes sense to allow 0
      for a busy poll timeout - this means just doing one loop to check if we
      have anything available. Add a separate ->napi_enabled flag to track
      whether napi has been enabled or not.
      
      While at it, move the writing of the ctx napi values after we've copied
      the old values back to userspace. This ensures that if the call fails,
      we'll be in the same state as we were before, rather than some
      indeterminate state.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
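      The second half of this change is the usual "report old state before
      committing new state" pattern, so that a failed copy back to userspace
      cannot leave the context half-updated. A simplified sketch; the struct
      layout, register_napi() and copy_out() below are illustrative stand-ins
      for the kernel's registration path and copy_to_user(), not its actual
      code.

      #include <errno.h>
      #include <stdbool.h>
      #include <string.h>

      struct napi_ctx {
              unsigned int busy_poll_to;      /* 0 usec is now a valid timeout */
              bool prefer_busy_poll;
              bool napi_enabled;              /* explicit flag, not implied by timeout != 0 */
      };

      struct napi_args {
              unsigned int busy_poll_to;
              unsigned char prefer_busy_poll;
      };

      /* Stand-in for copy_to_user(); returns 0 on success. */
      static int copy_out(struct napi_args *uarg, const struct napi_args *src)
      {
              memcpy(uarg, src, sizeof(*src));
              return 0;
      }

      /* Copy the old values back to the caller first, and only then write the
       * new ones, so a copy failure leaves the context exactly as it was. */
      static int register_napi(struct napi_ctx *ctx, struct napi_args *uarg,
                               const struct napi_args *new_vals)
      {
              struct napi_args old = {
                      .busy_poll_to = ctx->busy_poll_to,
                      .prefer_busy_poll = ctx->prefer_busy_poll,
              };

              if (copy_out(uarg, &old))
                      return -EFAULT;

              ctx->busy_poll_to = new_vals->busy_poll_to;
              ctx->prefer_busy_poll = new_vals->prefer_busy_poll;
              ctx->napi_enabled = true;       /* enabled even with busy_poll_to == 0 */
              return 0;
      }

      int main(void)
      {
              struct napi_ctx ctx = { .busy_poll_to = 50 };
              struct napi_args uarg, new_vals = { .busy_poll_to = 0, .prefer_busy_poll = 1 };

              return register_napi(&ctx, &uarg, &new_vals) ? 1 : 0;
      }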
      io_uring: kill stale comment for io_cqring_overflow_kill() · 871760eb
      Jens Axboe authored
      This function now deals only with discarding overflow entries on ring
      free and exit, and it no longer returns whether we successfully flushed
      all entries as there's no CQE posting involved anymore. Kill the
      outdated comment.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 14 Feb, 2024 3 commits
      io_uring/sqpoll: use the correct check for pending task_work · c8d8fc3b
      Jens Axboe authored
      A previous commit moved to using just the private task_work list for
      SQPOLL, but it neglected to update the check for whether we have
      pending task_work. Normally this is fine as we'll attempt to run it
      unconditionally, but if we race with going to sleep AND task_work
      being added, then we certainly need the right check here.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
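      In spirit, the fix is that the "anything pending?" check the SQPOLL
      thread consults before sleeping must also cover its private task_work
      list. A simplified, userspace-compilable sketch; the structures and the
      sq_tw_pending() name below are illustrative stand-ins, not the kernel's
      llist-based implementation.

      #include <stdbool.h>
      #include <stddef.h>

      struct work_node {
              struct work_node *next;
      };

      struct sq_data {
              struct work_node *private_list; /* SQPOLL's private task_work */
      };

      struct task_ctx {
              bool generic_tw_pending;        /* stand-in for the generic pending check */
      };

      /* Before the SQPOLL thread decides to sleep, the check must look at the
       * private list as well; otherwise a racing "queue work, then sleep"
       * sequence can strand the queued work until some later wakeup. */
      static bool sq_tw_pending(const struct sq_data *sqd, const struct task_ctx *tctx)
      {
              return sqd->private_list != NULL || tctx->generic_tw_pending;
      }

      int main(void)
      {
              struct work_node item = { .next = NULL };
              struct sq_data sqd = { .private_list = &item };
              struct task_ctx tctx = { .generic_tw_pending = false };

              /* The generic check alone would wrongly report nothing to do. */
              return sq_tw_pending(&sqd, &tctx) ? 0 : 1;
      }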
      io_uring: wake SQPOLL task when task_work is added to an empty queue · 78f9b61b
      Jens Axboe authored
      If there's no current work on the list, we still need to potentially
      wake the SQPOLL task if it is sleeping. This is ordered with the
      wait queue addition in sqpoll, which adds to the wait queue before
      checking for pending work items.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
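      The underlying pattern: the producer adds the work item first and, if
      the queue was empty before the add, wakes the consumer, while the
      consumer adds itself to the wait queue before it checks for pending
      work, so one side always observes the other. A single-threaded,
      illustrative sketch; in the kernel the queue is a lock-free llist whose
      llist_add() reports whether the list was previously empty.

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdio.h>

      struct work_node {
              struct work_node *next;
      };

      struct work_queue {
              struct work_node *first;
      };

      /* Returns true if the queue was empty before this insertion. */
      static bool queue_add(struct work_queue *q, struct work_node *node)
      {
              bool was_empty = (q->first == NULL);

              node->next = q->first;
              q->first = node;
              return was_empty;
      }

      /* Hypothetical wake hook standing in for waking the SQPOLL thread. */
      static void wake_sqpoll_thread(void)
      {
              printf("waking SQPOLL thread\n");
      }

      /* Even with no work previously queued, the consumer may already be (or
       * be about to go) asleep, so the first item added must wake it. */
      static void add_task_work(struct work_queue *q, struct work_node *node)
      {
              if (queue_add(q, node))
                      wake_sqpoll_thread();
      }

      int main(void)
      {
              struct work_queue q = { .first = NULL };
              struct work_node n = { .next = NULL };

              add_task_work(&q, &n);  /* first item on an empty queue -> wake */
              return 0;
      }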
      io_uring/napi: ensure napi polling is aborted when work is available · 428f1382
      Jens Axboe authored
      While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and
      stalls in packet delivery. Turns out that while
      io_napi_busy_loop_should_end() aborts appropriately on regular
      task_work, it does not abort if we have local task_work pending.
      
      Move io_has_work() into the private io_uring.h header, and gate whether
      we should continue polling on that as well. This makes NAPI polling on
      send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well.
      
      Fixes: 8d0c12a8 ("io-uring: add napi busy poll support")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
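      Roughly, the busy poll loop's termination check now also covers the
      local (DEFER_TASKRUN) work list. The sketch below is illustrative and
      userspace-compilable; the state struct and helper names are stand-ins
      for the kernel's io_has_work() / io_napi_busy_loop_should_end(), not
      their actual implementations.

      #include <stdbool.h>

      struct poll_state {
              bool signal_pending;            /* stand-in for a pending signal */
              bool regular_task_work;         /* generic task_work queued */
              bool local_task_work;           /* DEFER_TASKRUN work pending */
              bool cq_has_overflow;
              unsigned long long now_ns, end_ns;
      };

      /* Stand-in for io_has_work(): completion-side work to process,
       * including the local task_work list this commit adds to the check. */
      static bool has_work(const struct poll_state *s)
      {
              return s->cq_has_overflow || s->local_task_work;
      }

      /* Busy polling should stop as soon as anything actionable shows up;
       * before the fix, pending local task_work did not end the loop, which
       * stalled packet delivery with IORING_SETUP_DEFER_TASKRUN. */
      static bool busy_loop_should_end(const struct poll_state *s)
      {
              return s->signal_pending ||
                     s->regular_task_work ||
                     has_work(s) ||
                     s->now_ns >= s->end_ns;
      }

      int main(void)
      {
              struct poll_state s = {
                      .local_task_work = true,        /* only local work is pending */
                      .now_ns = 0, .end_ns = 1000,
              };

              return busy_loop_should_end(&s) ? 0 : 1;
      }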
  4. 13 Feb, 2024 1 commit
  5. 09 Feb, 2024 9 commits
  6. 08 Feb, 2024 18 commits
  7. 07 Feb, 2024 1 commit
  8. 04 Feb, 2024 2 commits
      Linux 6.8-rc3 · 54be6c6c
      Linus Torvalds authored
      Merge tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 · 3f24fcda
      Linus Torvalds authored
      Pull ext4 fixes from Ted Ts'o:
       "Miscellaneous bug fixes and cleanups in ext4's multi-block allocator
        and extent handling code"
      
      * tag 'for-linus-6.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (23 commits)
        ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
        ext4: make ext4_map_blocks() distinguish delalloc only extent
        ext4: add a hole extent entry in cache after punch
        ext4: correct the hole length returned by ext4_map_blocks()
        ext4: convert to exclusive lock while inserting delalloc extents
        ext4: refactor ext4_da_map_blocks()
        ext4: remove 'needed' in trace_ext4_discard_preallocations
        ext4: remove unnecessary parameter "needed" in ext4_discard_preallocations
        ext4: remove unused return value of ext4_mb_release_group_pa
        ext4: remove unused return value of ext4_mb_release_inode_pa
        ext4: remove unused return value of ext4_mb_release
        ext4: remove unused ext4_allocation_context::ac_groups_considered
        ext4: remove unneeded return value of ext4_mb_release_context
        ext4: remove unused parameter ngroup in ext4_mb_choose_next_group_*()
        ext4: remove unused return value of __mb_check_buddy
        ext4: mark the group block bitmap as corrupted before reporting an error
        ext4: avoid allocating blocks from corrupted group in ext4_mb_find_by_goal()
        ext4: avoid allocating blocks from corrupted group in ext4_mb_try_best_found()
        ext4: avoid dividing by 0 in mb_update_avg_fragment_size() when block bitmap corrupt
        ext4: avoid bb_free and bb_fragments inconsistency in mb_free_blocks()
        ...