Commits · 8fcf4c48f44bd7b1b75db139f56ff1ad6477379e · Kirill Smelkov / linux

25 Jul, 2022 40 commits

io_uring: replace zero-length array with flexible-array member · 8fcf4c48

Gustavo A. R. Silva authored Jun 28, 2022

There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8fcf4c48
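As a minimal user-space sketch of the flexible-array-member pattern described in the commit above: `struct blob` and `blob_new()` are hypothetical names invented for illustration, not io_uring code. In-kernel code would typically compute the allocation size with the `struct_size()` helper rather than by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for illustration; not an io_uring structure. */
struct blob {
	size_t len;
	unsigned char data[];	/* flexible array member, must come last */
};

/* Allocate a blob with room for len trailing bytes and copy src into it. */
static struct blob *blob_new(const void *src, size_t len)
{
	struct blob *b;

	assert(src != NULL || len == 0);
	/* Reject sizes whose header + payload sum would overflow size_t. */
	if (len > SIZE_MAX - sizeof(*b))
		return NULL;
	b = malloc(sizeof(*b) + len);
	if (b == NULL)
		return NULL;
	b->len = len;
	if (len != 0)
		memcpy(b->data, src, len);
	return b;
}
```

Unlike a zero-length or one-element trailing array, `data[]` has no declared size, so the compiler can flag code that places it anywhere but last in the structure.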

io_uring: remove ctx->refs pinning on enter · fbb8bb02

Pavel Begunkov authored Jun 25, 2022

io_uring_enter() takes ctx->refs, which was previously preventing racing
with register quiesce. However, as register now doesn't touch the refs,
we can freely kill extra ctx pinning and rely on the fact that we're
holding a file reference preventing the ring from being destroyed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11c57ad33a1be53541fce90669c1b79cf4d8940.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fbb8bb02

io_uring: don't check file ops of registered rings · 3273c440

Pavel Begunkov authored Jun 25, 2022

Registered rings are by definition io_uring files, so we don't need to
additionally verify them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/425cd64fd885b8e329a46c205ee811987691baaf.1656153286.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3273c440

io_uring: remove extra TIF_NOTIFY_SIGNAL check · ad8b261d

Pavel Begunkov authored Jun 25, 2022

io_run_task_work() accounts for TIF_NOTIFY_SIGNAL, so there is no need to
have a second check in io_run_task_work_sig().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/52ce41a592ad904511697f432141e5690fd4b968.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ad8b261d

io_uring: fuse fallback_node and normal tw node · 3218e5d3

Pavel Begunkov authored Jun 25, 2022

Now that both normal and fallback paths use llist, just keep one node head
in struct io_task_work and kill off ->fallback_node.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d04ebde409f7b162fe247b361b4486b193293e46.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3218e5d3

io_uring: improve io_fail_links() · 37c7bd31

Pavel Begunkov authored Jun 25, 2022

io_fail_links() is called with ->completion_lock held and for that
reason we'd want to keep it as small as we can. Instead of doing
__io_req_complete_post() for each linked request under the lock, fail
them in a task_work handler under ->uring_lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a2f68708b970a21f4e84ddfa7b3abd67a8fffb27.1656153285.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

37c7bd31

io_uring: move POLLFREE handling to separate function · fe991a76

Jens Axboe authored Jun 21, 2022

We really don't care about this at all in terms of performance. Outside
of having it already be marked unlikely(), shove it into a separate
__cold function.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fe991a76

io_uring: kbuf: inline io_kbuf_recycle_ring() · 795bbbc8

Hao Xu authored Jun 23, 2022

Make io_kbuf_recycle_ring() inline since it is the fast path for
provided buffers.
Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220623130126.179232-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>

795bbbc8

io_uring: optimise submission side poll_refs · 49f1c68e

Pavel Begunkov authored Jun 23, 2022

The final poll_refs put in __io_arm_poll_handler() takes quite some
cycles. When we're arming from the original task context task_work won't
be run, so in this case we can assume that we won't race with task_works
and so not take the initial ownership ref.

One caveat is that after arming a poll we may race with it, so we have
to add a bunch of io_poll_get_ownership() hidden inside of
io_poll_can_finish_inline() whenever we want to complete arming inline.
For the same reason we can't just set REQ_F_DOUBLE_POLL in
__io_queue_proc() and so need to sync with the first poll entry by
taking its wq head lock.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8825315d7f5e182ac1578a031e546f79b1c97d01.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

49f1c68e

io_uring: refactor poll arm error handling · de08356f

Pavel Begunkov authored Jun 23, 2022

__io_arm_poll_handler() error parsing is a horror: on failure it
returns 0 and the caller is expected to look at ipt.error, which has
already led to a number of problems.

When it returns a valid mask, leave it as it is now, i.e. return 1 and
store the mask in ipt.result_mask. In case of a failure that can be
handled inline, return an error code (negative value), and return 0 if
__io_arm_poll_handler() took ownership of the request and will complete
it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/018cacdaef5fe95d7dc56b32e85d752cab7607f6.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

de08356f

io_uring: change arm poll return values · 063a0079

Pavel Begunkov authored Jun 23, 2022

The rules for __io_arm_poll_handler()'s result parsing are complicated.
As a first step, don't return a mask; instead pass back a positive
return code and fill ipt->result_mask.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/529e29e9f97f2e6e383ccd44234d8b576a83a921.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

063a0079

io_uring: add a helper for apoll alloc · 5204aa8c

Pavel Begunkov authored Jun 23, 2022

Extract a helper function for apoll allocation, which makes the code
easier to read.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2f93282b47dd678e805dd0d7097f66968ced495c.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5204aa8c

io_uring: remove events caching atavisms · 13a99017

Pavel Begunkov authored Jun 23, 2022

Remove events argument from *io_poll_execute(), it's not needed and not
used.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/12efd4e15c6a90cf9e5b59807cfcb57852b51dc7.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

13a99017

io_uring: clean poll ->private flagging · 0638cd7b

Pavel Begunkov authored Jun 23, 2022

We store a req pointer in wqe->private but also take one bit to mark
double poll entries. Replace macro helpers with inline functions for
better type checking and also name the double flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9a61240555c64ac0b7a9b0eb59a9efeb638a35a4.1655990418.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

0638cd7b
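The idea in the commit above, a pointer that borrows one low bit as a flag and is accessed through typed inline helpers instead of macros, might look like this in a user-space sketch. All names here (`WQE_F_DOUBLE_SKETCH`, `wqe_private_pack()`, and so on) are hypothetical and do not reproduce the kernel's helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the request type; any type with alignment > 1 works. */
struct request_sketch {
	int id;
};

/* Named flag stored in bit 0 of the pointer value (hypothetical name). */
#define WQE_F_DOUBLE_SKETCH ((uintptr_t)1)

/* Pack a request pointer and the "double poll entry" flag into one word. */
static inline void *wqe_private_pack(struct request_sketch *req, bool is_double)
{
	uintptr_t v = (uintptr_t)req;

	/* Alignment must leave bit 0 free, otherwise the flag corrupts the pointer. */
	assert((v & WQE_F_DOUBLE_SKETCH) == 0);
	return (void *)(v | (is_double ? WQE_F_DOUBLE_SKETCH : 0));
}

/* Strip the flag bit to recover the original request pointer. */
static inline struct request_sketch *wqe_get_req(void *priv)
{
	return (struct request_sketch *)((uintptr_t)priv & ~WQE_F_DOUBLE_SKETCH);
}

/* Report whether the entry was marked as a double poll entry. */
static inline bool wqe_is_double(void *priv)
{
	return ((uintptr_t)priv & WQE_F_DOUBLE_SKETCH) != 0;
}
```

Compared with macros, the inline functions give the compiler argument and return types to check, which is the benefit the commit message names.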

io_uring: add sync cancelation API through io_uring_register() · 78a861b9

Jens Axboe authored Jun 18, 2022

The io_uring cancelation API is async, like any other API that we expose
there. For the case of finding a request to cancel, or not finding one,
it is fully sync in that when submission returns, the CQE for both the
cancelation request and the targeted request have been posted to the
CQ ring.

However, if the targeted work is being executed by io-wq, the API can
only start the act of canceling it. This makes it difficult to use in
some circumstances, as the caller then has to wait for the CQEs to come
in and match on the same cancelation data there.

Provide an IORING_REGISTER_SYNC_CANCEL command for io_uring_register()
that does sync cancelations, always. For the io-wq case, it'll wait
for the cancelation to come in before returning. The only expected
returns from this API are:

0		Request found and canceled fine.
> 0		Requests found and canceled. Only happens if asked to
		cancel multiple requests, and if the work wasn't in
		progress.
-ENOENT		Request not found.
-ETIME		A timeout on the operation was requested, but the timeout
		expired before we could cancel.

and we won't get -EALREADY via this API.

If the timeout value passed in is -1 (tv_sec and tv_nsec), then that
means that no timeout is requested. Otherwise, the timespec passed in
is the amount of time the sync cancel will wait for a successful
cancelation.

Link: https://github.com/axboe/liburing/discussions/608
Signed-off-by: Jens Axboe <axboe@kernel.dk>

78a861b9

io_uring: add IORING_ASYNC_CANCEL_FD_FIXED cancel flag · 7d8ca725

Jens Axboe authored Jun 18, 2022

In preparation for not having a request to pass in that carries this
state, add a separate cancelation flag that allows the caller to ask
for a fixed file for cancelation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7d8ca725

io_uring: have cancelation API accept io_uring_task directly · 88f52eaa

Jens Axboe authored Jun 18, 2022

We just use the io_kiocb passed in to find the io_uring_task, and we
already pass in the ctx via cd->ctx anyway.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

88f52eaa

io_uring: kbuf: kill __io_kbuf_recycle() · 024b8fde

Hao Xu authored Jun 22, 2022

__io_kbuf_recycle() is only called in io_kbuf_recycle(). Kill it and
tweak the code so that the legacy pbuf and ring pbuf code become clear.
Signed-off-by: Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220622055551.642370-1-hao.xu@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>

024b8fde

io_uring: trace task_work_run · c6dd763c

Dylan Yudaken authored Jun 22, 2022

Trace task_work_run to help provide stats on how often task work is run
and what batch sizes are coming through.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-9-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c6dd763c

io_uring: add trace event for running task work · eccd8801

Dylan Yudaken authored Jun 22, 2022

This is useful for investigating whether task_work is batching.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

eccd8801

io_uring: batch task_work · 3a0c037b

Dylan Yudaken authored Jun 22, 2022

Batching task work up is an important performance optimisation, as
task_work_add is expensive.

In order to keep the semantics, replace the task_list with a fake node
while processing the old list, and then do a cmpxchg at the end to see if
there is more work.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-6-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3a0c037b
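The fake-node trick from the commit above can be sketched with C11 atomics as follows. This is a user-space approximation under stated assumptions, not the kernel code: `struct work`, `work_add()`, `run_work_batch()`, and the `demo_*` helpers are hypothetical names. While the runner is busy, the list head is the fake node, so concurrent adders see a non-empty list and do not schedule the runner again; a final compare-and-swap back to NULL detects whether more work arrived.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct work {
	struct work *next;
	void (*fn)(struct work *w);
};

struct work_list {
	_Atomic(struct work *) first;
};

/* Placeholder head installed while a batch is being processed. */
static struct work fake_node;

/* Push w; returns true if the list was empty, i.e. the caller must kick the runner. */
static bool work_add(struct work_list *l, struct work *w)
{
	struct work *old = atomic_load(&l->first);

	assert(w != &fake_node);
	do {
		w->next = old;
	} while (!atomic_compare_exchange_weak(&l->first, &old, w));
	return old == NULL;
}

/* Run everything queued, including items added meanwhile; returns batches run. */
static unsigned int run_work_batch(struct work_list *l)
{
	unsigned int batches = 0;
	struct work *node = atomic_exchange(&l->first, &fake_node);

	for (;;) {
		/* A chain ends either at NULL or at the fake node. */
		while (node != NULL && node != &fake_node) {
			struct work *next = node->next;

			node->fn(node);
			node = next;
		}
		batches++;

		/* If the head is still the fake node, nothing new arrived. */
		struct work *expected = &fake_node;
		if (atomic_compare_exchange_strong(&l->first, &expected,
						   (struct work *)NULL))
			return batches;
		/* More work was added: grab it and keep the fake node in place. */
		node = atomic_exchange(&l->first, &fake_node);
	}
}

/* Demo helpers: one item queues a follow-up while the batch is running. */
static struct work_list demo_list;
static struct work demo_followup;
static unsigned int demo_ran;
static bool demo_followup_kicked;

static void demo_count(struct work *w)
{
	(void)w;
	demo_ran++;
}

static void demo_requeue(struct work *w)
{
	(void)w;
	demo_ran++;
	demo_followup.fn = demo_count;
	demo_followup_kicked = work_add(&demo_list, &demo_followup);
}
```

In the demo, an item queued by `demo_requeue()` during processing is picked up in a second batch by the same `run_work_batch()` call, and its `work_add()` reports a non-empty list, so no extra kick would be issued.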

io_uring: introduce llist helpers · 923d1592

Dylan Yudaken authored Jun 22, 2022

Introduce helpers to atomically switch llist.

Will later move this into common code.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-5-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

923d1592

io_uring: lockless task list · f88262e6

Dylan Yudaken authored Jun 22, 2022

With networking use cases we see contention on the spinlock used to
protect the task_list when multiple threads try to add completions at once.
Instead we can use a lockless list, and assume that the first caller to
add to the list is responsible for kicking off task work.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-4-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f88262e6
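A rough user-space analogue of the lockless list from the commit above, written with C11 atomics: the push reports whether the list was previously empty, and only that caller is responsible for kicking off processing. The `tw_*` names are hypothetical; the kernel uses its own llist primitives rather than this code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct tw_node {
	struct tw_node *next;
};

struct tw_list {
	_Atomic(struct tw_node *) first;
};

/*
 * Lock-free push onto the head of the list. Returns true if the list was
 * empty beforehand, in which case this caller should schedule the consumer.
 */
static bool tw_list_add(struct tw_list *l, struct tw_node *n)
{
	struct tw_node *old = atomic_load_explicit(&l->first, memory_order_relaxed);

	assert(n != NULL);
	do {
		n->next = old;	/* old is refreshed by a failed compare-exchange */
	} while (!atomic_compare_exchange_weak_explicit(&l->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
	return old == NULL;
}

/* Detach the whole list in one atomic step; entries come out newest first. */
static struct tw_node *tw_list_del_all(struct tw_list *l)
{
	return atomic_exchange_explicit(&l->first, (struct tw_node *)NULL,
					memory_order_acquire);
}
```

With many producers, exactly one of them observes the empty-to-non-empty transition, so the expensive wakeup happens once per batch instead of once per item, and no spinlock is taken on the add path.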

io_uring: remove __io_req_task_work_add · c34398a8

Dylan Yudaken authored Jun 22, 2022

This is no longer needed as there is only one caller.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-3-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c34398a8

io_uring: remove priority tw list optimisation · ed5ccb3b

Dylan Yudaken authored Jun 22, 2022

This optimisation has some built in assumptions that make it easy to
introduce bugs. It also does not have clear wins that make it worth keeping.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220622134028.2013417-2-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ed5ccb3b

io_uring: dedup io_run_task_work · 024f15e0

Pavel Begunkov authored Jun 21, 2022

We have an identical copy of io_run_task_work() for io-wq called
io_flush_signals(), deduplicate them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a157a4df5fa217b8bd03c73494f2fd0e24e44fbc.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

024f15e0

io_uring: move list helpers to a separate file · a6b21fbb

Pavel Begunkov authored Jun 21, 2022

It's annoying to have io-wq.h as a dependency every time we want some of
struct io_wq_work_list helpers, move them into a separate file.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c1d891ce12b30767d1d2a3b7db2ca3abc1ecc4a2.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a6b21fbb

io_uring: improve io_run_task_work() · 625d38b3

Pavel Begunkov authored Jun 21, 2022

Since SQPOLL now uses TWA_SIGNAL_NO_IPI, there won't be task work items
without TIF_NOTIFY_SIGNAL. Simplify io_run_task_work() by removing the
task->task_works check. Even though it looks like it doesn't cause extra
cache bouncing, it's still nice not to touch it an extra time when it
might not be cached.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/75d4f34b0c671075892821a409e28da6cb1d64fe.1655802465.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

625d38b3

io_uring: optimize io_uring_task layout · 4a0fef62

Pavel Begunkov authored Jun 20, 2022

task_work bits of io_uring_task are split into two cache lines causing
extra cache bouncing, place them into a separate cache line. Also move
the most used submission path fields closer together, so they are hot.

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

4a0fef62
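The layout idea in the commit above, keeping fields touched by different paths on different cache lines, can be illustrated with C11 `alignas`. The structure below is a hypothetical sketch whose field names are invented, and the 64-byte line size is an assumption that varies by CPU; kernel code would normally use annotations such as `____cacheline_aligned_in_smp` instead.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_CACHELINE 64	/* assumed cache line size; differs by CPU */

struct task_ctx_sketch {
	/* Submission-path fields, grouped so they share one cache line. */
	int cached_refs;
	const void *last_ctx;
	void *io_wq;

	/* Task-work fields start on their own cache line. */
	alignas(SKETCH_CACHELINE) void *task_list;
	bool task_running;
};

/* Compile-time checks that the layout matches the stated intent. */
static_assert(offsetof(struct task_ctx_sketch, task_list) % SKETCH_CACHELINE == 0,
	      "task-work fields must start on a cache line boundary");
static_assert(offsetof(struct task_ctx_sketch, io_wq) + sizeof(void *) <= SKETCH_CACHELINE,
	      "submission fields must fit in the first cache line");
```

Writers of the task-work fields then invalidate only their own line, while the submission path keeps its fields hot in another.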

io_uring: add a warn_once for poll_find · bce5d70c

Pavel Begunkov authored Jun 20, 2022

io_poll_remove() expects poll_find() to search only for poll requests and
passes a flag for this. Just be a little bit extra cautious considering
lots of recent poll/cancellation changes and add a WARN_ON_ONCE checking
that we don't get an apoll'ed request.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ec9a66f1e22f99dcd02288d4e42f3cc6bb357804.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

bce5d70c

io_uring: consistent naming for inline completion · 9da070b1

Pavel Begunkov authored Jun 20, 2022

Improve naming of the inline/deferred completion helper so it's
consistent with its *_post counterpart. Add some comments and extra
lockdeps to ensure the locking is done right.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/797c619943dac06529e9d3fcb16e4c3cde6ad1a3.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9da070b1

io_uring: move io_import_fixed() · c059f785

Pavel Begunkov authored Jun 20, 2022

Move io_import_fixed() into rsrc.c where it belongs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d5becb21f332b4fef6a7cedd6a50e65e2371630.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c059f785

io_uring: opcode independent fixed buf import · f337a84d

Pavel Begunkov authored Jun 20, 2022

Fixed buffers are generic infrastructure, make io_import_fixed() opcode
agnostic.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b1e765c8a1c2c913a05a28d2399fc53e1d3cf37a.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f337a84d

io_uring: add io_commit_cqring_flush() · 46929b08

Pavel Begunkov authored Jun 20, 2022

Since __io_commit_cqring_flush users moved to different files, introduce
io_commit_cqring_flush() helper and encapsulate all flags testing details
inside.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0da03887435dd9869ffe46dcd3962bf104afcca3.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

46929b08

io_uring: introduce locking helpers for CQE posting · 25399321

Pavel Begunkov authored Jun 20, 2022

spin_lock(&ctx->completion_lock);
/* post CQEs */
io_commit_cqring(ctx);
spin_unlock(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

We have many places repeating this sequence, and the three-function
unlock section is not perfect from the maintenance perspective and also
makes it harder to add new locking/sync tricks.

Introduce two helpers. io_cq_lock(), which is simple and only grabs
->completion_lock, and io_cq_unlock_post() encapsulating the three call
section.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/fe0c682bf7f7b55d9be55b0d034be9c1949277dc.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

25399321
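The shape of the two helpers described in the commit above can be sketched in user space as below. This is an analogy only: an `atomic_flag` spin loop stands in for the kernel spinlock, a counter stands in for waking waiters, and `struct cq_sketch` with its fields is hypothetical. What it shows is the pairing of a plain lock helper with an unlock helper that wraps the commit, unlock, and notify steps in a fixed order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct cq_sketch {
	atomic_flag completion_lock;	/* stand-in for the spinlock */
	unsigned int cached_tail;	/* entries filled so far, lock protected */
	unsigned int published_tail;	/* tail made visible by the commit step */
	atomic_uint wakeups;		/* stand-in for waking up waiters */
};

/* Simple helper: only takes the lock. */
static void cq_lock(struct cq_sketch *cq)
{
	while (atomic_flag_test_and_set_explicit(&cq->completion_lock,
						 memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

/* Encapsulates the three-step tail: commit, unlock, then notify. */
static void cq_unlock_post(struct cq_sketch *cq)
{
	cq->published_tail = cq->cached_tail;
	atomic_flag_clear_explicit(&cq->completion_lock, memory_order_release);
	atomic_fetch_add(&cq->wakeups, 1);	/* done outside the lock */
}

/* Typical call site: lock, post, and let the helper handle the rest. */
static void cq_post_one(struct cq_sketch *cq)
{
	assert(cq != NULL);
	cq_lock(cq);
	cq->cached_tail++;	/* "post CQEs" */
	cq_unlock_post(cq);
}
```

With the sequence hidden behind one function, a later change to the locking or notification scheme touches a single helper instead of every call site.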

io_uring: hide eventfd assumptions in eventfd paths · 305bef98

Pavel Begunkov authored Jun 20, 2022

Some io_uring-eventfd users assume that there won't be spurious wakeups.
That assumption has to be honoured by all io_cqring_ev_posted() callers,
which is inconvenient and from time to time leads to problems but should
be maintained to not break the userspace.

Instead of making the callers track whether a CQE was posted or not, hide
it inside io_eventfd_signal(). It saves ->cached_cq_tail it saw last time
and triggers the eventfd only when ->cached_cq_tail changed since then.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ffc66bae37a2513080b601e4370e147faaa72c5.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

305bef98
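The suppression logic described in the commit above amounts to remembering the last tail value that triggered a signal. A minimal sketch, assuming hypothetical names (`struct ring_sketch`, `last_signalled_tail`, `eventfd_signal_sketch()`) and a counter in place of the real eventfd:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ring_sketch {
	unsigned int cached_cq_tail;		/* advances as CQEs are posted */
	unsigned int last_signalled_tail;	/* tail seen at the last signal */
	unsigned int signals;			/* stand-in for eventfd writes */
};

/*
 * Signal only if the tail moved since the previous signal, so callers may
 * invoke this unconditionally without causing spurious wakeups.
 */
static bool eventfd_signal_sketch(struct ring_sketch *r)
{
	assert(r != NULL);
	if (r->cached_cq_tail == r->last_signalled_tail)
		return false;	/* nothing new was posted */
	r->last_signalled_tail = r->cached_cq_tail;
	r->signals++;
	return true;
}
```

Callers no longer have to track whether they actually posted a CQE; calling the function twice without posting in between produces a single signal.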

io_uring: fix io_poll_remove_all clang warnings · b321823a

Pavel Begunkov authored Jun 20, 2022

clang complains on bitwise operations with bools, add a bit more
verbosity to better show that we want to call io_poll_remove_all_table()
twice but with different arguments.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f11d21dcdf9233e0eeb15fa13b858a05a78eb310.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b321823a
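The pattern behind the commit above, calling a bool-returning function twice without short-circuiting while avoiding a bitwise OR of two call expressions, could be written as in this sketch. `remove_from_table()` and `remove_all_sketch()` are hypothetical stand-ins, and the scan counter exists only to make the two calls observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: scans one table and reports whether it found anything. */
static bool remove_from_table(unsigned int *scans, bool table_has_match)
{
	assert(scans != NULL);
	(*scans)++;
	return table_has_match;
}

/* Scan both tables unconditionally and report whether either had a match. */
static bool remove_all_sketch(unsigned int *scans, bool first_has_match,
			      bool second_has_match)
{
	bool found;

	/*
	 * f(a) | f(b) on bools is the form clang warns about, and f(a) || f(b)
	 * would skip the second call whenever the first returns true. Two
	 * statements make the intent explicit.
	 */
	found = remove_from_table(scans, first_has_match);
	found |= remove_from_table(scans, second_has_match);
	return found;
}
```

Both tables are always visited, which matters when the calls have side effects such as removing entries.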

io_uring: improve task exit timeout cancellations · ba3cdb6f

Pavel Begunkov authored Jun 20, 2022

Don't spin trying to cancel timeouts that are reachable but not
cancellable, e.g. already executing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ab8a7440a60bbdf69ae514f672ad050e43dd1b03.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ba3cdb6f

io_uring: fix multi ctx cancellation · affa87db

Pavel Begunkov authored Jun 20, 2022

io_uring_try_cancel_requests() loops until there is nothing left to do
with the ring, however there might be several rings and they might have
dependencies between them, e.g. via poll requests.

Instead of cancelling rings one by one, try to cancel them all and only
then loop over if we still potentially have some work to do.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8d491fe02d8ac4c77ff38061cf86b9a827e8845c.1655684496.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

affa87db

io_uring: remove ->flush_cqes optimisation · d9dee430

Pavel Begunkov authored Jun 19, 2022

It's not clear how widely used IOSQE_CQE_SKIP_SUCCESS is, and how often
the ->flush_cqes flag prevents completions from being flushed. Sometimes
a high level of concurrency enables it at least for one CQE, but
sometimes it doesn't save much because nobody is waiting on the CQ.

Remove the ->flush_cqes flag and the optimisation; it should benefit the
normal use case. Note that there is no spurious eventfd problem with
that, as checks for spuriousness were incorporated into
io_eventfd_signal().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/692e81eeddccc096f449a7960365fa7b4a18f8e6.1655637157.git.asml.silence@gmail.com
[axboe: remove now dead state->flush_cqes variable]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

d9dee430