Commits · 9a321c98490c70653a4f0a10b28c45edbcf7a93d · Kirill Smelkov / linux

12 Apr, 2021 24 commits

io_uring: set proper FFS* flags on reg file update · 9a321c98

Pavel Begunkov authored Apr 01, 2021

Set FFS_* flags (e.g. FFS_ASYNC_READ) not only in initial registration
but also on registered files update. Not a bug, but may miss getting
profit out of the feature.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/df29a841a2d3d3695b509cdffce5070777d9d942.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

9a321c98

io_uring: deduplicate NOSIGNAL setting · 04411806

Pavel Begunkov authored Apr 01, 2021

Set MSG_NOSIGNAL and REQ_F_NOWAIT in send/recv prep routines and don't
duplicate it in all four send/recv handlers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e1133a3ed1c0e192975b7341ea4b0bf91f63b132.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

04411806

io_uring: put link timeout req consistently · df9727af

Pavel Begunkov authored Apr 01, 2021

Don't put linked timeout req in io_async_find_and_cancel() but do it in
io_link_timeout_fn(), so we have only one point for that and won't have
to do it differently as it's now (put vs put_deferred). Btw, improve a
bit io_async_find_and_cancel()'s locking.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d75b70957f245275ab7cba83e0ac9c1b86aae78a.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

df9727af

io_uring: simplify overflow handling · c4ea060e

Pavel Begunkov authored Apr 01, 2021

Overflowed CQEs doesn't lock requests anymore, so we don't care so much
about cancelling them, so kill cq_overflow_flushed and simplify the
code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5799867aeba9e713c32f49aef78e5e1aef9fbc43.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

c4ea060e

io_uring: lock annotate timeouts and poll · e07785b0

Pavel Begunkov authored Apr 01, 2021

Add timeout and poll ->comletion_lock annotations for Sparse, makes life
easier while looking at the functions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2345325643093d41543383ba985a735aeb899eac.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e07785b0

io_uring: kill unused forward decls · 47e90392

Pavel Begunkov authored Apr 01, 2021

Kill unused forward declarations for io_ring_file_put() and
io_queue_next(). Also btw rename the first one.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/64aa27c3f9662e14615cc119189f5eaf12989671.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

47e90392

io_uring: store reg buffer end instead of length · 4751f53d

Pavel Begunkov authored Apr 01, 2021

It's a bit more convenient for us to store a registered buffer end
address instead of length, see struct io_mapped_ubuf, as it allow to not
recompute it every time.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/39164403fe92f1dc437af134adeec2423cdf9395.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

4751f53d

io_uring: improve import_fixed overflow checks · 75769e3f

Pavel Begunkov authored Apr 01, 2021

Replace a hand-coded overflow check with a specialised function. Even
though compilers are smart enough to generate identical binary (i.e.
check carry bit), but it's more foolproof and conveys the intention
better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e437dcdc929bacbb6f11a4824ecbbf17225cb82a.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

75769e3f

io_uring: refactor io_async_cancel() · 0aec38fd

Pavel Begunkov authored Apr 01, 2021

Remove extra tctx==NULL checks that are already done by
io_async_cancel_one().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70c2a8b958d942e86958a28af0452966ce1095b0.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

0aec38fd

io_uring: remove unused hash_wait · e146a4a3

Pavel Begunkov authored Apr 01, 2021

No users of io_uring_ctx::hash_wait left, kill it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e25cb83c233a5f75f15275596b49fbafbea606fa.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e146a4a3

io_uring: better ref handling in poll_remove_one · 7394161c

Pavel Begunkov authored Apr 01, 2021

Instead of io_put_req() to drop not a final ref, use req_ref_put(),
which is slimmer and will also check the invariant.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/85b5774ce13ae55cc2e705abdc8cbafe1212f1bd.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

7394161c

io_uring: combine lock/unlock sections on exit · 89b5066e

Pavel Begunkov authored Apr 01, 2021

io_ring_exit_work() already does uring_lock lock/unlock, no need to
repeat it for lock waiting trick in io_ring_ctx_free(). Move the waiting
with comments and spinlocking into io_ring_exit_work.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a8ae0589b0ea64ad4791e2c282e4e9b713dd7024.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

89b5066e

io_uring: remove useless is_dying check on quiesce · 215c3902

Pavel Begunkov authored Apr 01, 2021

rsrc_data refs should always be valid for potential submitters,
io_rsrc_ref_quiesce() restores it before unlocking, so
percpu_ref_is_dying() check in io_sqe_files_unregister() does nothing
and misleading. Concurrent quiesce is prevented with
struct io_rsrc_data::quiesce.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bf97055e1748ee3a382e66daf384a469eb90b931.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

215c3902

io_uring: reuse io_rsrc_node_destroy() · 28a9fe25

Pavel Begunkov authored Apr 01, 2021

Reuse io_rsrc_node_destroy() in __io_rsrc_put_work(). Also move it to a
more appropriate place -- to the other node routines, and remove forward
declaration.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cccafba41aee1e5bb59988704885b1340aef3a27.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

28a9fe25

io_uring: ctx-wide rsrc nodes · a7f0ed5a

Pavel Begunkov authored Apr 01, 2021

If we're going to ever support multiple types of resources we need
shared rsrc nodes to not bloat requests, that is implemented in this
patch. It also gives a nicer API and saves one pointer dereference
in io_req_set_rsrc_node().

We may say that all requests bound to a resource belong to one and only
one rsrc node, and considering that nodes are removed and recycled
strictly in-order, this separates requests into generations, where
generation are changed on each node switch (i.e. io_rsrc_node_switch()).

The API is simple, io_rsrc_node_switch() switches to a new generation if
needed, and also optionally kills a passed in io_rsrc_data. Each call to
io_rsrc_node_switch() have to be preceded with
io_rsrc_node_switch_start(). The start function is idempotent and should
not necessarily be followed by switch.

One difference is that once a node was set it will always retain a valid
rsrc node, even on unregister. It may be a nuisance at the moment, but
makes much sense for multiple types of resources. Another thing changed
is that nodes are bound to/associated with a io_rsrc_data later just
before killing (i.e. switching).
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e9c693b4b9a2f47aa784b616ce29843021bb65a.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

a7f0ed5a

io_uring: refactor io_queue_rsrc_removal() · e7c78371

Pavel Begunkov authored Apr 01, 2021

Pass rsrc_node into io_queue_rsrc_removal() explicitly. Just a
simple preparation patch, makes following changes nicer.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/002889ce4de7baf287f2b010eef86ffe889174c6.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

e7c78371

io_uring: move rsrc_put callback into io_rsrc_data · 40ae0ff7

Pavel Begunkov authored Apr 01, 2021

io_rsrc_node's callback operates only on a single io_rsrc_data and only
with its resources, so rsrc_put() callback is actually a property of
io_rsrc_data. Move it there, it makes code much nicecr.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9417c2fba3c09e8668f05747006a603d416d34b4.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

40ae0ff7

io_uring: encapsulate rsrc node manipulations · 82fbcfa9

Pavel Begunkov authored Apr 01, 2021

io_rsrc_node_get() and io_rsrc_node_set() are always used together,
merge them into one so most users don't even see io_rsrc_node and don't
need to care about it.

It helped to catch io_sqe_files_register() inferring rsrc data argument
for get and set differently, not a problem but a good sign.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0827b080b2e61b3dec795380f7e1a1995595d41f.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

82fbcfa9

io_uring: use rsrc prealloc infra for files reg · f3baed39

Pavel Begunkov authored Apr 01, 2021

Keep it consistent with update and use io_rsrc_node_prealloc() +
io_rsrc_node_get() in io_sqe_files_register() as well, that will be used
in future patches, not as error prone and allows to deduplicate
rsrc_node init.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cf87321e6be5e38f4dc7fe5079d2aa6945b1ace0.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

f3baed39

io_uring: simplify io_rsrc_node_ref_zero · 221aa924

Pavel Begunkov authored Apr 01, 2021

Replace queue_delayed_work() with mod_delayed_work() in
io_rsrc_node_ref_zero() as the later one can schedule a new work, and
cleanup it further for better readability.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3b2b23e3a1ea4bbf789cd61815d33e05d9ff945e.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

221aa924

io_uring: name rsrc bits consistently · b895c9a6

Pavel Begunkov authored Apr 01, 2021

Keep resource related structs' and functions' naming consistent, in
particular use "io_rsrc" prefix for everything.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/962f5acdf810f3a62831e65da3932cde24f6d9df.1617287883.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

b895c9a6

io-wq: cancel task_work on exit only targeting the current 'wq' · c80ca470

Jens Axboe authored Apr 01, 2021

With using task_work_cancel(), we're potentially canceling task_work
that isn't related to this specific io_wq. Use the newly added
task_work_cancel_match() to ensure that we only remove and cancel work
items that are specific to this io_wq.

Fixes: 685fe7fe ("io-wq: eliminate the need for a manager thread")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c80ca470

task_work: add helper for more targeted task_work canceling · c7aab1a7

Jens Axboe authored Apr 01, 2021

The only exported helper we have right now is task_work_cancel(), which
cancels any task_work from a given task where func matches the queued
work item. This is a bit too coarse for some use cases. Add a
task_work_cancel_match() that allows to more specifically target
individual work items outside of purely the callback function used.

task_work_cancel() can be trivially implemented on top of that, hence do
so.
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c7aab1a7

io_uring: fix race around poll update and poll triggering · b2e720ac

Jens Axboe authored Mar 31, 2021

Joakim reports that in some conditions he sees a multishot poll request
being canceled, and that it coincides with getting -EALREADY on
modification. As part of the poll update procedure, there's a small window
where the request is marked as canceled, and if this coincides with the
event actually triggering, then we can get a spurious -ECANCELED and
termination of the multishot request.

Don't mark the poll request as being canceled for update. We also don't
care if we race on removal unless it's a one-shot request, we can safely
updated for either case.

Fixes: b69de288 ("io_uring: allow events and user_data update of running poll requests")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b2e720ac

11 Apr, 2021 16 commits

io_uring: reg buffer overflow checks hardening · 50e96989

Pavel Begunkov authored Mar 24, 2021

We are safe with overflows in io_sqe_buffer_register() because it will
just yield alloc failure, but it's nicer to check explicitly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2b0625551be3d97b80a5fd21c8cd79dc1c91f0b5.1616624589.git.asml.silence@gmail.comSigned-off-by: Jens Axboe <axboe@kernel.dk>

50e96989

io_uring: allow SQPOLL without CAP_SYS_ADMIN or CAP_SYS_NICE · 548d819d

Jens Axboe authored Mar 25, 2021

Now that we have any worker being attached to the original task as
threads, accounting of CPU time is directly attributed to the original
task as well. This means that we no longer have to restrict SQPOLL to
needing elevated privileges, as it's really no different from just having
the task spawn a busy looping thread in userspace.
Reported-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

548d819d

io-wq: eliminate the need for a manager thread · 685fe7fe

Jens Axboe authored Mar 08, 2021

io-wq relies on a manager thread to create/fork new workers, as needed.
But there's really no strong need for it anymore. We have the following
cases that fork a new worker:

1) Work queue. This is done from the task itself always, and it's trivial
to create a worker off that path, if needed.

2) All workers have gone to sleep, and we have more work. This is called
off the sched out path. For this case, use a task_work items to queue
a fork-worker operation.

3) Hashed work completion. Don't think we need to do anything off this
case. If need be, it could just use approach 2 as well.

Part of this change is incrementing the running worker count before the
fork, to avoid cases where we observe we need a worker and then queue
creation of one. Then new work comes in, we fork a new one. That last
queue operation should have waited for the previous worker to come up,
it's quite possible we don't even need it. Hence move the worker running
from before we fork it off to more efficiently handle that case.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

685fe7fe

kernel: allow fork with TIF_NOTIFY_SIGNAL pending · 66ae0d1e

Jens Axboe authored Mar 22, 2021

fork() fails if signal_pending() is true, but there are two conditions
that can lead to that:

1) An actual signal is pending. We want fork to fail for that one, like
   we always have.

2) TIF_NOTIFY_SIGNAL is pending, because the task has pending task_work.
   We don't need to make it fail for that case.

Allow fork() to proceed if just task_work is pending, by changing the
signal_pending() check to task_sigpending().
Signed-off-by: Jens Axboe <axboe@kernel.dk>

66ae0d1e

io_uring: allow events and user_data update of running poll requests · b69de288

Jens Axboe authored Mar 17, 2021

This adds two new POLL_ADD flags, IORING_POLL_UPDATE_EVENTS and
IORING_POLL_UPDATE_USER_DATA. As with the other POLL_ADD flag, these are
masked into sqe->len. If set, the POLL_ADD will have the following
behavior:

- sqe->addr must contain the the user_data of the poll request that
  needs to be modified. This field is otherwise invalid for a POLL_ADD
  command.

- If IORING_POLL_UPDATE_EVENTS is set, sqe->poll_events must contain the
  new mask for the existing poll request. There are no checks for whether
  these are identical or not, if a matching poll request is found, then it
  is re-armed with the new mask.

- If IORING_POLL_UPDATE_USER_DATA is set, sqe->off must contain the new
  user_data for the existing poll request.

A POLL_ADD with any of these flags set may complete with any of the
following results:

1) 0, which means that we successfully found the existing poll request
   specified, and performed the re-arm procedure. Any error from that
   re-arm will be exposed as a completion event for that original poll
   request, not for the update request.
2) -ENOENT, if no existing poll request was found with the given
   user_data.
3) -EALREADY, if the existing poll request was already in the process of
   being removed/canceled/completing.
4) -EACCES, if an attempt was made to modify an internal poll request
   (eg not one originally issued ass IORING_OP_POLL_ADD).

The usual -EINVAL cases apply as well, if any invalid fields are set
in the sqe for this command type.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b69de288

io_uring: abstract out a io_poll_find_helper() · b2cb805f

Jens Axboe authored Mar 17, 2021

We'll need this helper for another purpose, for now just abstract it
out and have io_poll_cancel() use it for lookups.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b2cb805f

io_uring: terminate multishot poll for CQ ring overflow · 5082620f

Jens Axboe authored Feb 23, 2021

If we hit overflow and fail to allocate an overflow entry for the
completion, terminate the multishot poll mode.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

5082620f

io_uring: abstract out helper for removing poll waitqs/hashes · b2c3f7e1

Jens Axboe authored Feb 23, 2021

No functional changes in this patch, just preparation for kill multishot
poll on CQ overflow.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b2c3f7e1

io_uring: add multishot mode for IORING_OP_POLL_ADD · 88e41cf9

Jens Axboe authored Feb 22, 2021

The default io_uring poll mode is one-shot, where once the event triggers,
the poll command is completed and won't trigger any further events. If
we're doing repeated polling on the same file or socket, then it can be
more efficient to do multishot, where we keep triggering whenever the
event becomes true.

This deviates from the usual norm of having one CQE per SQE submitted. Add
a CQE flag, IORING_CQE_F_MORE, which tells the application to expect
further completion events from the submitted SQE. Right now the only user
of this is POLL_ADD in multishot mode.

Since sqe->poll_events is using the space that we normally use for adding
flags to commands, use sqe->len for the flag space for POLL_ADD. Multishot
mode is selected by setting IORING_POLL_ADD_MULTI in sqe->len. An
application should expect more CQEs for the specificed SQE if the CQE is
flagged with IORING_CQE_F_MORE. In multishot mode, only cancelation or an
error will terminate the poll request, in which case the flag will be
cleared.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

88e41cf9

io_uring: include cflags in completion trace event · 7471e1af

Jens Axboe authored Feb 22, 2021

We should be including the completion flags for better introspection on
exactly what completion event was logged.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

7471e1af

io_uring: allocate memory for overflowed CQEs · 6c2450ae

Pavel Begunkov authored Feb 23, 2021

Instead of using a request itself for overflowed CQE stashing, allocate a
separate entry. The disadvantage is that the allocation may fail and it
will be accounted as lost (see rings->cq_overflow), so we lose reliability
in case of memory pressure if the application is driving the CQ ring into
overflow. However, it opens a way for for multiple CQEs per an SQE and
even generating SQE-less CQEs.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: use GFP_ATOMIC | __GFP_ACCOUNT]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

6c2450ae

io_uring: mask in error/nval/hangup consistently for poll · 464dca61

Jens Axboe authored Mar 19, 2021

Instead of masking these in as part of regular POLL_ADD prep, do it in
io_init_poll_iocb(), and include NVAL as that's generally unmaskable,
and RDHUP alongside the HUP that is already set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

464dca61

io_uring: optimise rw complete error handling · 9532b99b

Pavel Begunkov authored Mar 22, 2021

Expect read/write to succeed and create a hot path for this case, in
particular hide all error handling with resubmission under a single
check with the desired result.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

9532b99b

io_uring: hide iter revert in resubmit_prep · ab454438

Pavel Begunkov authored Mar 22, 2021

Move iov_iter_revert() resetting iterator in case of -EIOCBQUEUED into
io_resubmit_prep(), so we don't do heavy revert in hot path, also saves
a couple of checks.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

ab454438

io_uring: don't alter iopoll reissue fail ret code · 8c130827

Pavel Begunkov authored Mar 22, 2021

When reissue_prep failed in io_complete_rw_iopoll(), we change return
code to -EIO to prevent io_iopoll_complete() from doing resubmission.
Mark requests with a new flag (i.e. REQ_F_DONT_REISSUE) instead and
retain the original return value.

It also removes io_rw_reissue() from io_iopoll_complete() that will be
used later.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

8c130827

io_uring: optimise kiocb_end_write for !ISREG · 1c98679d

Pavel Begunkov authored Mar 22, 2021

file_end_write() is only for regular files, so the function do a couple
of dereferences to get inode and check for it. However, we already have
REQ_F_ISREG at hand, just use it and inline file_end_write().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1c98679d