- 19 May, 2022 1 commit
-
-
Vasily Averin authored
Currently 'make C=1 fs/io_uring.o' generates a sparse warning:

  CHECK   fs/io_uring.c
  fs/io_uring.c: note: in included file (through include/trace/trace_events.h, include/trace/define_trace.h, include/trace/events/io_uring.h):
  ./include/trace/events/io_uring.h:488:1: warning: incorrect type in assignment (different base types)
     expected unsigned int [usertype] op_flags
     got restricted __kernel_rwf_t const [usertype] rw_flags

This happens on the cast of sqe->rw_flags, which is defined as __kernel_rwf_t: the type is bitwise and requires a __force attribute for any cast. However, rw_flags is a member of a union, and its access can safely be replaced by using one of its neighbours, so let's use poll32_events to fix the sparse warning.

Signed-off-by:
Vasily Averin <vvs@openvz.org>
Link: https://lore.kernel.org/r/6f009241-a63f-ae43-a04b-62841aaef293@openvz.org
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
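An illustrative sketch of the bitwise-type rule (the union and function below are stand-ins, not the actual trace-event code): __kernel_rwf_t is declared as 'typedef int __bitwise __kernel_rwf_t', so sparse rejects any implicit conversion out of it, while a plain-typed union neighbour aliases the same storage without needing any annotation:

	union sqe_flags {			/* stand-in for the sqe union members */
		__kernel_rwf_t	rw_flags;	/* bitwise: needs (__force u32) to cast out */
		__u32		poll32_events;	/* plain integer: no annotation needed */
	};

	static u32 get_op_flags(const union sqe_flags *s)
	{
		/* return s->rw_flags;		sparse: different base types */
		return s->poll32_events;	/* what the fix uses instead */
	}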
-
- 18 May, 2022 13 commits
-
-
Jens Axboe authored
It's nonsensical to register a provided buffer ring if a classic provided buffer group with the same ID exists. Depending on the order in which we decide what type to pick, the other type will never get used. Explicitly disallow it and return an error if this is attempted.

Fixes: c7fb1942 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We use ->buf_pages != 0 to tell if this is a shared buffer ring or a classic provided buffer group. If we unregister the shared ring and then attempt to use it, buf_pages is zero yet the classic list head isn't properly initialized. This causes io_buffer_select() to think that we have classic buffers available, but then we crash when we try to get one from the list. Just initialize the list if we unregister a shared buffer ring, leaving it in a sane state for either re-registration or for attempting to use it. And do the same for the initial setup from the classic path.

Fixes: c7fb1942 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but for fixed buffers as well. It makes the rsrc API more consistent.

Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
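A one-call sketch, assuming liburing's io_uring_register_buffers_sparse() wrapper for this flag:

	#include <liburing.h>

	/* Reserve 64 empty fixed-buffer slots; individual slots can later
	 * be populated with io_uring_register_buffers_update_tag(). */
	static int setup_sparse_buffers(struct io_uring *ring)
	{
		return io_uring_register_buffers_sparse(ring, 64);
	}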
-
Christoph Hellwig authored
Accessing the file table needs an rcu_dereference_protected().

Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
POLL* are unannotated values for the userspace ABI, while everything in-kernel should use EPOLL* and the __poll_t type. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
apoll_events is fed to vfs_poll and the poll tables, so it should be a __poll_t. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
io_file_get_normal isn't marked inline, so don't claim it as such in the forward declaration. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
ERR_PTR abuses the high bits of a pointer to transport error information. This is only safe for kernel pointers and not user pointers. Fix io_buffer_select and its helpers to just return NULL for failure and get rid of this abuse. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
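For context, a sketch of why this trick is only sound for kernel pointers (constants as in include/linux/err.h):

	#define MAX_ERRNO	4095
	#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

	/* Kernel allocations never land in the top MAX_ERRNO bytes of the
	 * address space, so encoding -ENOMEM etc. as a pointer value is
	 * unambiguous. A user pointer is caller-chosen and may fall in
	 * exactly that range, so IS_ERR() on it can misfire; returning
	 * NULL on failure avoids the ambiguity entirely. */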
-
Christoph Hellwig authored
Use the proper type. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Provided buffers allow an application to supply io_uring with buffers that can then be grabbed for a read/receive request, when the data source is ready to deliver data. The existing scheme relies on using IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use in real world applications. It's pretty efficient if the application is able to supply back batches of provided buffers when they have been consumed and the application is ready to recycle them, but if fragmentation occurs in the buffer space, it can become difficult to supply enough buffers at a time. This hurts efficiency.

Add a register op, IORING_REGISTER_PBUF_RING, which allows an application to set up a shared queue for each buffer group of provided buffers. The application can then supply buffers simply by adding them to this ring, and the kernel can consume them just as easily. The ring shares the head with the application, the tail remains private in the kernel.

Provided buffers set up with IORING_REGISTER_PBUF_RING cannot use IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the ring, they must use the mapped ring. Mapped provided buffer rings can co-exist with normal provided buffers, just not within the same group ID.

To gauge the overhead of the existing scheme and evaluate the mapped ring approach, a simple NOP benchmark was written. It uses a ring of 128 entries, and submits/completes 32 at a time. 'Replenish' is how many buffers are provided back at a time after they have been consumed:

Test			Replenish	NOPs/sec
================================================================
No provided buffers	NA		~30M
Provided buffers	32		~16M
Provided buffers	1		~10M
Ring buffers		32		~27M
Ring buffers		1		~27M

The ring mapped buffers perform almost as well as not using provided buffers at all, and they don't care if you provide 1 or more back at the same time. This means applications can just replenish as they go, rather than need to batch and compact, further reducing overhead in the application. The NOP benchmark above doesn't need to do any compaction, so that overhead isn't even reflected in the above test.

Co-developed-by:
Dylan Yudaken <dylany@fb.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
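A minimal liburing-based sketch of how userspace drives this, assuming liburing's io_uring_register_buf_ring() and io_uring_buf_ring_*() wrappers around the raw IORING_REGISTER_PBUF_RING op; error handling trimmed:

	#include <liburing.h>
	#include <stdlib.h>

	#define BGID		1
	#define ENTRIES		128	/* must be a power of 2 */
	#define BUF_SIZE	4096

	/* Register a ring of ENTRIES provided buffers for group BGID and
	 * hand every buffer to the kernel in one go. */
	static struct io_uring_buf_ring *setup_buf_ring(struct io_uring *ring,
							char *bufs)
	{
		struct io_uring_buf_reg reg = { .bgid = BGID,
						.ring_entries = ENTRIES };
		struct io_uring_buf_ring *br;
		int i;

		/* the ring memory is allocated by the application */
		if (posix_memalign((void **) &br, 4096,
				   ENTRIES * sizeof(struct io_uring_buf)))
			return NULL;
		reg.ring_addr = (unsigned long) br;
		if (io_uring_register_buf_ring(ring, &reg, 0))
			return NULL;

		io_uring_buf_ring_init(br);
		for (i = 0; i < ENTRIES; i++)
			io_uring_buf_ring_add(br, bufs + i * BUF_SIZE, BUF_SIZE, i,
					      io_uring_buf_ring_mask(ENTRIES), i);
		/* publish all ENTRIES buffers to the kernel at once */
		io_uring_buf_ring_advance(br, ENTRIES);
		return br;
	}

Once a completion has consumed a buffer, re-supplying it is just another io_uring_buf_ring_add() plus a single io_uring_buf_ring_advance(), which is the cheap per-buffer replenish measured in the 'Ring buffers / 1' row above.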
-
Jens Axboe authored
Abstract this out from io_sqe_buffer_register() so we can use it elsewhere too without duplicating this code. No intended functional changes in this patch. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Obviously not really useful since it's not transferring data, but it is helpful in benchmarking overhead of provided buffers. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
io_provided_buffer_select() must drop the submit lock, if needed, even in the error handling case. Failure to do so will leave us with the ctx->uring_lock held, causing spew like:

====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
 #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48

stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xa4/0xd4
 show_stack+0x14/0x5c
 dump_stack_lvl+0x88/0xb0
 dump_stack+0x14/0x2c
 debug_check_no_locks_held+0x84/0x90
 try_to_freeze.isra.0+0x18/0x44
 get_signal+0x94/0x6ec
 io_wqe_worker+0x1d8/0x2b4
 ret_from_fork+0x10/0x20

and triggering later hangs off get_signal() because we attempt to re-grab the lock.

Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 149c69b0 ("io_uring: abstract out provided buffer list selection")
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 14 May, 2022 4 commits
-
-
Hao Xu authored
Refactor io_accept() to support multishot mode.

Theoretical analysis:

1) when connections come in fast
   - singleshot:
         add accept sqe(userspace) --> accept inline
                     ^                 |
                     |-----------------|
   - multishot:
         add accept sqe(userspace) --> accept inline
                                       ^     |
                                       |--*--|
     we do accept repeatedly in * place until we get EAGAIN

2) when connections come in at low pressure
   similar to 1), we reduce a lot of userspace-kernel context switches and useless vfs_poll() calls

Tests:
Did some tests, which go this way:

  server    client(multiple)
  accept    connect
  read      write
  write     read
  close     close

Basically, raise up a number of clients (on the same machine as the server) to connect to the server, and then write some data to it. The server will write those data back to the client after it receives them, and then close the connection after the write returns. Then the client will read the data and then close the connection. Here I test 10000 clients connecting to one server, data size 128 bytes. And each client has a goroutine for it, so they come to the server in a short time.

Tested 20 times before/after this patchset, time spent (unit: cycle, which is the return value of clock()):

before:
  (1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
  +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
  +1934226+1914385)/20.0 = 1927633.75

after:
  (1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
  +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
  +1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
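A hedged userspace sketch, assuming liburing's io_uring_prep_multishot_accept() wrapper for this mode:

	#include <liburing.h>

	/* One SQE arms the kernel to post a CQE per incoming connection;
	 * the request stays active while IORING_CQE_F_MORE is set in
	 * cqe->flags, and cqe->res carries the accepted fd (or -errno). */
	static void arm_multishot_accept(struct io_uring *ring, int listen_fd)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
		io_uring_submit(ring);
	}

If a CQE arrives without IORING_CQE_F_MORE, the multishot request has terminated and must be re-armed with a fresh SQE.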
-
Hao Xu authored
For operations like accept, multishot is a useful feature, since we can reduce the number of accept SQEs. Let's integrate it into fast poll; it may be useful for other operations in the future.

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hao Xu authored
Add a flag to indicate multishot mode for fast poll. Currently only accept uses it, but more operations may leverage it in the future. Also add a mask, IO_APOLL_MULTI_POLLED, which stands for REQ_F_APOLL_MULTI | REQ_F_POLLED, to make the code shorter and cleaner.

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hao Xu authored
Add an accept flag, IORING_ACCEPT_MULTISHOT, for accept, to support multishot mode.

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-2-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 13 May, 2022 8 commits
-
-
Dylan Yudaken authored
The check for waking up a request compares the poll_t bits; however, these will always contain some common flags, so the check always wakes up. For files with single wait queues, such as sockets, this can cause the request to be sent to the async worker unnecessarily. Further, if it is non-blocking, it will complete the request with EAGAIN, which is not desired. Exclude these common events here, making sure not to exclude POLLERR, which might be important.

Fixes: d7718a9d ("io_uring: use poll driven retry for files that support it")
Signed-off-by:
Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
If an opcode handler semi-reliably returns -EAGAIN, io_wq_submit_work() might continue to busily hammer the same handler over and over again, which is not ideal. The -EAGAIN handling in question was put there only for IOPOLL, so restrict it to IOPOLL mode only, where there is no other recourse than to retry as we cannot wait.

Fixes: def596e9 ("io_uring: support for IO polling")
Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Currently, to set up a fully sparse descriptor space upfront, the app needs to allocate an array of the full size, memset it to -1, and then pass that in. Make this a bit easier by allowing a flag that simply does this internally rather than needing to copy each slot separately. This works with IORING_REGISTER_FILES2, as the flag is set in struct io_uring_rsrc_register, and is only allowed when the type is IORING_RSRC_FILE, as this doesn't make sense for registered buffers.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
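A sketch of the raw call under these assumptions (struct layout per the patched io_uring.h; liburing also wraps this as io_uring_register_files_sparse()):

	#include <linux/io_uring.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int register_sparse_files(int ring_fd, unsigned nr)
	{
		struct io_uring_rsrc_register reg = {
			.nr = nr,	/* size of the fixed file table */
			.flags = IORING_RSRC_REGISTER_SPARSE,	/* kernel fills each slot with -1 */
		};

		return syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_FILES2, &reg, sizeof(reg));
	}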
-
Jens Axboe authored
We currently limit these to 32K, but since we're now backing the table space with vmalloc when needed, there's no reason why we can't make them bigger. The total space is limited by RLIMIT_NOFILE as well.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space, and it also allows multi-shot support to work with fixed files. Normal accept direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocate the direct descriptor, then the direct descriptor is returned in case of success.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
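A minimal sketch, assuming liburing's io_uring_prep_accept_direct() helper and the IORING_FILE_INDEX_ALLOC sentinel:

	#include <liburing.h>

	/* IORING_FILE_INDEX_ALLOC asks the kernel to pick a free fixed-file
	 * slot; on success, cqe->res holds the allocated slot index. */
	static void arm_accept_direct_alloc(struct io_uring *ring, int listen_fd)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_accept_direct(sqe, listen_fd, NULL, NULL, 0,
					    IORING_FILE_INDEX_ALLOC);
	}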
-
Jens Axboe authored
If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space. Normal open direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocate the direct descriptor, then the direct descriptor is returned in case of success.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
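A matching sketch for open, again assuming liburing's io_uring_prep_openat_direct() helper:

	#include <fcntl.h>
	#include <liburing.h>

	/* Open path straight into a kernel-chosen fixed-file slot; the
	 * slot index is returned in cqe->res on success. */
	static void arm_open_direct_alloc(struct io_uring *ring, const char *path)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0,
					    IORING_FILE_INDEX_ALLOC);
	}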
-
Jens Axboe authored
Applications currently always pick where they want fixed files to go. In preparation for allowing these types of commands with multishot support, add a basic allocator in the fixed file table. Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
In preparation for adding a basic allocator for direct descriptors, add helpers that set/clear whether a file slot is used. Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
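A hedged sketch of what such helpers can look like (names and fields illustrative, not necessarily the exact kernel code):

	static void io_file_bitmap_set(struct io_file_table *table, int bit)
	{
		__set_bit(bit, table->bitmap);
		table->alloc_hint = bit + 1;	/* start the next search after this slot */
	}

	static void io_file_bitmap_clear(struct io_file_table *table, int bit)
	{
		__clear_bit(bit, table->bitmap);
		table->alloc_hint = bit;	/* a freed slot is a good next candidate */
	}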
-
- 09 May, 2022 11 commits
-
-
Jens Axboe authored
It's not needed as the REQ_F_BUFFER_SELECTED flag tracks the state of whether or not kbuf is valid, so just drop it. Suggested-by:
Dylan Yudaken <dylany@fb.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We have io_kiocb->buf_index which is used for either fixed buffers, or for provided buffers. For the latter, it's used to hold the buffer group ID for buffer selection. Post selection, req->kbuf->bid is used to get the buffer ID. Store the buffer ID, when selected, in req->buf_index. If we do end up recycling the buffer, reset it back to the buffer group ID. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
The timeout and other items that follow are less hot, so let's move the provided buffer state above that. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
These are mutually exclusive - if you use provided buffers, then you cannot use fixed buffers and vice versa. Move them into the same spot in the io_kiocb, which is also advantageous for provided buffers as they get near the submit side hot cacheline. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
In preparation for providing another way to select a buffer, move the existing logic into a helper. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Callers already have room to store the addr and length information; clean it up by having the caller just assign the previously provided data.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Use a plain array for any group ID that's less than 64, and punt anything beyond that to an xarray. 64 fits in a page even for 4KB page sizes and with the planned additions. This makes the expected group usage faster by avoiding a hash and lookup to find our list, and it uses less memory upfront by not allocating any memory for provided buffers unless it's actually being used. Suggested-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
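A hedged sketch of the described lookup (BGID_ARRAY and the field names are illustrative):

	#define BGID_ARRAY	64

	static struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
							 unsigned int bgid)
	{
		/* small group IDs hit the flat per-ctx array... */
		if (ctx->io_bl && bgid < BGID_ARRAY)
			return &ctx->io_bl[bgid];
		/* ...anything else falls back to the xarray */
		return xa_load(&ctx->io_bl_xa, bgid);
	}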
-
Jens Axboe authored
The read/write opcodes use it already, but recv/recvmsg do not. If we switch them over and read and validate this at init time while we're checking if the opcode supports it anyway, then we can do it in one spot and we don't have to pass in a separate group ID for io_buffer_select().

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
There's no point in validity checking buf_index if the request doesn't have REQ_F_BUFFER_SELECT set, as we will never use it for that case. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
After the recent changes, this is a direct call to io_buffer_select() anyway. With this change, there are no wrappers left for provided buffer selection.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
There's no point in having callers provide a kbuf, we're just returning the address anyway. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 05 May, 2022 3 commits
-
-
Jens Axboe authored
It's just a thin wrapper around io_buffer_select(); get rid of it.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable, yet some cases still use req->sr_msg, which sr points to. Use 'sr' consistently.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg, then we arm poll first rather than attempt a receive or send upfront. This can be useful if we expect there to be no data (or space) available for the request, as we can then avoid wasting time on the initial issue attempt. Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
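A minimal sketch for recv; for send/recv requests this flag is carried in sqe->ioprio:

	#include <liburing.h>

	/* Arm poll first and only attempt the receive once the socket is
	 * actually readable, skipping the (likely -EAGAIN) initial issue. */
	static void arm_recv_poll_first(struct io_uring *ring, int sock,
					void *buf, size_t len)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_recv(sqe, sock, buf, len, 0);
		sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
	}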
-