- 19 May, 2022 1 commit
-
-
Vasily Averin authored
Currently 'make C=1 fs/io_uring.o' generates a sparse warning:

  CHECK   fs/io_uring.c
  fs/io_uring.c: note: in included file (through include/trace/trace_events.h, include/trace/define_trace.h, include/trace/events/io_uring.h):
  ./include/trace/events/io_uring.h:488:1: warning: incorrect type in assignment (different base types)
     expected unsigned int [usertype] op_flags
     got restricted __kernel_rwf_t const [usertype] rw_flags

This happens on the cast of sqe->rw_flags, which is defined as __kernel_rwf_t: the type is bitwise and requires a __force attribute for any cast. However, rw_flags is a member of a union, and its access can safely be replaced by using one of its neighbours, so let's use poll32_events to fix the sparse warning.

Signed-off-by:
Vasily Averin <vvs@openvz.org>
Link: https://lore.kernel.org/r/6f009241-a63f-ae43-a04b-62841aaef293@openvz.org
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
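An illustrative sketch of the bitwise-type rule (the union and function below are stand-ins, not the actual trace-event code): __kernel_rwf_t is declared as 'typedef int __bitwise __kernel_rwf_t', so sparse rejects any implicit conversion out of it, while a plain-typed union neighbour aliases the same storage without needing any annotation:

	union sqe_flags {			/* stand-in for the sqe union members */
		__kernel_rwf_t	rw_flags;	/* bitwise: needs (__force u32) to cast out */
		__u32		poll32_events;	/* plain integer: no annotation needed */
	};

	static u32 get_op_flags(const union sqe_flags *s)
	{
		/* return s->rw_flags;		sparse: different base types */
		return s->poll32_events;	/* what the fix uses instead */
	}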
-
- 18 May, 2022 13 commits
-
-
Jens Axboe authored
It's nonsensical to register a provided buffer ring if a classic provided buffer group with the same ID exists. Depending on the order in which we decide what type to pick, the other type will never get used. Explicitly disallow it and return an error if this is attempted.

Fixes: c7fb1942 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We use ->buf_pages != 0 to tell if this is a shared buffer ring or a classic provided buffer group. If we unregister the shared ring and then attempt to use it, buf_pages is zero yet the classic list head isn't properly initialized. This causes io_buffer_select() to think that we have classic buffers available, but then we crash when we try to get one from the list. Just initialize the list if we unregister a shared buffer ring, leaving it in a sane state for either re-registration or for attempting to use it. And do the same for the initial setup from the classic path.

Fixes: c7fb1942 ("io_uring: add support for ring mapped supplied buffers")
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
Honour IORING_RSRC_REGISTER_SPARSE not only for direct files but for fixed buffers as well. It makes the rsrc API more consistent.

Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/66f429e4912fe39fb3318217ff33a2853d4544be.1652879898.git.asml.silence@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
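A one-call sketch, assuming liburing's io_uring_register_buffers_sparse() wrapper for this flag:

	#include <liburing.h>

	/* Reserve 64 empty fixed-buffer slots; individual slots can later
	 * be populated with io_uring_register_buffers_update_tag(). */
	static int setup_sparse_buffers(struct io_uring *ring)
	{
		return io_uring_register_buffers_sparse(ring, 64);
	}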
-
Christoph Hellwig authored
Accessing the file table needs an rcu_dereference_protected().

Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-7-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
POLL* are unannotated values for the userspace ABI, while everything in-kernel should use EPOLL* and the __poll_t type. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-6-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
apoll_events is fed to vfs_poll and the poll tables, so it should be a __poll_t. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-5-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
io_file_get_normal isn't marked inline, so don't claim it as such in the forward declaration. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-4-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Christoph Hellwig authored
ERR_PTR abuses the high bits of a pointer to transport error information. This is only safe for kernel pointers and not user pointers. Fix io_buffer_select and its helpers to just return NULL for failure and get rid of this abuse. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-3-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
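For context, a sketch of why this trick is only sound for kernel pointers (constants as in include/linux/err.h):

	#define MAX_ERRNO	4095
	#define IS_ERR_VALUE(x)	((unsigned long)(x) >= (unsigned long)-MAX_ERRNO)

	/* Kernel allocations never land in the top MAX_ERRNO bytes of the
	 * address space, so encoding -ENOMEM etc. as a pointer value is
	 * unambiguous. A user pointer is caller-chosen and may fall in
	 * exactly that range, so IS_ERR() on it can misfire; returning
	 * NULL on failure avoids the ambiguity entirely. */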
-
Christoph Hellwig authored
Use the proper type. Signed-off-by:
Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220518084005.3255380-2-hch@lst.de
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Provided buffers allow an application to supply io_uring with buffers that can then be grabbed for a read/receive request, when the data source is ready to deliver data. The existing scheme relies on using IORING_OP_PROVIDE_BUFFERS to do that, but it can be difficult to use in real world applications. It's pretty efficient if the application is able to supply back batches of provided buffers when they have been consumed and the application is ready to recycle them, but if fragmentation occurs in the buffer space, it can become difficult to supply enough buffers at a time. This hurts efficiency.

Add a register op, IORING_REGISTER_PBUF_RING, which allows an application to set up a shared queue for each buffer group of provided buffers. The application can then supply buffers simply by adding them to this ring, and the kernel can consume them just as easily. The ring shares the head with the application, the tail remains private in the kernel.

Provided buffers set up with IORING_REGISTER_PBUF_RING cannot use IORING_OP_{PROVIDE,REMOVE}_BUFFERS for adding or removing entries to the ring, they must use the mapped ring. Mapped provided buffer rings can co-exist with normal provided buffers, just not within the same group ID.

To gauge the overhead of the existing scheme and evaluate the mapped ring approach, a simple NOP benchmark was written. It uses a ring of 128 entries, and submits/completes 32 at a time. 'Replenish' is how many buffers are provided back at a time after they have been consumed:

Test			Replenish	NOPs/sec
================================================================
No provided buffers	NA		~30M
Provided buffers	32		~16M
Provided buffers	1		~10M
Ring buffers		32		~27M
Ring buffers		1		~27M

The ring mapped buffers perform almost as well as not using provided buffers at all, and they don't care if you provide 1 or more back at the same time. This means applications can just replenish as they go, rather than need to batch and compact, further reducing overhead in the application. The NOP benchmark above doesn't need to do any compaction, so that overhead isn't even reflected in the above test.

Co-developed-by:
Dylan Yudaken <dylany@fb.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
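A minimal liburing-based sketch of how userspace drives this, assuming liburing's io_uring_register_buf_ring() and io_uring_buf_ring_*() wrappers around the raw IORING_REGISTER_PBUF_RING op; error handling trimmed:

	#include <liburing.h>
	#include <stdlib.h>

	#define BGID		1
	#define ENTRIES		128	/* must be a power of 2 */
	#define BUF_SIZE	4096

	/* Register a ring of ENTRIES provided buffers for group BGID and
	 * hand every buffer to the kernel in one go. */
	static struct io_uring_buf_ring *setup_buf_ring(struct io_uring *ring,
							char *bufs)
	{
		struct io_uring_buf_reg reg = { .bgid = BGID,
						.ring_entries = ENTRIES };
		struct io_uring_buf_ring *br;
		int i;

		/* the ring memory is allocated by the application */
		if (posix_memalign((void **) &br, 4096,
				   ENTRIES * sizeof(struct io_uring_buf)))
			return NULL;
		reg.ring_addr = (unsigned long) br;
		if (io_uring_register_buf_ring(ring, &reg, 0))
			return NULL;

		io_uring_buf_ring_init(br);
		for (i = 0; i < ENTRIES; i++)
			io_uring_buf_ring_add(br, bufs + i * BUF_SIZE, BUF_SIZE, i,
					      io_uring_buf_ring_mask(ENTRIES), i);
		/* publish all ENTRIES buffers to the kernel at once */
		io_uring_buf_ring_advance(br, ENTRIES);
		return br;
	}

Once a completion has consumed a buffer, re-supplying it is just another io_uring_buf_ring_add() plus a single io_uring_buf_ring_advance(), which is the cheap per-buffer replenish measured in the 'Ring buffers / 1' row above.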
-
Jens Axboe authored
Abstract this out from io_sqe_buffer_register() so we can use it elsewhere too without duplicating this code. No intended functional changes in this patch. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Obviously not really useful since it's not transferring data, but it is helpful in benchmarking overhead of provided buffers. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
io_provided_buffer_select() must drop the submit lock, if needed, even in the error handling case. Failure to do so will leave us with the ctx->uring_lock held, causing spew like:

====================================
WARNING: iou-wrk-366/368 still has locks held!
5.18.0-rc6-00294-gdf8dc7004331 #994 Not tainted
------------------------------------
1 lock held by iou-wrk-366/368:
 #0: ffff0000c72598a8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_ring_submit_lock+0x20/0x48

stack backtrace:
CPU: 4 PID: 368 Comm: iou-wrk-366 Not tainted 5.18.0-rc6-00294-gdf8dc7004331 #994
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace.part.0+0xa4/0xd4
 show_stack+0x14/0x5c
 dump_stack_lvl+0x88/0xb0
 dump_stack+0x14/0x2c
 debug_check_no_locks_held+0x84/0x90
 try_to_freeze.isra.0+0x18/0x44
 get_signal+0x94/0x6ec
 io_wqe_worker+0x1d8/0x2b4
 ret_from_fork+0x10/0x20

and triggering later hangs off get_signal() because we attempt to re-grab the lock.

Reported-by: syzbot+987d7bb19195ae45208c@syzkaller.appspotmail.com
Fixes: 149c69b0 ("io_uring: abstract out provided buffer list selection")
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 14 May, 2022 4 commits
-
-
Hao Xu authored
Refactor io_accept() to support multishot mode.

Theoretical analysis:

1) when connections come in fast
   - singleshot:
         add accept sqe(userspace) --> accept inline
                     ^                 |
                     |-----------------|
   - multishot:
         add accept sqe(userspace) --> accept inline
                                       ^     |
                                       |--*--|
     we do accept repeatedly in * place until we get EAGAIN

2) when connections come in at low pressure
   similar to 1), we reduce a lot of userspace-kernel context switches and useless vfs_poll() calls

Tests:
Did some tests, which go this way:

  server    client(multiple)
  accept    connect
  read      write
  write     read
  close     close

Basically, raise up a number of clients (on the same machine as the server) to connect to the server, and then write some data to it. The server will write those data back to the client after it receives them, and then close the connection after the write returns. Then the client will read the data and then close the connection. Here I test 10000 clients connecting to one server, data size 128 bytes. And each client has a goroutine for it, so they come to the server in a short time.

Tested 20 times before/after this patchset, time spent (unit: cycle, which is the return value of clock()):

before:
  (1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
  +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
  +1934226+1914385)/20.0 = 1927633.75

after:
  (1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
  +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
  +1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-5-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
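A hedged userspace sketch, assuming liburing's io_uring_prep_multishot_accept() wrapper for this mode:

	#include <liburing.h>

	/* One SQE arms the kernel to post a CQE per incoming connection;
	 * the request stays active while IORING_CQE_F_MORE is set in
	 * cqe->flags, and cqe->res carries the accepted fd (or -errno). */
	static void arm_multishot_accept(struct io_uring *ring, int listen_fd)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
		io_uring_submit(ring);
	}

If a CQE arrives without IORING_CQE_F_MORE, the multishot request has terminated and must be re-armed with a fresh SQE.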
-
Hao Xu authored
For operations like accept, multishot is a useful feature, since we can reduce the number of accept SQEs. Let's integrate it into fast poll; it may be useful for other operations in the future.

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-4-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hao Xu authored
Add a flag to indicate multishot mode for fast poll. Currently only accept uses it, but more operations may leverage it in the future. Also add a mask, IO_APOLL_MULTI_POLLED, which stands for REQ_F_APOLL_MULTI | REQ_F_POLLED, to make the code shorter and cleaner.

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-3-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Hao Xu authored
Add an accept flag, IORING_ACCEPT_MULTISHOT, for accept, to support multishot mode.

Signed-off-by:
Hao Xu <howeyxu@tencent.com>
Link: https://lore.kernel.org/r/20220514142046.58072-2-haoxu.linux@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 13 May, 2022 8 commits
-
-
Dylan Yudaken authored
The check for waking up a request compares the poll_t bits; however, these will always contain some common flags, so the check always wakes up. For files with single wait queues, such as sockets, this can cause the request to be sent to the async worker unnecessarily. Further, if it is non-blocking, it will complete the request with EAGAIN, which is not desired. Exclude these common events here, making sure not to exclude POLLERR, which might be important.

Fixes: d7718a9d ("io_uring: use poll driven retry for files that support it")
Signed-off-by:
Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220512091834.728610-3-dylany@fb.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Pavel Begunkov authored
If an opcode handler semi-reliably returns -EAGAIN, io_wq_submit_work() might continue to busily hammer the same handler over and over again, which is not ideal. The -EAGAIN handling in question was put there only for IOPOLL, so restrict it to IOPOLL mode only, where there is no other recourse than to retry as we cannot wait.

Fixes: def596e9 ("io_uring: support for IO polling")
Signed-off-by:
Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f168b4f24181942f3614dd8ff648221736f572e6.1652433740.git.asml.silence@gmail.com
Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Currently, to set up a fully sparse descriptor space upfront, the app needs to allocate an array of the full size, memset it to -1, and then pass that in. Make this a bit easier by allowing a flag that simply does this internally rather than needing to copy each slot separately. This works with IORING_REGISTER_FILES2, as the flag is set in struct io_uring_rsrc_register, and is only allowed when the type is IORING_RSRC_FILE, as this doesn't make sense for registered buffers.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
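A sketch of the raw call under these assumptions (struct layout per the patched io_uring.h; liburing also wraps this as io_uring_register_files_sparse()):

	#include <linux/io_uring.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int register_sparse_files(int ring_fd, unsigned nr)
	{
		struct io_uring_rsrc_register reg = {
			.nr = nr,	/* size of the fixed file table */
			.flags = IORING_RSRC_REGISTER_SPARSE,	/* kernel fills each slot with -1 */
		};

		return syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_FILES2, &reg, sizeof(reg));
	}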
-
Jens Axboe authored
We currently limit these to 32K, but since we're now backing the table space with vmalloc when needed, there's no reason why we can't make them bigger. The total space is limited by RLIMIT_NOFILE as well.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space, and it also allows multi-shot support to work with fixed files. Normal accept direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocate the direct descriptor, then the direct descriptor is returned in case of success.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
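A minimal sketch, assuming liburing's io_uring_prep_accept_direct() helper and the IORING_FILE_INDEX_ALLOC sentinel:

	#include <liburing.h>

	/* IORING_FILE_INDEX_ALLOC asks the kernel to pick a free fixed-file
	 * slot; on success, cqe->res holds the allocated slot index. */
	static void arm_accept_direct_alloc(struct io_uring *ring, int listen_fd)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_accept_direct(sqe, listen_fd, NULL, NULL, 0,
					    IORING_FILE_INDEX_ALLOC);
	}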
-
Jens Axboe authored
If the application passes in IORING_FILE_INDEX_ALLOC as the file_slot, then that's a hint to allocate a fixed file descriptor rather than have one be passed in directly. This can be useful for having io_uring manage the direct descriptor space. Normal open direct requests will complete with 0 for success, and < 0 in case of error. If io_uring is asked to allocate the direct descriptor, then the direct descriptor is returned in case of success.

Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
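A matching sketch for open, again assuming liburing's io_uring_prep_openat_direct() helper:

	#include <fcntl.h>
	#include <liburing.h>

	/* Open path straight into a kernel-chosen fixed-file slot; the
	 * slot index is returned in cqe->res on success. */
	static void arm_open_direct_alloc(struct io_uring *ring, const char *path)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_openat_direct(sqe, AT_FDCWD, path, O_RDONLY, 0,
					    IORING_FILE_INDEX_ALLOC);
	}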
-
Jens Axboe authored
Applications currently always pick where they want fixed files to go. In preparation for allowing these types of commands with multishot support, add a basic allocator in the fixed file table. Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
In preparation for adding a basic allocator for direct descriptors, add helpers that set/clear whether a file slot is used. Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
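A hedged sketch of what such helpers can look like (names and fields illustrative, not necessarily the exact kernel code):

	static void io_file_bitmap_set(struct io_file_table *table, int bit)
	{
		__set_bit(bit, table->bitmap);
		table->alloc_hint = bit + 1;	/* start the next search after this slot */
	}

	static void io_file_bitmap_clear(struct io_file_table *table, int bit)
	{
		__clear_bit(bit, table->bitmap);
		table->alloc_hint = bit;	/* a freed slot is a good next candidate */
	}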
-
- 09 May, 2022 11 commits
-
-
Jens Axboe authored
It's not needed as the REQ_F_BUFFER_SELECTED flag tracks the state of whether or not kbuf is valid, so just drop it. Suggested-by:
Dylan Yudaken <dylany@fb.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
We have io_kiocb->buf_index which is used for either fixed buffers, or for provided buffers. For the latter, it's used to hold the buffer group ID for buffer selection. Post selection, req->kbuf->bid is used to get the buffer ID. Store the buffer ID, when selected, in req->buf_index. If we do end up recycling the buffer, reset it back to the buffer group ID. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
The timeout and other items that follow are less hot, so let's move the provided buffer state above that. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
These are mutually exclusive - if you use provided buffers, then you cannot use fixed buffers and vice versa. Move them into the same spot in the io_kiocb, which is also advantageous for provided buffers as they get near the submit side hot cacheline. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
In preparation for providing another way to select a buffer, move the existing logic into a helper. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Callers already have room to store the addr and length information; clean it up by having the caller just assign the previously provided data.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
Use a plain array for any group ID that's less than 64, and punt anything beyond that to an xarray. 64 fits in a page even for 4KB page sizes and with the planned additions. This makes the expected group usage faster by avoiding a hash and lookup to find our list, and it uses less memory upfront by not allocating any memory for provided buffers unless it's actually being used. Suggested-by:
Pavel Begunkov <asml.silence@gmail.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
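A hedged sketch of the described lookup (BGID_ARRAY and the field names are illustrative):

	#define BGID_ARRAY	64

	static struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
							 unsigned int bgid)
	{
		/* small group IDs hit the flat per-ctx array... */
		if (ctx->io_bl && bgid < BGID_ARRAY)
			return &ctx->io_bl[bgid];
		/* ...anything else falls back to the xarray */
		return xa_load(&ctx->io_bl_xa, bgid);
	}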
-
Jens Axboe authored
The read/write opcodes use it already, but recv/recvmsg do not. If we switch them over and read and validate this at init time while we're checking if the opcode supports it anyway, then we can do it in one spot and we don't have to pass in a separate group ID for io_buffer_select().

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
There's no point in validity checking buf_index if the request doesn't have REQ_F_BUFFER_SELECT set, as we will never use it for that case. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
After the recent changes, this is a direct call to io_buffer_select() anyway. With this change, there are no wrappers left for provided buffer selection.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
There's no point in having callers provide a kbuf, we're just returning the address anyway. Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
- 05 May, 2022 3 commits
-
-
Jens Axboe authored
It's just a thin wrapper around io_buffer_select(); get rid of it.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
For all of send/sendmsg and recv/recvmsg we have the local 'sr' variable, yet some cases still use req->sr_msg, which sr points to. Use 'sr' consistently.

Signed-off-by:
Jens Axboe <axboe@kernel.dk>
-
Jens Axboe authored
If IORING_RECVSEND_POLL_FIRST is set for recv/recvmsg or send/sendmsg, then we arm poll first rather than attempt a receive or send upfront. This can be useful if we expect there to be no data (or space) available for the request, as we can then avoid wasting time on the initial issue attempt. Reviewed-by:
Hao Xu <howeyxu@tencent.com> Signed-off-by:
Jens Axboe <axboe@kernel.dk>
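A minimal sketch for recv; for send/recv requests this flag is carried in sqe->ioprio:

	#include <liburing.h>

	/* Arm poll first and only attempt the receive once the socket is
	 * actually readable, skipping the (likely -EAGAIN) initial issue. */
	static void arm_recv_poll_first(struct io_uring *ring, int sock,
					void *buf, size_t len)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_recv(sqe, sock, buf, len, 0);
		sqe->ioprio |= IORING_RECVSEND_POLL_FIRST;
	}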
-