1. 10 May, 2024 2 commits
  2. 09 May, 2024 2 commits
    • io_uring/net: add IORING_ACCEPT_POLL_FIRST flag · d3da8e98
      Jens Axboe authored
      Similarly to how polling first is supported for receive, it makes sense
      to provide the same for accept. An accept operation does a lot of
      expensive setup, like allocating an fd, a socket/inode, etc. If no
      connection request is already pending, this is wasted and will just be
      cleaned up and freed, only to retry via the usual poll trigger.
      
      Add IORING_ACCEPT_POLL_FIRST, which tells accept to only initiate the
      accept request if poll says we have something to accept.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
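The poll-first idea can be illustrated outside io_uring with plain POSIX sockets. The sketch below is a userspace analogue only, not the kernel implementation and not the io_uring API: `make_listener()` and `accept_poll_first()` are hypothetical helpers, and the point is simply that accept() is not attempted until poll() reports a pending connection.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: TCP listener on 127.0.0.1 with a kernel-chosen port.
 * Returns the listening fd or -1; the chosen port is stored in *port. */
static int make_listener(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;

    assert(port != NULL);
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the kernel pick a free port */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Poll first: only call accept() once poll() says a connection is pending.
 * Returns the accepted fd, -EAGAIN if nothing became pending within
 * timeout_ms, or another negative errno value on failure. */
static int accept_poll_first(int lfd, int timeout_ms)
{
    struct pollfd pfd = { .fd = lfd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -errno;
    if (ret == 0 || !(pfd.revents & POLLIN))
        return -EAGAIN; /* nothing to accept, so no accept attempt is made */
    ret = accept(lfd, NULL, NULL);
    return ret < 0 ? -errno : ret;
}
```

With no client connected, `accept_poll_first(lfd, 0)` returns -EAGAIN without ever calling accept(); once a client has connected, the same call returns a new descriptor.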
    • io_uring/net: add IORING_ACCEPT_DONTWAIT flag · 7dcc758c
      Jens Axboe authored
      This allows the caller to perform a non-blocking attempt, similarly to
      how recvmsg has MSG_DONTWAIT. If set, and we get -EAGAIN on a connection
      attempt, propagate the result to userspace rather than arm poll and
      wait for a retry.
      Suggested-by: Norman Maurer <norman_maurer@apple.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
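For comparison, the same "try once, report -EAGAIN" behaviour can be sketched with ordinary non-blocking sockets. This is a userspace analogue under stated assumptions, not the io_uring code path: `open_loopback_listener()` and `accept_dontwait()` are hypothetical helpers, and O_NONBLOCK on the listening socket stands in for the new flag.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: TCP listener on 127.0.0.1, kernel-chosen port. */
static int open_loopback_listener(void)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* A single non-blocking accept attempt. If no connection is pending, the
 * -EAGAIN result is handed straight back to the caller instead of waiting.
 * Returns the accepted fd or a negative errno value. */
static int accept_dontwait(int lfd)
{
    int flags, ret;

    assert(lfd >= 0);
    flags = fcntl(lfd, F_GETFL, 0);
    if (flags < 0)
        return -errno;
    if (fcntl(lfd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -errno;
    ret = accept(lfd, NULL, NULL);
    if (ret < 0) {
        /* EWOULDBLOCK may be a distinct value on some systems */
        return errno == EWOULDBLOCK ? -EAGAIN : -errno;
    }
    return ret;
}
```

The caller decides what to do with -EAGAIN, for example retrying later or falling back to a blocking or poll-driven accept.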
  3. 08 May, 2024 1 commit
  4. 07 May, 2024 1 commit
    • io_uring/io-wq: Use set_bit() and test_bit() at worker->flags · 8a565304
      Breno Leitao authored
      Utilize set_bit() and test_bit() on worker->flags within io_uring/io-wq
      to address potential data races.
      
      The io_worker->flags field may be accessed through various data
      paths, leading to concurrency issues. When KCSAN is enabled, it reveals
      data races occurring in the io_worker_handle_work() and
      io_wq_activate_free_worker() functions.
      
      	 BUG: KCSAN: data-race in io_worker_handle_work / io_wq_activate_free_worker
      	 write to 0xffff8885c4246404 of 4 bytes by task 49071 on cpu 28:
      	 io_worker_handle_work (io_uring/io-wq.c:434 io_uring/io-wq.c:569)
      	 io_wq_worker (io_uring/io-wq.c:?)
      <snip>
      
      	 read to 0xffff8885c4246404 of 4 bytes by task 49024 on cpu 5:
      	 io_wq_activate_free_worker (io_uring/io-wq.c:? io_uring/io-wq.c:285)
      	 io_wq_enqueue (io_uring/io-wq.c:947)
      	 io_queue_iowq (io_uring/io_uring.c:524)
      	 io_req_task_submit (io_uring/io_uring.c:1511)
      	 io_handle_tw_list (io_uring/io_uring.c:1198)
      <snip>
      
      Line numbers against commit 18daea77 ("Merge tag 'for-linus' of
      git://git.kernel.org/pub/scm/virt/kvm/kvm").
      
      These races involve writes and reads to the same memory location by
      different tasks running on different CPUs. To mitigate this, refactor
      the code to use atomic operations such as set_bit(), test_bit(), and
      clear_bit() instead of basic "and" and "or" operations. This ensures
      thread-safe manipulation of worker flags.
      
      Also, move `create_index` to avoid holes in the structure.
      Signed-off-by: Breno Leitao <leitao@debian.org>
      Link: https://lore.kernel.org/r/20240507170002.2269003-1-leitao@debian.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
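The difference between plain "and"/"or" updates and atomic bit operations can be sketched in userspace with C11 atomics. This is only a model under stated assumptions: the kernel uses its own set_bit()/clear_bit()/test_bit() helpers, while the `worker_*` functions, the `struct worker` layout, and the bit names below are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers (not bit masks), standing in for worker flags. */
enum { WORKER_F_UP = 0, WORKER_F_RUNNING = 1, WORKER_F_FREE = 2 };

struct worker {
    atomic_ulong flags; /* every access goes through an atomic operation */
};

/* Atomically set bit nr; concurrent updates to other bits are not lost. */
static void worker_set_bit(struct worker *w, unsigned int nr)
{
    assert(nr < 8 * sizeof(unsigned long));
    atomic_fetch_or(&w->flags, 1UL << nr);
}

/* Atomically clear bit nr. */
static void worker_clear_bit(struct worker *w, unsigned int nr)
{
    assert(nr < 8 * sizeof(unsigned long));
    atomic_fetch_and(&w->flags, ~(1UL << nr));
}

/* Read the flags word atomically and test bit nr. */
static bool worker_test_bit(struct worker *w, unsigned int nr)
{
    assert(nr < 8 * sizeof(unsigned long));
    return (atomic_load(&w->flags) & (1UL << nr)) != 0;
}
```

A non-atomic `w->flags |= mask` is a separate load, modify, and store, so two CPUs updating different bits can overwrite each other's change; the read-modify-write operations above avoid that.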
  5. 01 May, 2024 2 commits
    • io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring · 59b28a6e
      Jens Axboe authored
      Move the posting outside the checking and locking; it's cleaner that
      way.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: Require zeroed sqe->len on provided-buffers send · 79996b45
      Gabriel Krisman Bertazi authored
      When sending from a provided buffer, we set sr->len to the smaller of
      the actual buffer size and sqe->len.  But now that we disconnect the
      buffer from the submission request, we can get into a situation where
      the buffers and requests mismatch, and only part of a buffer gets
      sent.  Assume:
      
      * buf[1]->len = 128; buf[2]->len = 256
      * sqe[1]->len = 128; sqe[2]->len = 256
      
      If sqe[1] runs first, it picks buf[1] and all is good. But if sqe[2]
      runs first, it picks buf[1], sqe[1] then picks buf[2], and the last
      half of buf[2] is never sent.
      
      While arguably the use-case of different-length sends is questionable,
      it has already raised confusion with potential users of this
      feature. Let's make the interface less tricky by forcing the length to
      only come from the buffer ring entry itself.
      
      Fixes: ac5f71a3 ("io_uring/net: add provided buffer support for IORING_OP_SEND")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
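The mismatch described above can be reproduced with a small model. This is an illustrative sketch, not kernel code: `old_send_len()` and `unsent_bytes()` are hypothetical helpers, and the model assumes each send that runs simply takes the next buffer in ring order.

```c
#include <assert.h>
#include <stddef.h>

/* Behaviour before this change: length = min(buffer size, sqe->len). */
static size_t old_send_len(size_t buf_len, size_t sqe_len)
{
    return sqe_len < buf_len ? sqe_len : buf_len;
}

/* Run n sends in the given issue order. The i-th send to run takes the
 * i-th buffer from the ring, regardless of which sqe it came from.
 * use_sqe_len != 0 models the old behaviour; 0 models taking the length
 * only from the buffer ring entry. Returns the bytes left unsent. */
static size_t unsent_bytes(const size_t *buf_len, const size_t *sqe_len,
                           const int *order, int n, int use_sqe_len)
{
    size_t unsent = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        size_t blen = buf_len[i];
        size_t slen = use_sqe_len ? old_send_len(blen, sqe_len[order[i]])
                                  : blen;
        unsent += blen - slen;
    }
    return unsent;
}
```

With buffers of 128 and 256 bytes and matching sqe lengths, issuing the sends in order leaves nothing unsent, while the reversed order leaves 128 bytes of the second buffer behind under the old rule and nothing under the new one.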
  6. 30 Apr, 2024 2 commits
  7. 26 Apr, 2024 1 commit
  8. 25 Apr, 2024 1 commit
    • io_uring/rw: reinstate thread check for retries · 039a2e80
      Jens Axboe authored
      Allowing retries for everything is arguably the right thing to do, now
      that every command type is async ready from the start. But it has
      exposed a few issues around a missing check for a retry (which
      cca65713 exposed), and the fixup commit for that isn't necessarily
      100% sound in terms of iov_iter state.
      
      For now, just revert these two commits. This unfortunately means that
      -EAGAIN can again get bubbled up to userspace in some cases where
      the kernel could very well just sanely retry. But until we have all
      the conditions around that covered, we cannot safely enable it.
      
      This reverts commit df604d2a.
      This reverts commit cca65713.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 23 Apr, 2024 3 commits
  10. 22 Apr, 2024 7 commits
    • net: add callback for setting a ubuf_info to skb · 65bada80
      Pavel Begunkov authored
      At the moment an skb can only have one ubuf_info associated with it,
      which might be a performance problem for zerocopy sends in cases like
      TCP via io_uring. Add a callback for assigning ubuf_info to an skb, so
      that smarter assignment, like linking ubuf_info together, can be
      implemented later.
      
      Note that it's an optional callback, and it should be compatible with
      skb_zcopy_set(), because the net stack might decide to clone an skb
      and take another reference to ubuf_info whenever it wishes. Also, a
      correct implementation should always be able to bind to an skb
      without a prior ubuf_info, otherwise we could end up in a situation
      where the send would not be able to progress.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/all/b7918aadffeb787c84c9e72e34c729dc04f3a45d.1713369317.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: extend ubuf_info callback to ops structure · 7ab4f16f
      Pavel Begunkov authored
      We'll need to associate additional callbacks with ubuf_info, so
      introduce a structure holding ubuf_info callbacks. Apart from the
      smarter io_uring notification management introduced in the next
      patches, it can be used to generalise msg_zerocopy_put_abort() and
      also store ->sg_from_iter, which is currently passed in struct msghdr.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/all/a62015541de49c0e2a8a0377a1d5d0a5aeb07016.1713369317.git.asml.silence@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • io_uring/net: support bundles for recv · 2f9c9515
      Jens Axboe authored
      If IORING_OP_RECV is used with provided buffers, the caller may also set
      IORING_RECVSEND_BUNDLE to turn it into a multi-buffer recv. This grabs
      buffers available and receives into them, posting a single completion for
      all of it.
      
      This can be used with multishot receive as well, or without it.
      
      Now that both send and receive support bundles, add a feature flag for
      it as well. If IORING_FEAT_RECVSEND_BUNDLE is set after registering the
      ring, then the kernel supports bundles for recv and send.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: support bundles for send · a05d1f62
      Jens Axboe authored
      If IORING_OP_SEND is used with provided buffers, the caller may also
      set IORING_RECVSEND_BUNDLE to turn it into a multi-buffer send. The idea
      is that an application can fill outgoing buffers in a provided buffer
      group, and then arm a single send that will service them all. Once
      there are no more buffers to send, or if the requested length has
      been sent, the request posts a single completion for all the buffers.
      
      This only enables it for IORING_OP_SEND, IORING_OP_SENDMSG is coming
      in a separate patch. However, this patch does do a lot of the prep
      work that makes wiring up the sendmsg variant pretty trivial. They
      share the prep side.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/kbuf: add helpers for getting/peeking multiple buffers · 35c8711c
      Jens Axboe authored
      Our provided buffer interface only allows selection of a single buffer.
      Add an API that allows getting/peeking multiple buffers at the same time.
      
      This is only implemented for the ring provided buffers. It could be added
      for the legacy provided buffers as well, but since it's strongly
      encouraged to use the new interface, let's keep it simpler and just
      provide it for the new API. The legacy interface will always just select
      a single buffer.
      
      There are two new main functions:
      
      io_buffers_select(), which selects as many buffers as it can. The
      caller supplies the iovec array, and io_buffers_select() may allocate a
      bigger array if the 'out_len' being passed in is non-zero and bigger
      than what fits in the provided iovec. Buffers grabbed with this helper
      are permanently assigned.
      
      io_buffers_peek(), which works like io_buffers_select(), except the
      buffers can be recycled, if needed. Callers using either of these
      functions should call io_put_kbufs() rather than io_put_kbuf() at
      completion time. The peek interface must be called with the ctx locked
      from peek to completion.
      
      This adds a state bit for the request:
      
      - REQ_F_BUFFERS_COMMIT, which means that the buffers have been
        peeked and should be committed to the buffer ring head when they are
        put as part of completion. Prior to this, req->buf_list was cleared to
        NULL when committed.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
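The peek-then-commit pattern can be modelled in userspace as follows. This is a simplified sketch and makes no claim about the kernel's actual signatures or data layout: `struct model_ring`, `model_peek()`, and `model_commit()` are hypothetical names, and a `max_len` of zero is assumed here to mean "no length limit".

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define RING_ENTRIES 8 /* power of two, so masking wraps the index */

struct model_buf {
    void *addr;
    size_t len;
};

struct model_ring {
    struct model_buf bufs[RING_ENTRIES];
    unsigned head; /* next buffer to consume */
    unsigned tail; /* one past the last buffer provided */
};

/* Map up to max_iovs available buffers into iov without consuming them.
 * If max_len is non-zero, stop once that many bytes are covered and
 * truncate the last segment. Returns the number of iovecs filled. */
static int model_peek(const struct model_ring *r, struct iovec *iov,
                      int max_iovs, size_t max_len)
{
    unsigned avail = r->tail - r->head;
    size_t left = max_len;
    int n = 0;

    assert(max_iovs >= 0);
    while ((unsigned)n < avail && n < max_iovs) {
        const struct model_buf *b =
            &r->bufs[(r->head + (unsigned)n) & (RING_ENTRIES - 1)];
        size_t len = b->len;

        if (max_len && len > left)
            len = left;
        iov[n].iov_base = b->addr;
        iov[n].iov_len = len;
        n++;
        if (max_len) {
            left -= len;
            if (!left)
                break;
        }
    }
    return n;
}

/* Commit nr previously peeked buffers by advancing the ring head. */
static void model_commit(struct model_ring *r, int nr)
{
    assert(nr >= 0 && (unsigned)nr <= r->tail - r->head);
    r->head += (unsigned)nr;
}
```

Because peeking leaves the head untouched, buffers that end up unused stay available; only the commit step consumes them.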
    • io_uring/net: add provided buffer support for IORING_OP_SEND · ac5f71a3
      Jens Axboe authored
      It's pretty trivial to wire up provided buffer support for the send
      side, just like how it's done on the receive side. This enables
      setting up a buffer ring that an application can use to push pending
      sends to, and then have a send pick a buffer from that ring.
      
      One of the challenges with async IO and networking sends is that you
      can get into reordering conditions if you have more than one inflight
      at the same time. Consider the following scenario where everything is
      fine:
      
      1) App queues sendA for socket1
      2) App queues sendB for socket1
      3) App does io_uring_submit()
      4) sendA is issued, completes successfully, posts CQE
      5) sendB is issued, completes successfully, posts CQE
      
      All is fine. Requests are always issued in-order, and both complete
      inline as most sends do.
      
      However, if we're flooding socket1 with sends, the following could
      also result from the same sequence:
      
      1) App queues sendA for socket1
      2) App queues sendB for socket1
      3) App does io_uring_submit()
      4) sendA is issued, socket1 is full, poll is armed for retry
      5) Space frees up in socket1, this triggers sendA retry via task_work
      6) sendB is issued, completes successfully, posts CQE
      7) sendA is retried, completes successfully, posts CQE
      
      Now we've sent sendB before sendA, which can make things unhappy. If
      both sendA and sendB had been using provided buffers, then it would look
      as follows instead:
      
      1) App queues dataA for sendA, queues sendA for socket1
      2) App queues dataB for sendB, queues sendB for socket1
      3) App does io_uring_submit()
      4) sendA is issued, socket1 is full, poll is armed for retry
      5) Space frees up in socket1, this triggers sendA retry via task_work
      6) sendB is issued, picks first buffer (dataA), completes successfully,
         posts CQE (which says "I sent dataA")
      7) sendA is retried, picks first remaining buffer (dataB), completes
         successfully, posts CQE (which says "I sent dataB")
      
      Now we've sent the data in order, and everybody is happy.
      
      It's worth noting that this also opens the door for supporting multishot
      sends, as provided buffers would be a prerequisite for that. Those can
      trigger either when new buffers are added to the outgoing ring, or (if
      stalled due to lack of space) when space frees up in the socket.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
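The ordering argument above can be checked with a toy model. This is a sketch under simplifying assumptions, not io_uring code: `run_fixed()` and `run_provided()` are hypothetical functions, `exec_order` lists which request actually executes at each step, and "the wire" is just an array of payload pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Each send carries its own payload, so the order on the wire follows
 * the order in which the requests happen to execute. */
static void run_fixed(const char *payload[], const int exec_order[], int n,
                      const char *wire[])
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        wire[i] = payload[exec_order[i]];
}

/* Each send picks whatever buffer is at the head of the outgoing ring when
 * it executes, so the order on the wire follows the ring order no matter
 * which request runs first. exec_order is deliberately ignored. */
static void run_provided(const char *ring[], const int exec_order[], int n,
                         const char *wire[])
{
    int head = 0;

    assert(n >= 0);
    (void)exec_order;
    for (int i = 0; i < n; i++)
        wire[i] = ring[head++];
}
```

With sendB executing before the retried sendA, the first model emits dataB then dataA, while the second still emits dataA then dataB.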
    • io_uring/net: add generic multishot retry helper · 3e747ded
      Jens Axboe authored
      This just moves io_recv_prep_retry() higher up so it can be used for
      sends as well, and renames it to be generically useful for both
      sends and receives.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 17 Apr, 2024 3 commits
  12. 15 Apr, 2024 15 commits