1. 24 Mar, 2022 2 commits
    • io_uring: fix async accept on O_NONBLOCK sockets · a73825ba
      Dylan Yudaken authored
      Do not set REQ_F_NOWAIT if the socket is non-blocking. When that flag is
      set, the accept immediately posts a CQE with -EAGAIN, which means you
      cannot perform an accept SQE asynchronously on an O_NONBLOCK socket.
      
      By removing the flag when there is no pending accept, poll is armed as
      usual and the CQE is posted when a connection comes in.
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220324143435.2875844-1-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a73825ba
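      A minimal userspace sketch of the fixed behavior (assuming liburing;
      error handling omitted): an accept SQE queued against an O_NONBLOCK
      listener now completes when a connection arrives rather than
      immediately with -EAGAIN.

      #include <liburing.h>
      #include <arpa/inet.h>
      #include <netinet/in.h>
      #include <sys/socket.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          struct sockaddr_in addr = {
              .sin_family = AF_INET,
              .sin_port = htons(8080),
              .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
          };
          int lfd;

          /* non-blocking listening socket */
          lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
          bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
          listen(lfd, 16);

          io_uring_queue_init(8, &ring, 0);
          sqe = io_uring_get_sqe(&ring);
          io_uring_prep_accept(sqe, lfd, NULL, NULL, 0);
          io_uring_submit(&ring);

          /* blocks until a client connects; cqe->res is the accepted fd */
          io_uring_wait_cqe(&ring, &cqe);
          printf("accepted fd %d\n", cqe->res);
          io_uring_cqe_seen(&ring, cqe);
          io_uring_queue_exit(&ring);
          close(lfd);
          return 0;
      }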
    • io_uring: remove IORING_CQE_F_MSG · 7ef66d18
      Jens Axboe authored
      This was introduced with the message ring opcode, but isn't strictly
      required for the request itself. The sender can encode what is needed
      in user_data, which is passed to the receiver. It's unclear if having
      a separate flag that essentially says "This CQE did not originate from
      an SQE on this ring" provides any real utility to applications. While
      we can always re-introduce a flag to provide this information, we cannot
      take it away at a later point in time.
      
      Remove the flag while we still can, before it's in a released kernel.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7ef66d18
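      With the flag gone, an application that still wants to tell message-ring
      CQEs apart can reserve a bit in the 64-bit user_data it sends. A small
      illustrative sketch (the tag bit and helper names are made up, not part
      of any API):

      #include <stdbool.h>
      #include <stdint.h>

      #define MSG_RING_TAG (1ULL << 63)  /* high bit: "came from another ring" */

      static inline uint64_t tag_msg_user_data(uint64_t payload)
      {
          return payload | MSG_RING_TAG;
      }

      static inline bool cqe_is_from_msg_ring(uint64_t user_data)
      {
          return (user_data & MSG_RING_TAG) != 0;
      }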
  2. 23 Mar, 2022 5 commits
    • io_uring: add flag for disabling provided buffer recycling · 8a3e8ee5
      Jens Axboe authored
      If we need to continue doing this IO, then we don't want a potentially
      selected buffer recycled. Add a flag for that.
      
      Set this for recv/recvmsg if they do partial IO.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8a3e8ee5
    • io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly · 7ba89d2a
      Jens Axboe authored
      We currently don't attempt to get the full requested length if we get a
      partial receive, even when MSG_WAITALL is set. If we do see a partial
      receive, note how many bytes were transferred and return -EAGAIN so the
      request gets retried.
      
      The iov is advanced appropriately for the vector based case, and we
      manually bump the buffer and remainder for the non-vector case.
      
      Cc: stable@vger.kernel.org
      Reported-by: Constantine Gavrilov <constantine.gavrilov@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7ba89d2a
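      A short sketch of the user-visible effect (assuming liburing and an
      already-connected socket): with this fix, a recv carrying MSG_WAITALL is
      retried on partial receives, so the CQE reports the full requested
      length unless the peer closes the connection or an error occurs.

      #include <liburing.h>
      #include <sys/socket.h>

      static int recv_all(struct io_uring *ring, int sockfd, void *buf, size_t len)
      {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          struct io_uring_cqe *cqe;
          int res;

          io_uring_prep_recv(sqe, sockfd, buf, len, MSG_WAITALL);
          io_uring_submit(ring);

          io_uring_wait_cqe(ring, &cqe);
          res = cqe->res;  /* expected: len, or 0 on EOF / -errno on error */
          io_uring_cqe_seen(ring, cqe);
          return res;
      }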
    • io_uring: don't recycle provided buffer if punted to async worker · 4d55f238
      Jens Axboe authored
      We only really need to recycle the buffer when going async for a file
      type that has an indefinite response time (e.g. not a regular file or
      block device). And for files where we punt to async to arm poll, the
      async worker will arm poll anyway and the buffer will get recycled
      there.
      
      In that latter case, we're not holding ctx->uring_lock. Ensure we take
      the issue_flags into account and acquire it if we need to.
      
      Fixes: b1c62645 ("io_uring: recycle provided buffers if request goes async")
      Reported-by: Stefan Roesch <shr@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4d55f238
    • io_uring: fix assuming triggered poll waitqueue is the single poll · d89a4fac
      Jens Axboe authored
      syzbot reports a recent regression:
      
      BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 kernel/sched/wait.c:101
      Read of size 8 at addr ffff888011e8a130 by task syz-executor413/3618
      
      CPU: 0 PID: 3618 Comm: syz-executor413 Tainted: G        W         5.17.0-syzkaller-01402-g8565d644 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x303 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       __wake_up_common+0x637/0x650 kernel/sched/wait.c:101
       __wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138
       tty_release+0x657/0x1200 drivers/tty/tty_io.c:1781
       __fput+0x286/0x9f0 fs/file_table.c:317
       task_work_run+0xdd/0x1a0 kernel/task_work.c:164
       exit_task_work include/linux/task_work.h:32 [inline]
       do_exit+0xaff/0x29d0 kernel/exit.c:806
       do_group_exit+0xd2/0x2f0 kernel/exit.c:936
       __do_sys_exit_group kernel/exit.c:947 [inline]
       __se_sys_exit_group kernel/exit.c:945 [inline]
       __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:945
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f439a1fac69
      
      which is due to leaving the request on the waitqueue mistakenly. The
      reproducer is using a tty device, which means we end up arming the same
      poll queue twice (it uses the same poll waitqueue for both), but in
      io_poll_wake() we always just clear REQ_F_SINGLE_POLL regardless of which
      entry triggered. This leaves one waitqueue entry potentially armed after
      we're done, which then blows up in tty when removal of that waitqueue
      entry is attempted.
      
      We have no room to store this information, so simply encode it in the
      wait_queue_entry->private where we store the io_kiocb request pointer.
      
      Fixes: 91eac1c6 ("io_uring: cache poll/double-poll state with a request flag")
      Reported-by: syzbot+09ad4050dd3a120bfccd@syzkaller.appspotmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d89a4fac
    • io_uring: bump poll refs to full 31-bits · e2c0cb7c
      Jens Axboe authored
      The previous commit:
      
      61bc84c4 ("io_uring: remove poll entry from list when canceling all")
      
      removed a potential overflow condition for the poll references. They
      are currently limited to 20-bits, even if we have 31-bits available. The
      upper bit is used to mark for cancelation.
      
      Bump the poll ref space to 31-bits, making that kind of situation much
      harder to trigger in general. We'll separately add overflow checking
      and handling.
      
      Fixes: aa43477b ("io_uring: poll rework")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e2c0cb7c
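      For illustration only (this is not the kernel code), the layout being
      moved to is a 32-bit word whose low 31 bits count references and whose
      top bit marks cancelation; a C11-atomics sketch of that pattern:

      #include <stdatomic.h>
      #include <stdbool.h>

      #define POLL_CANCEL_FLAG (1u << 31)
      #define POLL_REF_MASK    (POLL_CANCEL_FLAG - 1)  /* low 31 bits */

      /* the caller that takes the first reference owns processing */
      static inline bool poll_get_ownership(atomic_uint *refs)
      {
          return (atomic_fetch_add(refs, 1) & POLL_REF_MASK) == 0;
      }

      static inline void poll_mark_cancel(atomic_uint *refs)
      {
          atomic_fetch_or(refs, POLL_CANCEL_FLAG);
      }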
  3. 22 Mar, 2022 1 commit
    • io_uring: remove poll entry from list when canceling all · 61bc84c4
      Jens Axboe authored
      When the ring is exiting, as part of the shutdown, poll requests are
      removed. But io_poll_remove_all() does not remove entries when finding
      them, and since completions are done out-of-band, we can find and remove
      the same entry multiple times.
      
      We do guard poll execution by poll ownership, but that does not prevent
      us from issuing a new removal once the previous removal's ownership goes
      away.
      
      This can race with poll execution as well, where we then end up seeing
      req->apoll be NULL because a previous task_work requeue finished the
      request.
      
      Remove the poll entry when we find it and get ownership of it. This
      prevents multiple invocations from finding it.
      
      Fixes: aa43477b ("io_uring: poll rework")
      Reported-by: Dylan Yudaken <dylany@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      61bc84c4
  4. 21 Mar, 2022 1 commit
  5. 20 Mar, 2022 2 commits
    • io_uring: ensure that fsnotify is always called · f63cf519
      Jens Axboe authored
      Ensure that we call fsnotify_modify() if we write a file, and that we
      do fsnotify_access() if we read it. This enables anyone using inotify
      on the file to get notified.
      
      Ditto for fallocate, ensure that fsnotify_modify() is called.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f63cf519
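      A quick way to observe the change from userspace (assuming liburing;
      error handling omitted): an inotify watch on a file now sees IN_MODIFY
      when the file is written through io_uring.

      #include <liburing.h>
      #include <sys/inotify.h>
      #include <fcntl.h>
      #include <limits.h>
      #include <stdio.h>
      #include <unistd.h>

      int main(void)
      {
          struct io_uring ring;
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          char evbuf[sizeof(struct inotify_event) + NAME_MAX + 1];
          int fd, ifd;

          fd = open("/tmp/io_uring_fsnotify_test", O_WRONLY | O_CREAT, 0644);
          ifd = inotify_init1(0);
          inotify_add_watch(ifd, "/tmp/io_uring_fsnotify_test", IN_MODIFY);

          io_uring_queue_init(4, &ring, 0);
          sqe = io_uring_get_sqe(&ring);
          io_uring_prep_write(sqe, fd, "hello", 5, 0);
          io_uring_submit(&ring);
          io_uring_wait_cqe(&ring, &cqe);
          io_uring_cqe_seen(&ring, cqe);

          /* with the fix this read returns an IN_MODIFY event */
          if (read(ifd, evbuf, sizeof(evbuf)) > 0)
              puts("got inotify event for the io_uring write");
          return 0;
      }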
    • io_uring: recycle provided before arming poll · abdad709
      Jens Axboe authored
      We currently have a race where we recycle the selected buffer if poll
      returns IO_APOLL_OK. But that's too late, as the poll could already be
      triggering or have triggered. If that race happens, then we're putting a
      buffer that's already being used.
      
      Fix this by recycling before we arm poll. This does mean that we'll
      sometimes almost instantly re-select the buffer, but it's rare enough in
      testing that it should not pose a performance issue.
      
      Fixes: b1c62645 ("io_uring: recycle provided buffers if request goes async")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      abdad709
  6. 18 Mar, 2022 2 commits
  7. 17 Mar, 2022 9 commits
  8. 16 Mar, 2022 4 commits
    • io_uring: cache poll/double-poll state with a request flag · 91eac1c6
      Jens Axboe authored
      With commit "io_uring: cache req->apoll->events in req->cflags" applied,
      we now have just io_poll_remove_entries() dipping into req->apoll when
      it isn't strictly necessary.
      
      Mark poll and double-poll with a flag, so we know if we need to look
      at apoll->double_poll. This avoids pulling in those cachelines if we
      don't need them. The common case is that the poll wake handler already
      removed these entries while hot off the completion path.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      91eac1c6
    • io_uring: cache req->apoll->events in req->cflags · 81459350
      Jens Axboe authored
      When we arm poll on behalf of a different type of request, like a network
      receive, then we allocate req->apoll as our poll entry. Running network
      workloads shows io_poll_check_events() as the most expensive part of
      io_uring, and it's all due to having to pull in req->apoll instead of
      just the request which we have hot already.
      
      Cache poll->events in req->cflags, which isn't used until the request
      completes anyway. This isn't strictly needed for regular poll, where
      req->poll.events is used and thus already hot, but for the sake of
      unification we do it all around.
      
      This saves 3-4% of overhead in certain request workloads.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      81459350
    • io_uring: move req->poll_refs into previous struct hole · 521d61fc
      Jens Axboe authored
      This serves two purposes:
      
      - We now have the last cacheline mostly unused for generic workloads,
        instead of having to pull in the poll refs explicitly for workloads
        that rely on poll arming.
      
      - It shrinks the io_kiocb from 232 to 224 bytes.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      521d61fc
    • io_uring: make tracing format consistent · 052ebf1f
      Dylan Yudaken authored
      Make the tracing formatting for user_data and flags consistent.
      
      Having consistent formatting allows one, for example, to grep for a
      specific user_data or flags value and trace a single sqe through easily.
      
      Change user_data to 0x%llx and flags to 0x%x everywhere. The '0x' prefix
      is useful to disambiguate values such as "user_data 100".
      
      Additionally remove the '=' for flags in io_uring_req_failed, again for consistency.
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220316095204.2191498-1-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      052ebf1f
  9. 15 Mar, 2022 1 commit
    • io_uring: recycle apoll_poll entries · 4d9237e3
      Jens Axboe authored
      Particularly for networked workloads, io_uring intensively uses its
      poll based backend to get a notification when data/space is available.
      Profiling workloads, we see 3-4% of alloc+free that is directly attributed
      to just the apoll allocation and free (and the rest being skb alloc+free).
      
      For the fast path, we have ctx->uring_lock held already for both issue
      and the inline completions, and we can utilize that to avoid any extra
      locking needed to have a basic recycling cache for the apoll entries on
      both the alloc and free side.
      
      Double poll still requires an allocation. But those are rare and not
      a fast path item.
      
      With the simple cache in place, we see a 3-4% reduction in overhead for
      the workload.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4d9237e3
  10. 12 Mar, 2022 1 commit
  11. 10 Mar, 2022 12 commits
    • io_uring: allow submissions to continue on error · bcbb7bf6
      Jens Axboe authored
      By default, io_uring will stop submitting a batch of requests if we run
      into an error submitting a request. This isn't strictly necessary, as
      the error result is passed out-of-band via a CQE anyway. And it can be
      a bit confusing for some applications.
      
      Provide a way to set up a ring that will continue submitting on error,
      once the error CQE has been posted.
      
      There's still one case that will break out of submission. If we fail
      allocating a request, then we'll still return -ENOMEM. We could in theory
      post a CQE for that condition too even if we never got a request. Leave
      that for a potential followup.
      Reported-by: Dylan Yudaken <dylany@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      bcbb7bf6
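      A minimal sketch of opting in from userspace (assuming liburing; the
      setup flag below is the uapi name this change introduces):

      #include <liburing.h>

      static int setup_ring(struct io_uring *ring)
      {
          /* without IORING_SETUP_SUBMIT_ALL, submission stops at the first
           * request that fails to submit; with it, the rest of the batch is
           * still submitted and the failure shows up as an error CQE */
          return io_uring_queue_init(64, ring, IORING_SETUP_SUBMIT_ALL);
      }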
    • io_uring: recycle provided buffers if request goes async · b1c62645
      Jens Axboe authored
      If we are using provided buffers, it's less than useful to have a buffer
      selected and pinned if a request needs to go async or arms poll to get
      notified when we can process it.
      
      Recycle the buffer in those events, so we don't pin it for the duration
      of the request.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b1c62645
    • io_uring: ensure reads re-import for selected buffers · 2be2eb02
      Jens Axboe authored
      If we drop buffers in between scheduling a retry, then we need to
      re-import them when we start the request again.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2be2eb02
    • io_uring: retry early for reads if we can poll · 9af177ee
      Jens Axboe authored
      Most of the logic in io_read() deals with regular files, and in some ways
      it would make sense to split the handling into S_IFREG and others. But
      at least for retry, we don't need to bother setting up a bunch of state
      just to abort in the loop later. In particular, don't bother forcing
      setup of async data for a normal non-vectored read when we don't need it.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9af177ee
    • io_uring: Add support for napi_busy_poll · adc8682e
      Olivier Langlois authored
      The sqpoll thread can be used to perform NAPI busy polling, in a similar
      way to how it does IO polling for file systems that support direct
      access bypassing the page cache.
      
      The other way that io_uring can be used for napi busy poll is by
      calling io_uring_enter() to get events.
      
      If the user specifies a timeout value, it is distributed between polling
      and sleeping using the system-wide setting
      /proc/sys/net/core/busy_poll.
      
      The changes have been tested with this program:
      https://github.com/lano1106/io_uring_udp_ping
      
      and the result is:
      Without sqpoll:
      NAPI busy loop disabled:
      rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
      NAPI busy loop enabled:
      rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us
      
      With sqpoll:
      NAPI busy loop disabled:
      rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
      NAPI busy loop enabled:
      rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us
      Co-developed-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Olivier Langlois <olivier@trillion01.com>
      Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      adc8682e
    • io_uring: minor io_cqring_wait() optimization · 950e79dd
      Olivier Langlois authored
      Move up the block manipulating the sig variable, so that code which may
      encounter an error and exit runs first, before continuing with the rest
      of the function. This avoids useless computations.
      Signed-off-by: Olivier Langlois <olivier@trillion01.com>
      Link: https://lore.kernel.org/r/84513f7cc1b1fb31d8f4cb910aee033391d036b4.1646777484.git.olivier@trillion01.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      950e79dd
    • io_uring: add support for IORING_OP_MSG_RING command · 4f57f06c
      Jens Axboe authored
      This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
      another ring. That allows either waking up someone waiting on the ring,
      or even passing a 64-bit value via the user_data field in the CQE.
      
      sqe->fd must contain the fd of a ring that should receive the CQE.
      sqe->off will be propagated to the cqe->user_data on the target ring,
      and sqe->len will be propagated to cqe->res. The resulting CQE will have
      IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
      from a messaging request rather than an SQE issued locally on that ring.
      This effectively allows passing a 64-bit and a 32-bit quantity between
      the two rings.
      
      This request type has the following request specific error cases:
      
      - -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
        of the io_uring type.
      - -EOVERFLOW. Set if we were not able to deliver a request to the target
        ring.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4f57f06c
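      A small sketch of sending such a message (assuming a liburing recent
      enough to have io_uring_prep_msg_ring; error handling omitted):

      #include <liburing.h>

      /* post a CQE with user_data 0xcafe and res 42 on the ring behind target_fd */
      static int signal_other_ring(struct io_uring *ring, int target_fd)
      {
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          struct io_uring_cqe *cqe;
          int ret;

          /* sqe->off becomes cqe->user_data, sqe->len becomes cqe->res */
          io_uring_prep_msg_ring(sqe, target_fd, 42, 0xcafe, 0);
          io_uring_submit(ring);

          io_uring_wait_cqe(ring, &cqe);
          ret = cqe->res;  /* 0 on success, -EBADFD or -EOVERFLOW on failure */
          io_uring_cqe_seen(ring, cqe);
          return ret;
      }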
    • io_uring: speedup provided buffer handling · cc3cec83
      Jens Axboe authored
      In testing high frequency workloads with provided buffers, we spend a
      lot of time in allocating and freeing the buffer units themselves.
      Rather than repeatedly free and alloc them, add a recycling cache
      instead. There are two caches:
      
      - ctx->io_buffers_cache. This is the one we grab from in the submission
        path, and it's protected by ctx->uring_lock. For inline completions,
        we can recycle straight back to this cache and not need any extra
        locking.
      
      - ctx->io_buffers_comp. If we're not under uring_lock, then we use this
        list to recycle buffers. It's protected by the completion_lock.
      
      On adding a new buffer, check io_buffers_cache. If it's empty, check if
      we can splice entries from the io_buffers_comp list.
      
      This reduces the overhead of provided buffers by about 5-10%, bringing it
      pretty close to the non-provided path.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cc3cec83
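      The caches above are internal, but for context this is the
      provided-buffer flow they speed up, as seen from userspace (assuming
      liburing; error handling omitted, and pool must be NR_BUFS * BUF_SZ
      bytes):

      #include <liburing.h>

      #define BGID    1
      #define NR_BUFS 8
      #define BUF_SZ  4096

      static int recv_with_provided_buf(struct io_uring *ring, int sockfd, char *pool)
      {
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          int bid;

          /* hand NR_BUFS buffers of BUF_SZ bytes to the kernel under group BGID */
          sqe = io_uring_get_sqe(ring);
          io_uring_prep_provide_buffers(sqe, pool, BUF_SZ, NR_BUFS, BGID, 0);
          io_uring_submit(ring);
          io_uring_wait_cqe(ring, &cqe);
          io_uring_cqe_seen(ring, cqe);

          /* let the kernel pick a buffer from the group at completion time */
          sqe = io_uring_get_sqe(ring);
          io_uring_prep_recv(sqe, sockfd, NULL, BUF_SZ, 0);
          sqe->flags |= IOSQE_BUFFER_SELECT;
          sqe->buf_group = BGID;
          io_uring_submit(ring);
          io_uring_wait_cqe(ring, &cqe);

          bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;  /* which buffer was used */
          io_uring_cqe_seen(ring, cqe);
          return bid;
      }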
    • io_uring: add support for registering ring file descriptors · e7a6c00d
      Jens Axboe authored
      Lots of workloads use multiple threads, in which case the file table is
      shared between them. This makes getting and putting the ring file
      descriptor for each io_uring_enter(2) system call more expensive, as it
      involves an atomic get and put for each call.
      
      Similarly to how we allow registering normal file descriptors to avoid
      this overhead, add support for an io_uring_register(2) API that allows
      registering the ring fds themselves:
      
      1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
         structs, and registers them with the task.
      2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_rsrc_update
         structs, and unregisters them.
      
      When a ring fd is registered, it is internally represented by an offset.
      This offset is returned to the application, and the application then
      uses this offset and sets IORING_ENTER_REGISTERED_RING for the
      io_uring_enter(2) system call. This works just like using a registered
      file descriptor, rather than a real one, in an SQE, where
      IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
      offset/descriptor rather than a real file descriptor.
      
      In initial testing, this provides a nice bump in performance for
      threaded applications in real world cases where the batch count (eg
      number of requests submitted per io_uring_enter(2) invocation) is low.
      In a microbenchmark, submitting NOP requests, we see the following
      increases in performance:
      
      Requests per syscall	Baseline	Registered	Increase
      ----------------------------------------------------------------
      1			 ~7030K		 ~8080K		+15%
      2			~13120K		~14800K		+13%
      4			~22740K		~25300K		+11%
      Co-developed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e7a6c00d
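      From userspace this is a one-call opt-in (assuming a liburing recent
      enough to have io_uring_register_ring_fd): once the ring's own fd is
      registered, liburing passes the registered offset and sets
      IORING_ENTER_REGISTERED_RING on its io_uring_enter(2) calls for you.

      #include <liburing.h>

      static int setup_registered_ring(struct io_uring *ring)
      {
          int ret = io_uring_queue_init(64, ring, 0);
          if (ret < 0)
              return ret;

          /* returns the number of fds registered (1) on success */
          ret = io_uring_register_ring_fd(ring);
          return ret < 0 ? ret : 0;
      }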
    • io_uring: documentation fixup · 63c36549
      Dylan Yudaken authored
      Fix incorrect name reference in comment. ki_filp does not exist in the
      struct, but file does.
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Link: https://lore.kernel.org/r/20220224105157.1332353-1-dylany@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      63c36549
    • io_uring: do not recalculate ppos unnecessarily · b4aec400
      Dylan Yudaken authored
      There is a slight optimisation to be had by calculating the correct pos
      pointer inside io_kiocb_update_pos and then using that later.
      
      It seems code size drops by a bit:
      000000000000a1b0 0000000000000400 t io_read
      000000000000a5b0 0000000000000319 t io_write
      
      vs
      000000000000a1b0 00000000000003f6 t io_read
      000000000000a5b0 0000000000000310 t io_write
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b4aec400
    • io_uring: update kiocb->ki_pos at execution time · d34e1e5b
      Dylan Yudaken authored
      Update kiocb->ki_pos at execution time rather than in io_prep_rw().
      io_prep_rw() happens before the job is enqueued to a worker and so the
      offset might be read multiple times before being executed once.
      
      This ensures that the file position in a set of _linked_ SQEs will only
      be obtained after earlier SQEs have completed, and so will include their
      incremented file position.
      Signed-off-by: Dylan Yudaken <dylany@fb.com>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d34e1e5b
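      A short sketch of the behavior this makes reliable (assuming liburing;
      error handling omitted): two linked writes that use the file position
      (offset -1) land back-to-back, because the second one reads the position
      only after the first has completed and advanced it.

      #include <liburing.h>

      static int append_two(struct io_uring *ring, int fd)
      {
          struct io_uring_sqe *sqe;
          struct io_uring_cqe *cqe;
          int i;

          sqe = io_uring_get_sqe(ring);
          io_uring_prep_write(sqe, fd, "first", 5, -1);   /* use/update f_pos */
          sqe->flags |= IOSQE_IO_LINK;

          sqe = io_uring_get_sqe(ring);
          io_uring_prep_write(sqe, fd, "second", 6, -1);  /* sees updated f_pos */

          io_uring_submit(ring);
          for (i = 0; i < 2; i++) {
              io_uring_wait_cqe(ring, &cqe);
              io_uring_cqe_seen(ring, cqe);
          }
          return 0;
      }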