1. 08 Mar, 2024 2 commits
    • io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io · 9817ad85
      Jens Axboe authored
      Ensure that prep handlers always initialize sr->done_io before any
      potential failure conditions, and with that, we know it has always been
      set, even in the failure case.
      
      With that, we no longer need the REQ_F_PARTIAL_IO flag to gate on whether
      sr->done_io is valid. Additionally, req->cqe.res should not be overwritten
      unless sr->done_io is actually positive.
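
      As a minimal illustration of the pattern (names follow io_uring/net.c, but
      this is a condensed sketch rather than the exact diff):

      int io_sendmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
      {
              struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);

              /* set before any early-return failure path, so it is always valid */
              sr->done_io = 0;

              /* ... remaining prep work and its failure checks ... */
              return 0;
      }

      static void io_sendrecv_fail(struct io_kiocb *req)
      {
              struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);

              /* only report partial progress if some was actually made */
              if (sr->done_io > 0)
                      req->cqe.res = sr->done_io;
      }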
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: correctly handle multishot recvmsg retry setup · deaef31b
      Jens Axboe authored
      If we loop for multishot receive on the initial attempt, and then abort
      later on to wait for more, we miss a case where we should be copying the
      io_async_msghdr from the stack to stable storage. This leads to the next
      retry potentially failing, if the application had the msghdr on the
      stack.
      
      Cc: stable@vger.kernel.org
      Fixes: 9bb66906 ("io_uring: support multishot in recvmsg")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 07 Mar, 2024 3 commits
  3. 04 Mar, 2024 2 commits
  4. 01 Mar, 2024 2 commits
    • io_uring/sqpoll: statistics of the true utilization of sq threads · 3fcb9d17
      Xiaobing Li authored
      Count the running time and the actual IO processing time of the sqpoll
      thread, and output the statistics to fdinfo.

      Variable description:
      "work_time" in the code is the cumulative time, in jiffies, that the sq
      thread spends actually processing IO. "total_time" is the total time that
      has elapsed from the start of the sq thread's loop to the current point.
      The ratio of the two therefore gives the thread's true utilization.
      
      The test tool is fio, and its parameters are as follows:
      [global]
      ioengine=io_uring
      direct=1
      group_reporting
      bs=128k
      norandommap=1
      randrepeat=0
      refill_buffers
      ramp_time=30s
      time_based
      runtime=1m
      clocksource=clock_gettime
      overwrite=1
      log_avg_msec=1000
      numjobs=1
      
      [disk0]
      filename=/dev/nvme0n1
      rw=read
      iodepth=16
      hipri
      sqthread_poll=1
      
      The test results are as follows:
      Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
      SqMask: 0x3
      SqHead: 3197153
      SqTail: 3197153
      CachedSqHead:   3197153
      SqThread:       9231
      SqThreadCpu:    11
      SqTotalTime:    18099614
      SqWorkTime:     16748316
      
      The test results corresponding to different iodepths are as follows:
      |-------------|-------|-------|-------|-------|-------|
      |   iodepth   |   1   |   4   |   8   |  16   |  64   |
      |-------------|-------|-------|-------|-------|-------|
      | utilization | 2.9%  | 8.8%  | 10.9% | 92.9% | 84.4% |
      |-------------|-------|-------|-------|-------|-------|
      |    idle     | 97.1% | 91.2% | 89.1% | 7.1%  | 15.6% |
      |-------------|-------|-------|-------|-------|-------|
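
      For reference, the utilization figures above correspond to
      SqWorkTime / SqTotalTime from fdinfo; a small, illustrative userspace
      helper (the pid/fd path is hardcoded to match the example output above):

      #include <stdio.h>

      int main(void)
      {
              unsigned long long total = 0, work = 0;
              char line[256];
              FILE *f = fopen("/proc/9230/fdinfo/6", "r");

              if (!f)
                      return 1;
              while (fgets(line, sizeof(line), f)) {
                      /* pick out the two new fields, ignore everything else */
                      sscanf(line, "SqTotalTime: %llu", &total);
                      sscanf(line, "SqWorkTime: %llu", &work);
              }
              fclose(f);
              if (total)
                      printf("utilization: %.1f%%\n", 100.0 * work / total);
              return 0;
      }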
      Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com>
      Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: move recv/recvmsg flags out of retry loop · eb18c29d
      Jens Axboe authored
      The flags don't change, so just initialize them once rather than on every
      loop iteration for multishot.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 27 Feb, 2024 4 commits
    • io_uring/kbuf: flag request if buffer pool is empty after buffer pick · c3f9109d
      Jens Axboe authored
      Normally we do an extra roundtrip for retries even if the buffer pool has
      been depleted, as we don't check for that upfront. Rather than add such a
      check, have the buffer selection methods mark the request with
      REQ_F_BL_EMPTY if the buffer group in use is out of buffers after this
      selection. This is very cheap to do since we're already inside the
      selection path anyway, and it gives the caller a chance to make better
      decisions on how to proceed.
      
      For example, recv/recvmsg multishot could check this flag when it
      decides whether to keep receiving or not.
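
      For illustration, such a check could look like the following (a
      hypothetical helper, not in-tree code):

      static inline bool io_mshot_should_rearm(struct io_kiocb *req, int ret)
      {
              /* the buffer group ran dry on the last pick - don't blindly rearm */
              if (req->flags & REQ_F_BL_EMPTY)
                      return false;
              /* otherwise keep the multishot armed while receives succeed */
              return ret > 0;
      }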
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: improve the usercopy for sendmsg/recvmsg · 792060de
      Jens Axboe authored
      We're spending a considerable amount of the sendmsg/recvmsg time just
      copying in the message header, and, for provided buffers, the known
      single-entry iovec.

      Be a bit smarter about it and enable/disable user access around our
      copying. In a test case that does both sendmsg and recvmsg, the runtimes
      before and after this change (averaged over multiple runs, with very
      stable times) are:
      
      Kernel		Time		Diff
      ====================================
      -git		4720 usec
      -git+commit	4311 usec	-8.7%
      
      and looking at a profile diff, we see the following:
      
      0.25%     +9.33%  [kernel.kallsyms]     [k] _copy_from_user
      4.47%     -3.32%  [kernel.kallsyms]     [k] __io_msg_copy_hdr.constprop.0
      
      where we drop more than 9% of _copy_from_user() time and consequently
      add time to __io_msg_copy_hdr(), to which the copies are now attributed,
      for a net win of 6%.
      
      In comparison, the same test case with send/recv runs in 3745 usec, which
      is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is
      now only ~13% slower, where it was ~21% slower before.
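
      Enabling/disabling user access around the copy, as described above, maps
      to the kernel's user_access_begin() / unsafe_get_user() pattern; a
      condensed, illustrative sketch (not the exact diff):

      static int io_copy_msghdr_from_user(struct user_msghdr *msg,
                                          struct user_msghdr __user *umsg)
      {
              if (!user_access_begin(umsg, sizeof(*umsg)))
                      return -EFAULT;
              /* one access window for the whole header instead of many copies */
              unsafe_get_user(msg->msg_name, &umsg->msg_name, ua_end);
              unsafe_get_user(msg->msg_namelen, &umsg->msg_namelen, ua_end);
              unsafe_get_user(msg->msg_iov, &umsg->msg_iov, ua_end);
              unsafe_get_user(msg->msg_iovlen, &umsg->msg_iovlen, ua_end);
              unsafe_get_user(msg->msg_control, &umsg->msg_control, ua_end);
              unsafe_get_user(msg->msg_controllen, &umsg->msg_controllen, ua_end);
              unsafe_get_user(msg->msg_flags, &umsg->msg_flags, ua_end);
              user_access_end();
              return 0;
      ua_end:
              user_access_end();
              return -EFAULT;
      }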
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: move receive multishot out of the generic msghdr path · c5597802
      Jens Axboe authored
      Move the actual user_msghdr / compat_msghdr copying into the send and
      receive sides, respectively, so we can move the uaddr receive handling
      into its own handler, and likewise for the multishot-with-buffer-selection
      logic.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: unify how recvmsg and sendmsg copy in the msghdr · 52307ac4
      Jens Axboe authored
      For recvmsg, we roll our own msghdr copy since we support buffer
      selection. This isn't the case for sendmsg right now but, in preparation
      for doing so, make the recvmsg copy helpers generic so we can call them
      from the sendmsg side as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 15 Feb, 2024 2 commits
    • io_uring/napi: enable even with a timeout of 0 · b4ccc4dd
      Jens Axboe authored
      1 usec is not as short as it used to be, and it makes sense to allow 0
      for a busy poll timeout - this means just do one polling loop to check if
      we have anything available. Add a separate ->napi_enabled flag to track
      whether napi has been enabled or not.
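
      From userspace this corresponds to something like the following liburing
      call (illustrative; the helper name comes from liburing, not this patch):

      #include <liburing.h>

      /* a 0 usec timeout is now accepted and means "do one polling pass" */
      static int enable_napi_single_pass(struct io_uring *ring)
      {
              struct io_uring_napi napi = {
                      .busy_poll_to = 0,
                      .prefer_busy_poll = 1,
              };

              return io_uring_register_napi(ring, &napi);
      }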
      
      While at it, move the writing of the ctx napi values to after we've
      copied the old values back to userspace. This ensures that if the call
      fails, we'll be left in the same state as before, rather than in some
      indeterminate state.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: kill stale comment for io_cqring_overflow_kill() · 871760eb
      Jens Axboe authored
      This function now deals only with discarding overflow entries on ring
      free and exit, and it no longer returns whether we successfully flushed
      all entries as there's no CQE posting involved anymore. Kill the
      outdated comment.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 14 Feb, 2024 3 commits
    • io_uring/sqpoll: use the correct check for pending task_work · c8d8fc3b
      Jens Axboe authored
      A previous commit moved to using just the private task_work list for
      SQPOLL, but it neglected to update the check for whether we have
      pending task_work. Normally this is fine as we'll attempt to run it
      unconditionally, but if we race with going to sleep AND task_work
      being added, then we certainly need the right check here.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: wake SQPOLL task when task_work is added to an empty queue · 78f9b61b
      Jens Axboe authored
      If there's no current work on the list, we still need to potentially
      wake the SQPOLL task if it is sleeping. This is ordered with the
      wait queue addition in sqpoll, which adds to the wait queue before
      checking for pending work items.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/napi: ensure napi polling is aborted when work is available · 428f1382
      Jens Axboe authored
      While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and
      stalls in packet delivery. Turns out that while
      io_napi_busy_loop_should_end() aborts appropriately on regular
      task_work, it does not abort if we have local task_work pending.
      
      Move io_has_work() into the private io_uring.h header, and gate whether
      we should continue polling on that as well. This makes NAPI polling on
      send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well.
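
      A condensed sketch of the resulting check (illustrative, not the exact
      diff):

      static bool io_napi_busy_loop_should_end(void *data, unsigned long start_time)
      {
              struct io_wait_queue *iowq = data;

              if (signal_pending(current))
                      return true;
              /* abort on regular wakeups and on pending (local) task_work alike */
              if (io_should_wake(iowq) || io_has_work(iowq->ctx))
                      return true;

              /* ... busy-poll timeout check elided ... */
              return false;
      }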
      
      Fixes: 8d0c12a8 ("io-uring: add napi busy poll support")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 13 Feb, 2024 1 commit
  9. 09 Feb, 2024 9 commits
  10. 08 Feb, 2024 12 commits