1. 09 Mar, 2024 1 commit
    • io_uring: Fix sqpoll utilization check racing with dying sqpoll · 606559dc
      Gabriel Krisman Bertazi authored
      Commit 3fcb9d17 ("io_uring/sqpoll: statistics of the true
      utilization of sq threads"), currently in Jens' for-next branch, peeks at
      io_sq_data->thread to report utilization statistics. But if
      io_uring_show_fdinfo races with sqpoll terminating, sqd->thread might be
      NULL even though we hold the ctx lock, and we hit the Oops below.
      
      Note that we could technically just protect the getrusage() call and the
      sq total/work time calculations. But showing some sq information
      (pid/cpu) while omitting other information (utilization) is more
      confusing than not reporting anything at all, IMO. So let's hide it all
      if we happen to race with a dying sqpoll.
      
      This can be triggered consistently in my vm setup running
      sqpoll-cancel-hang.t in a loop.
      
      BUG: kernel NULL pointer dereference, address: 00000000000007b0
      PGD 0 P4D 0
      Oops: 0000 [#1] PREEMPT SMP NOPTI
      CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17-dirty #69
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022
      RIP: 0010:getrusage+0x21/0x3e0
      Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
      RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
      RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
      RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
      R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
      R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
      FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
      PKRU: 55555554
      Call Trace:
       <TASK>
       ? __die_body+0x1a/0x60
       ? page_fault_oops+0x154/0x440
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? do_user_addr_fault+0x174/0x7c0
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? exc_page_fault+0x63/0x140
       ? asm_exc_page_fault+0x22/0x30
       ? getrusage+0x21/0x3e0
       ? seq_printf+0x4e/0x70
       io_uring_show_fdinfo+0x9db/0xa10
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? vsnprintf+0x101/0x4d0
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? seq_vprintf+0x34/0x50
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? seq_printf+0x4e/0x70
       ? seq_show+0x16b/0x1d0
       ? __pfx_io_uring_show_fdinfo+0x10/0x10
       seq_show+0x16b/0x1d0
       seq_read_iter+0xd7/0x440
       seq_read+0x102/0x140
       vfs_read+0xae/0x320
       ? srso_alias_return_thunk+0x5/0xfbef5
       ? __do_sys_newfstat+0x35/0x60
       ksys_read+0xa5/0xe0
       do_syscall_64+0x50/0x110
       entry_SYSCALL_64_after_hwframe+0x6e/0x76
      RIP: 0033:0x7f786ec1db4d
      Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec
      RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d
      RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006
      RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001
      R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60
      R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628
       </TASK>
      Modules linked in:
      CR2: 00000000000007b0
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:getrusage+0x21/0x3e0
      Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89
      RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282
      RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60
      RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000
      R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000
      R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000
      FS:  00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0
      PKRU: 55555554
      Kernel panic - not syncing: Fatal exception
      Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      
      Fixes: 3fcb9d17 ("io_uring/sqpoll: statistics of the true utilization of sq threads")
      Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
      Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      606559dc
  2. 08 Mar, 2024 8 commits
  3. 07 Mar, 2024 3 commits
  4. 04 Mar, 2024 2 commits
  5. 01 Mar, 2024 2 commits
    • io_uring/sqpoll: statistics of the true utilization of sq threads · 3fcb9d17
      Xiaobing Li authored
      Count the running time and actual IO processing time of the sqpoll
      thread, and output the statistical data to fdinfo.
      
      Variable description:
      "work_time" in the code represents the sum of the jiffies of the sq
      thread actually processing IO, that is, how many milliseconds it
      actually takes to process IO. "total_time" represents the total time
      that the sq thread has elapsed from the beginning of the loop to the
      current time point, that is, how many milliseconds it has spent in
      total.
      
      The test tool is fio, and its parameters are as follows:
      [global]
      ioengine=io_uring
      direct=1
      group_reporting
      bs=128k
      norandommap=1
      randrepeat=0
      refill_buffers
      ramp_time=30s
      time_based
      runtime=1m
      clocksource=clock_gettime
      overwrite=1
      log_avg_msec=1000
      numjobs=1
      
      [disk0]
      filename=/dev/nvme0n1
      rw=read
      iodepth=16
      hipri
      sqthread_poll=1
      
      The test results are as follows:
      Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
      SqMask: 0x3
      SqHead: 3197153
      SqTail: 3197153
      CachedSqHead:   3197153
      SqThread:       9231
      SqThreadCpu:    11
      SqTotalTime:    18099614
      SqWorkTime:     16748316
      
      The test results corresponding to different iodepths are as follows:
      |-----------|-------|-------|-------|------|-------|
      |   iodepth |   1   |   4   |   8   |  16  |  64   |
      |-----------|-------|-------|-------|------|-------|
      |utilization| 2.9%  | 8.8%  | 10.9% | 92.9%| 84.4% |
      |-----------|-------|-------|-------|------|-------|
      |    idle   | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% |
      |-----------|-------|-------|-------|------|-------|
      Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com>
      Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3fcb9d17
    • io_uring/net: move recv/recvmsg flags out of retry loop · eb18c29d
      Jens Axboe authored
      The flags don't change, so just initialize them once rather than on
      every loop iteration for multishot.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      eb18c29d
  6. 27 Feb, 2024 4 commits
    • io_uring/kbuf: flag request if buffer pool is empty after buffer pick · c3f9109d
      Jens Axboe authored
      Normally we do an extra roundtrip for retries even if the buffer pool has
      depleted, as we don't check that upfront. Rather than add this check, have
      the buffer selection methods mark the request with REQ_F_BL_EMPTY if the
      used buffer group is out of buffers after this selection. This is very
      cheap to do once we're all the way inside there anyway, and it gives the
      caller a chance to make better decisions on how to proceed.
      
      For example, recv/recvmsg multishot could check this flag when it
      decides whether to keep receiving or not.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c3f9109d
    • io_uring/net: improve the usercopy for sendmsg/recvmsg · 792060de
      Jens Axboe authored
      We're spending a considerable amount of the sendmsg/recvmsg time just
      copying in the message header, and, for provided buffers, the known
      single-entry iovec.
      
      Be a bit smarter about it and enable/disable user access around our
      copying. In a test case that does both sendmsg and recvmsg, the
      runtimes before and after this change (averaged over multiple runs,
      with very stable times) were:
      
      Kernel		Time		Diff
      ====================================
      -git		4720 usec
      -git+commit	4311 usec	-8.7%
      
      and looking at a profile diff, we see the following:
      
      0.25%     +9.33%  [kernel.kallsyms]     [k] _copy_from_user
      4.47%     -3.32%  [kernel.kallsyms]     [k] __io_msg_copy_hdr.constprop.0
      
      where we drop more than 9% of _copy_from_user() time, and consequently
      add time to __io_msg_copy_hdr() where the copies are now attributed to,
      but with a net win of 6%.
      
      In comparison, the same test case with send/recv runs in 3745 usec, which
      is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is
      now only ~13% slower, where it was ~21% slower before.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      792060de
    • io_uring/net: move receive multishot out of the generic msghdr path · c5597802
      Jens Axboe authored
      Move the actual user_msghdr / compat_msghdr into the send and receive
      sides, respectively, so we can move the uaddr receive handling into its
      own handler, and ditto the multishot with buffer selection logic.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c5597802
    • io_uring/net: unify how recvmsg and sendmsg copy in the msghdr · 52307ac4
      Jens Axboe authored
      For recvmsg, we roll our own since we support buffer selections. This
      isn't the case for sendmsg right now, but in preparation for doing so,
      make the recvmsg copy helpers generic so we can call them from the
      sendmsg side as well.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      52307ac4
  7. 15 Feb, 2024 2 commits
    • io_uring/napi: enable even with a timeout of 0 · b4ccc4dd
      Jens Axboe authored
      1 usec is not as short as it used to be, and it makes sense to allow 0
      for a busy poll timeout - this means just doing one loop to check if we
      have anything available. Add a separate ->napi_enabled field to check
      whether napi has been enabled or not.
      
      While at it, move the writing of the ctx napi values after we've copied
      the old values back to userspace. This ensures that if the call fails,
      we'll be in the same state as we were before, rather than some
      indeterminate state.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b4ccc4dd
    • io_uring: kill stale comment for io_cqring_overflow_kill() · 871760eb
      Jens Axboe authored
      This function now deals only with discarding overflow entries on ring
      free and exit, and it no longer returns whether we successfully flushed
      all entries as there's no CQE posting involved anymore. Kill the
      outdated comment.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      871760eb
  8. 14 Feb, 2024 3 commits
    • io_uring/sqpoll: use the correct check for pending task_work · c8d8fc3b
      Jens Axboe authored
      A previous commit moved to using just the private task_work list for
      SQPOLL, but it neglected to update the check for whether we have
      pending task_work. Normally this is fine as we'll attempt to run it
      unconditionally, but if we race with going to sleep AND task_work
      being added, then we certainly need the right check here.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c8d8fc3b
    • io_uring: wake SQPOLL task when task_work is added to an empty queue · 78f9b61b
      Jens Axboe authored
      If there's no current work on the list, we still need to potentially
      wake the SQPOLL task if it is sleeping. This is ordered with the
      wait queue addition in sqpoll, which adds to the wait queue before
      checking for pending work items.
      
      Fixes: af5d68f8 ("io_uring/sqpoll: manage task_work privately")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      78f9b61b
    • io_uring/napi: ensure napi polling is aborted when work is available · 428f1382
      Jens Axboe authored
      While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and
      stalls in packet delivery. It turns out that while
      io_napi_busy_loop_should_end() aborts appropriately on regular
      task_work, it does not abort if we have local task_work pending.
      
      Move io_has_work() into the private io_uring.h header, and gate whether
      we should continue polling on that as well. This makes NAPI polling on
      send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well.
      
      Fixes: 8d0c12a8 ("io-uring: add napi busy poll support")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      428f1382
  9. 13 Feb, 2024 1 commit
  10. 09 Feb, 2024 9 commits
  11. 08 Feb, 2024 5 commits