1. 29 Sep, 2022 4 commits
    • io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL · 46a525e1
      Jens Axboe authored
      This isn't a reliable mechanism to tell if we have task_work pending, we
      really should be looking at whether we have any items queued. This is
      problematic if forward progress is gated on running said task_work. One
      such example is reading from a pipe, where the write side has been closed
      right before the read is started. The fput() of the file queues TWA_RESUME
      task_work, and we need that task_work to be run before ->release() is
      called for the pipe. If ->release() isn't called, then the read will sit
      forever waiting on data that will never arrive.
      
      Fix this by having io_run_task_work() check whether we have task_work
      pending rather than relying on TIF_NOTIFY_SIGNAL for that; the latter
      obviously doesn't work for task_work that is queued without TWA_SIGNAL.
      (A sketch of the resulting check follows this entry.)
      Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org>
      Cc: stable@vger.kernel.org
      Link: https://github.com/axboe/liburing/issues/665
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      46a525e1
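      A minimal sketch of the resulting helper, going by the description above
      (not the verbatim patch; the helper lives in io_uring's internal header
      and the exact ordering may differ):

          static inline bool io_run_task_work(void)
          {
                  /*
                   * Key off the actual task_work queue instead of
                   * TIF_NOTIFY_SIGNAL, so work queued with TWA_RESUME
                   * (no signal) still gets run and e.g. the pipe
                   * ->release() case above can make progress.
                   */
                  if (task_work_pending(current)) {
                          __set_current_state(TASK_RUNNING);
                          clear_notify_signal();
                          task_work_run();
                          return true;
                  }
                  return false;
          }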
    • io_uring/rw: defer fsnotify calls to task context · b000145e
      Jens Axboe authored
      We can't make these fsnotify calls from kiocb completion, as that may
      run in soft/hard irq context. Defer the calls to when we process the
      task_work for this request (sketched after this entry). That avoids
      valid lockdep complaints like:
      
      stack backtrace:
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_usage_bug kernel/locking/lockdep.c:3961 [inline]
       valid_state kernel/locking/lockdep.c:3973 [inline]
       mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
       mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
       mark_lock kernel/locking/lockdep.c:4596 [inline]
       mark_usage kernel/locking/lockdep.c:4527 [inline]
       __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
       lock_acquire kernel/locking/lockdep.c:5666 [inline]
       lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
       __fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
       fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
       might_alloc include/linux/sched/mm.h:271 [inline]
       slab_pre_alloc_hook mm/slab.h:700 [inline]
       slab_alloc mm/slab.c:3278 [inline]
       __kmem_cache_alloc_lru mm/slab.c:3471 [inline]
       kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
       fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
       fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
       fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
       send_to_group fs/notify/fsnotify.c:360 [inline]
       fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
       __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
       fsnotify_parent include/linux/fsnotify.h:77 [inline]
       fsnotify_file include/linux/fsnotify.h:99 [inline]
       fsnotify_access include/linux/fsnotify.h:309 [inline]
       __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
       io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
       iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
       iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
       bio_endio+0x5f9/0x780 block/bio.c:1564
       req_bio_endio block/blk-mq.c:695 [inline]
       blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
       scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
       scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
       scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
       blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
       __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
       invoke_softirq kernel/softirq.c:445 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
       common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240
      
      Fixes: f63cf519 ("io_uring: ensure that fsnotify is always called")
      Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
      Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com
      Reported-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b000145e
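      A simplified sketch of the deferral pattern described above (not the
      verbatim patch; helper names follow io_uring's internals of that era
      and details are abridged): the irq-context completion only queues
      task_work, and the fsnotify calls run once the request's task_work
      executes in process context.

          /* runs from task_work, i.e. process context: fsnotify may allocate */
          static void io_req_rw_complete(struct io_kiocb *req, bool *locked)
          {
                  struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);

                  if (rw->kiocb.ki_flags & IOCB_WRITE)
                          fsnotify_modify(req->file);
                  else
                          fsnotify_access(req->file);

                  io_req_task_complete(req, locked);
          }

          /* may be invoked from soft/hard irq via the bio end_io path */
          static void io_complete_rw(struct kiocb *kiocb, long res)
          {
                  struct io_rw *rw = container_of(kiocb, struct io_rw, kiocb);
                  struct io_kiocb *req = cmd_to_io_kiocb(rw);

                  /* just record the result and punt to task context */
                  io_req_set_res(req, res, 0);
                  req->io_task_work.func = io_req_rw_complete;
                  io_req_task_work_add(req);
          }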
    • io_uring/net: fix fast_iov assignment in io_setup_async_msg() · 3e4cb6eb
      Stefan Metzmacher authored
      I hit a very bad problem during my tests of SENDMSG_ZC:
      the BUG() in first_iovec_segment() triggered very easily.
      The problem was io_setup_async_msg() in the partial retry case,
      which seems to happen more often with _ZC.
      
      iov_iter_iovec_advance() may change i->iov so that i->iov_offset
      is only relative to the first element.
      
      This means kmsg->msg.msg_iter.iov is no longer the
      same as kmsg->fast_iov.
      
      But the copy in io_setup_async_msg() would rewind the iov pointer to
      the start of async_msg->fast_iov, leaving the internal
      state of async_msg->msg.msg_iter inconsistent.
      
      I tested with 5 vectors with lengths 4, 0, 64, 20 and 8388608,
      and got short writes like this:
      - ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244
      - ret=-EAGAIN => io_uring_poll_arm
      - ret=4911225 min_ret=5713448 => remaining 802223  sr->done_io=7586469
      - ret=-EAGAIN => io_uring_poll_arm
      - ret=802223  min_ret=802223  => res=8388692
      
      While this was easily triggered with SENDMSG_ZC (queued for 6.1),
      it has been a potential problem since 7ba89d2a in 5.18 for
      IORING_OP_RECVMSG, and since 4c3c0943 in 5.19 for IORING_OP_SENDMSG.
      
      However, 257e84a5 introduced the critical code into
      io_setup_async_msg() back in 5.11. The fix is sketched after this entry.
      
      Fixes: 7ba89d2a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
      Fixes: 257e84a5 ("io_uring: refactor sendmsg/recvmsg iov managing")
      Cc: stable@vger.kernel.org
      Signed-off-by: Stefan Metzmacher <metze@samba.org>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3e4cb6eb
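      A sketch of the corrected copy in io_setup_async_msg(), going by the
      description above (not necessarily the verbatim patch): when the
      iterator is still backed by the inline vectors, the copy re-points it
      at async_msg->fast_iov at the same index instead of rewinding it to
      the start.

          memcpy(async_msg, kmsg, sizeof(*kmsg));
          async_msg->msg.msg_name = &async_msg->addr;
          /* if we were using fast_iov, keep the same index in the copy */
          if (!kmsg->free_iov) {
                  size_t fast_idx = kmsg->msg.msg_iter.iov - kmsg->fast_iov;

                  async_msg->msg.msg_iter.iov = &async_msg->fast_iov[fast_idx];
          }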
    • io_uring/net: fix non-zc send with address · 04360d3e
      Pavel Begunkov authored
      We're currently ignoring the dest address with non-zerocopy send:
      even though we copy it in from userspace, ->msg_name gets zeroed
      shortly afterwards by the msghdr init. Move the msghdr init earlier
      (see the sketch below).
      
      Fixes: 516e82f0 ("io_uring/net: support non-zerocopy sendto")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      04360d3e
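      A rough sketch of the reordering in io_send() (not the verbatim patch;
      the local variable names are illustrative): zero the msghdr first, then
      copy in the destination address so it is no longer clobbered.

          struct sockaddr_storage address;
          struct msghdr msg;
          int ret;

          /* init first ... */
          msg.msg_name = NULL;
          msg.msg_control = NULL;
          msg.msg_controllen = 0;
          msg.msg_namelen = 0;
          msg.msg_ubuf = NULL;

          /* ... then fill in the destination, if one was given */
          if (sr->addr) {
                  ret = move_addr_to_kernel(sr->addr, sr->addr_len, &address);
                  if (unlikely(ret < 0))
                          return ret;
                  msg.msg_name = &address;
                  msg.msg_namelen = sr->addr_len;
          }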
  2. 28 Sep, 2022 1 commit
  3. 27 Sep, 2022 2 commits
  4. 26 Sep, 2022 1 commit
    • io_uring/net: fix cleanup double free free_iov init · 4c17a496
      Pavel Begunkov authored
      Having ->async_data doesn't mean it's initialised; previously we were
      relying on setting F_CLEANUP at the right moment. With zc sendmsg
      though, we set F_CLEANUP early in prep when we alloc a notif, so we
      may allocate async_data, then fail in copy_msg_hdr(), leaving
      struct io_async_msghdr not initialised correctly but with F_CLEANUP
      set, which causes a ->free_iov double free and probably other nastiness.
      
      Always initialise ->free_iov. Also, since it might now point to fast_iov
      when the header copy fails, avoid freeing it during cleanups (see the
      sketch below).
      
      Reported-by: syzbot+edfd15cd4246a3fc615a@syzkaller.appspotmail.com
      Fixes: 493108d9 ("io_uring/net: zerocopy sendmsg")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4c17a496
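      The idea, sketched (not the verbatim patch; where exactly each piece
      lives may differ): ->free_iov is made valid before the header copy can
      fail, and the cleanup path must not kfree() it while it still points at
      the embedded fast_iov.

          /* header-copy path: ->free_iov is valid even if the copy fails */
          static int io_sendmsg_copy_hdr(struct io_kiocb *req,
                                         struct io_async_msghdr *iomsg)
          {
                  struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);

                  iomsg->msg.msg_name = &iomsg->addr;
                  iomsg->free_iov = iomsg->fast_iov;
                  return sendmsg_copy_msghdr(&iomsg->msg, sr->umsg, sr->msg_flags,
                                             &iomsg->free_iov);
          }

          /* cleanup path: fast_iov is embedded storage, never kfree() it */
          void io_sendmsg_recvmsg_cleanup(struct io_kiocb *req)
          {
                  struct io_async_msghdr *io = req->async_data;

                  /* might still point at ->fast_iov if the header copy failed */
                  if (io->free_iov != io->fast_iov)
                          kfree(io->free_iov);
          }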
  5. 23 Sep, 2022 3 commits
  6. 22 Sep, 2022 1 commit
    • io_uring: ensure local task_work marks task as running · ec7fd256
      Jens Axboe authored
      io_uring will run task_work from contexts that have been prepared for
      waiting, and in doing so it'll implicitly set the task running again
      to avoid issues with blocking conditions. The new deferred local
      task_work doesn't do that, which can result in warning spews about
      this being an invalid condition:
      
      [  112.917576] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ad64af64>] prepare_to_wait_exclusive+0x3f/0xd0
      [  112.983088] WARNING: CPU: 1 PID: 190 at kernel/sched/core.c:9819 __might_sleep+0x5a/0x60
      [  112.987240] Modules linked in:
      [  112.990504] CPU: 1 PID: 190 Comm: io_uring Not tainted 6.0.0-rc6+ #1617
      [  113.053136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
      [  113.133650] RIP: 0010:__might_sleep+0x5a/0x60
      [  113.136507] Code: ee 48 89 df 5b 31 d2 5d e9 33 ff ff ff 48 8b 90 30 0b 00 00 48 c7 c7 90 de 45 82 c6 05 20 8b 79 01 01 48 89 d1 e8 3a 49 77 00 <0f> 0b eb d1 66 90 0f 1f 44 00 00 9c 58 f6 c4 02 74 35 65 8b 05 ed
      [  113.223940] RSP: 0018:ffffc90000537ca0 EFLAGS: 00010286
      [  113.232903] RAX: 0000000000000000 RBX: ffffffff8246782c RCX: ffffffff8270bcc8
      IOPS=133.15K, BW=520MiB/s, IOS/call=32/31
      [  113.353457] RDX: ffffc90000537b50 RSI: 00000000ffffdfff RDI: 0000000000000001
      [  113.358970] RBP: 00000000000003bc R08: 0000000000000000 R09: c0000000ffffdfff
      [  113.361746] R10: 0000000000000001 R11: ffffc90000537b48 R12: ffff888103f97280
      [  113.424038] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
      [  113.428009] FS:  00007f67ae7fc700(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000
      [  113.432794] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  113.503186] CR2: 00007f67b8b9b3b0 CR3: 0000000102b9b005 CR4: 0000000000770ee0
      [  113.507291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  113.512669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  113.574374] PKRU: 55555554
      [  113.576800] Call Trace:
      [  113.578325]  <TASK>
      [  113.579799]  set_page_dirty_lock+0x1b/0x90
      [  113.582411]  __bio_release_pages+0x141/0x160
      [  113.673078]  ? set_next_entity+0xd7/0x190
      [  113.675632]  blk_rq_unmap_user+0xaa/0x210
      [  113.678398]  ? timerqueue_del+0x2a/0x40
      [  113.679578]  nvme_uring_task_cb+0x94/0xb0
      [  113.683025]  __io_run_local_work+0x8a/0x150
      [  113.743724]  ? io_cqring_wait+0x33d/0x500
      [  113.746091]  io_run_local_work.part.76+0x2e/0x60
      [  113.750091]  io_cqring_wait+0x2e7/0x500
      [  113.752395]  ? trace_event_raw_event_io_uring_req_failed+0x180/0x180
      [  113.823533]  __x64_sys_io_uring_enter+0x131/0x3c0
      [  113.827382]  ? switch_fpu_return+0x49/0xc0
      [  113.830753]  do_syscall_64+0x34/0x80
      [  113.832620]  entry_SYSCALL_64_after_hwframe+0x5e/0xc8
      
      Ensure that we mark current as TASK_RUNNING for deferred task_work
      as well.
      
      Fixes: c0e0d6ba ("io_uring: add IORING_SETUP_DEFER_TASKRUN")
      Reported-by: Stefan Roesch <shr@fb.com>
      Reviewed-by: Dylan Yudaken <dylany@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ec7fd256
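      A simplified sketch of the fix (not the verbatim patch; the real
      __io_run_local_work() also handles uring_lock state, ordering and
      completion flushing, and its signature differs slightly): the deferred
      local task_work runner marks the task runnable before executing
      callbacks, mirroring the regular task_work path.

          static int __io_run_local_work(struct io_ring_ctx *ctx, bool locked)
          {
                  struct llist_node *node;
                  int ret = 0;

                  if (unlikely(ctx->submitter_task != current))
                          return -EEXIST;

                  /*
                   * Callers may sit in a prepare_to_wait() loop; set the task
                   * back to TASK_RUNNING before running work that can block.
                   */
                  __set_current_state(TASK_RUNNING);

                  node = llist_del_all(&ctx->work_llist);
                  while (node) {
                          struct llist_node *next = node->next;
                          struct io_kiocb *req = container_of(node, struct io_kiocb,
                                                              io_task_work.node);

                          req->io_task_work.func(req, &locked);
                          node = next;
                          ret++;
                  }
                  return ret;
          }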
  7. 21 Sep, 2022 28 commits