1. 30 Sep, 2022 3 commits
    • block: kill deprecated BUG_ON() in the flush handling · e73a625b
      Jens Axboe authored
      We've never had any useful reports from this BUG_ON(), and in fact a
      number of the BUG_ON()'s in the flush handling need to be turned into
      more graceful handling.
      
      In preparation for allowing batched completions of the end_io handling,
      where we can enter the flush completion with queuelist having been reused
      for the batch, get rid of this BUG_ON().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'for-6.1/io_uring' into for-6.1/passthrough · 5853a7b5
      Jens Axboe authored
      * for-6.1/io_uring: (56 commits)
        io_uring/net: fix notif cqe reordering
        io_uring/net: don't update msg_name if not provided
        io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
        io_uring/rw: defer fsnotify calls to task context
        io_uring/net: fix fast_iov assignment in io_setup_async_msg()
        io_uring/net: fix non-zc send with address
        io_uring/net: don't skip notifs for failed requests
        io_uring/rw: don't lose short results on io_setup_async_rw()
        io_uring/rw: fix unexpected link breakage
        io_uring/net: fix cleanup double free free_iov init
        io_uring: fix CQE reordering
        io_uring/net: fix UAF in io_sendrecv_fail()
        selftest/net: adjust io_uring sendzc notif handling
        io_uring: ensure local task_work marks task as running
        io_uring/net: zerocopy sendmsg
        io_uring/net: combine fail handlers
        io_uring/net: rename io_sendzc()
        io_uring/net: support non-zerocopy sendto
        io_uring/net: refactor io_setup_async_addr
        io_uring/net: don't lose partial send_zc on fail
        ...
    • Merge branch 'for-6.1/block' into for-6.1/passthrough · 736feaa3
      Jens Axboe authored
      * for-6.1/block: (162 commits)
        sbitmap: fix lockup while swapping
        block: add rationale for not using blk_mq_plug() when applicable
        block: adapt blk_mq_plug() to not plug for writes that require a zone lock
        s390/dasd: use blk_mq_alloc_disk
        blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep
        nvmet: don't look at the request_queue in nvmet_bdev_set_limits
        nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
        blk-mq: use quiesced elevator switch when reinitializing queues
        block: replace blk_queue_nowait with bdev_nowait
        nvme: remove nvme_ctrl_init_connect_q
        nvme-loop: use the tagset alloc/free helpers
        nvme-loop: store the generic nvme_ctrl in set->driver_data
        nvme-loop: initialize sqsize later
        nvme-fc: use the tagset alloc/free helpers
        nvme-fc: store the generic nvme_ctrl in set->driver_data
        nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
        nvme-rdma: use the tagset alloc/free helpers
        nvme-rdma: store the generic nvme_ctrl in set->driver_data
        nvme-tcp: use the tagset alloc/free helpers
        nvme-tcp: store the generic nvme_ctrl in set->driver_data
        ...
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 29 Sep, 2022 9 commits
    • sbitmap: fix lockup while swapping · 30514bd2
      Hugh Dickins authored
      Commit 4acb8341 ("sbitmap: fix batched wait_cnt accounting")
      is a big improvement: without it, I had to revert to before commit
      040b83fc ("sbitmap: fix possible io hung due to lost wakeup")
      to avoid the high system time and freezes which that had introduced.
      
      Now okay on the NVMe laptop, but 4acb8341 is a disaster for heavy
      swapping (kernel builds in low memory) on another: soon locking up in
      sbitmap_queue_wake_up() (into which __sbq_wake_up() is inlined), cycling
      around with waitqueue_active() but wait_cnt 0.  Here is a backtrace,
      showing the common pattern of outer sbitmap_queue_wake_up() interrupted
      before setting wait_cnt 0 back to wake_batch (in some cases other CPUs
      are idle, in other cases they're spinning for a lock in dd_bio_merge()):
      
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < __blk_mq_end_request <
      scsi_end_request < scsi_io_completion < scsi_finish_command <
      scsi_complete < blk_complete_reqs < blk_done_softirq < __do_softirq <
      __irq_exit_rcu < irq_exit_rcu < common_interrupt < asm_common_interrupt <
      _raw_spin_unlock_irqrestore < __wake_up_common_lock < __wake_up <
      sbitmap_queue_wake_up < sbitmap_queue_clear < blk_mq_put_tag <
      __blk_mq_free_request < blk_mq_free_request < dd_bio_merge <
      blk_mq_sched_bio_merge < blk_mq_attempt_bio_merge < blk_mq_submit_bio <
      __submit_bio < submit_bio_noacct_nocheck < submit_bio_noacct <
      submit_bio < __swap_writepage < swap_writepage < pageout <
      shrink_folio_list < evict_folios < lru_gen_shrink_lruvec <
      shrink_lruvec < shrink_node < do_try_to_free_pages < try_to_free_pages <
      __alloc_pages_slowpath < __alloc_pages < folio_alloc < vma_alloc_folio <
      do_anonymous_page < __handle_mm_fault < handle_mm_fault <
      do_user_addr_fault < exc_page_fault < asm_exc_page_fault
      
      See how the process-context sbitmap_queue_wake_up() has been interrupted,
      after bringing wait_cnt down to 0 (and in this example, after doing its
      wakeups), before advancing wake_index and refilling wake_cnt: an
      interrupt-context sbitmap_queue_wake_up() of the same sbq gets stuck.
      
      I have almost no grasp of all the possible sbitmap races, and their
      consequences: but __sbq_wake_up() can do nothing useful while wait_cnt 0,
      so it is better if sbq_wake_ptr() skips on to the next ws in that case:
      which fixes the lockup and shows no adverse consequence for me.
      
      The check for wait_cnt being 0 is obviously racy, and ultimately can lead
      to lost wakeups: for example, when there is only a single waitqueue with
      waiters.  However, lost wakeups are unlikely to matter in these cases,
      and a proper fix requires redesign (and benchmarking) of the batched
      wakeup code: so let's plug the hole with this bandaid for now.
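
      As an illustration only (a hedged sketch using the 6.0-era sbitmap field
      names such as wait_cnt, wake_index and SBQ_WAIT_QUEUES, not the literal
      patch), the idea is for sbq_wake_ptr() to refuse to hand back a wait
      queue whose batch counter has already been drained to zero:

          static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
          {
                  int i, wake_index;

                  if (!atomic_read(&sbq->ws_active))
                          return NULL;

                  wake_index = atomic_read(&sbq->wake_index);
                  for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
                          struct sbq_wait_state *ws = &sbq->ws[wake_index];

                          /*
                           * Skip a ws whose wait_cnt was drained to 0 but not
                           * yet refilled to wake_batch: the (interrupted)
                           * caller that drained it owns the refill, and
                           * spinning on this ws is exactly the lockup seen.
                           */
                          if (waitqueue_active(&ws->wait) &&
                              atomic_read(&ws->wait_cnt) > 0) {
                                  if (wake_index != atomic_read(&sbq->wake_index))
                                          atomic_set(&sbq->wake_index, wake_index);
                                  return ws;
                          }

                          wake_index = sbq_index_inc(wake_index);
                  }

                  return NULL;
          }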
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/9c2038a7-cdc5-5ee-854c-fbc6168bf16@google.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: fix notif cqe reordering · 108893dd
      Pavel Begunkov authored
      Send zc is not restricted to !IO_URING_F_UNLOCKED anymore, so we can't
      use the task-tw ordering trick to order notification CQEs with request
      completions. In this case, leave the notification alone and let
      io_send_zc_cleanup() flush it.
      
      Cc: stable@vger.kernel.org
      Fixes: 53bdc88a ("io_uring/notif: order notif vs send CQEs")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: don't update msg_name if not provided · 6f10ae8a
      Pavel Begunkov authored
      io_sendmsg_copy_hdr() may clear msg->msg_name if userspace didn't
      provide it; we should retain NULL in that case.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL · 46a525e1
      Jens Axboe authored
      This isn't a reliable mechanism to tell if we have task_work pending; we
      really should be looking at whether we have any items queued. This is
      problematic if forward progress is gated on running said task_work. One
      such example is reading from a pipe, where the write side has been closed
      right before the read is started. The fput() of the file queues TWA_RESUME
      task_work, and we need that task_work to be run before ->release() is
      called for the pipe. If ->release() isn't called, then the read will sit
      forever waiting on data that will never arrive.
      
      Fix this by having io_run_task_work() check whether we have task_work
      pending rather than relying on TIF_NOTIFY_SIGNAL. The latter obviously
      doesn't work for task_work that is queued without TWA_SIGNAL.
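
      A minimal sketch of the idea (assuming the task_work_pending() and
      clear_notify_signal() helpers of this kernel era; not necessarily the
      exact patch):

          static inline bool io_run_task_work(void)
          {
                  /*
                   * Gate on actually queued task_work rather than
                   * TIF_NOTIFY_SIGNAL: work added with TWA_RESUME never
                   * sets that flag.
                   */
                  if (task_work_pending(current)) {
                          __set_current_state(TASK_RUNNING);
                          /* Clear the notify flag if it happens to be set. */
                          if (test_thread_flag(TIF_NOTIFY_SIGNAL))
                                  clear_notify_signal();
                          task_work_run();
                          return true;
                  }

                  return false;
          }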
      Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org>
      Cc: stable@vger.kernel.org
      Link: https://github.com/axboe/liburing/issues/665
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/rw: defer fsnotify calls to task context · b000145e
      Jens Axboe authored
      We can't call these off the kiocb completion as that might be in
      soft/hard irq context. Defer the calls to when we process the
      task_work for this request. That avoids valid complaints like:
      
      stack backtrace:
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_usage_bug kernel/locking/lockdep.c:3961 [inline]
       valid_state kernel/locking/lockdep.c:3973 [inline]
       mark_lock_irq kernel/locking/lockdep.c:4176 [inline]
       mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632
       mark_lock kernel/locking/lockdep.c:4596 [inline]
       mark_usage kernel/locking/lockdep.c:4527 [inline]
       __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007
       lock_acquire kernel/locking/lockdep.c:5666 [inline]
       lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631
       __fs_reclaim_acquire mm/page_alloc.c:4674 [inline]
       fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688
       might_alloc include/linux/sched/mm.h:271 [inline]
       slab_pre_alloc_hook mm/slab.h:700 [inline]
       slab_alloc mm/slab.c:3278 [inline]
       __kmem_cache_alloc_lru mm/slab.c:3471 [inline]
       kmem_cache_alloc+0x39/0x520 mm/slab.c:3491
       fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline]
       fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline]
       fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948
       send_to_group fs/notify/fsnotify.c:360 [inline]
       fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570
       __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230
       fsnotify_parent include/linux/fsnotify.h:77 [inline]
       fsnotify_file include/linux/fsnotify.h:99 [inline]
       fsnotify_access include/linux/fsnotify.h:309 [inline]
       __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195
       io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228
       iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline]
       iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178
       bio_endio+0x5f9/0x780 block/bio.c:1564
       req_bio_endio block/blk-mq.c:695 [inline]
       blk_update_request+0x3fc/0x1300 block/blk-mq.c:825
       scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541
       scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971
       scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438
       blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022
       __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571
       invoke_softirq kernel/softirq.c:445 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:662
       common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240
      
      Fixes: f63cf519 ("io_uring: ensure that fsnotify is always called")
      Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/
      Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com
      Reported-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add rationale for not using blk_mq_plug() when applicable · 110fdb44
      Pankaj Raghav authored
      There are two places in the block layer at the moment where the
      blk_mq_plug() helper could be used instead of directly accessing the
      plug from the current task. In both of these cases, directly accessing
      the plug should not have any consequences for zoned devices.
      
      Make the intent explicit by adding comments instead of introducing
      unwanted checks with the blk_mq_plug() helper. [1]
      
      [1] https://lore.kernel.org/linux-block/f6e54907-1035-2b2c-6387-ed178be05ccb@kernel.dk/
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Link: https://lore.kernel.org/r/20220929144141.140077-1-p.raghav@samsung.com
      [axboe: fixup multi-line comment style]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: adapt blk_mq_plug() to not plug for writes that require a zone lock · 8cafdb5a
      Pankaj Raghav authored
      The current implementation of blk_mq_plug() disables plugging for all
      operations that involve a transfer to the device, since op_is_write()
      just checks the last bit of the operation code.
      
      Modify blk_mq_plug() to disable plugging only for REQ_OP_WRITE and
      REQ_OP_WRITE_ZEROES, as they might require a zone lock.
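
      A sketch of the resulting check (the bdev_op_is_zoned_write() helper
      name and exact shape are an assumption here, not a quote of the patch):

          /* Only zoned writes need to bypass the plug for zone locking. */
          static inline bool bdev_op_is_zoned_write(struct block_device *bdev,
                                                    enum req_op op)
          {
                  if (!bdev_is_zoned(bdev))
                          return false;

                  return op == REQ_OP_WRITE || op == REQ_OP_WRITE_ZEROES;
          }

          static inline struct blk_plug *blk_mq_plug(struct bio *bio)
          {
                  /* Zoned write: do not plug, it may need a zone write lock. */
                  if (bdev_op_is_zoned_write(bio->bi_bdev, bio_op(bio)))
                          return NULL;

                  /* Everything else keeps using the per-task plug, if any. */
                  return current->plug;
          }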
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Suggested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
      Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20220929074745.103073-2-p.raghav@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: fix fast_iov assignment in io_setup_async_msg() · 3e4cb6eb
      Stefan Metzmacher authored
      I hit a very bad problem during my tests of SENDMSG_ZC.
      BUG(); in first_iovec_segment() triggered very easily.
      The problem was io_setup_async_msg() in the partial retry case,
      which seems to happen more often with _ZC.
      
      iov_iter_iovec_advance() may change i->iov so that i->iov_offset is only
      relative to the first element.
      
      Which means kmsg->msg.msg_iter.iov is no longer the
      same as kmsg->fast_iov.
      
      But this would rewind the copy to be the start of async_msg->fast_iov,
      which means the internal state of async_msg->msg.msg_iter is
      inconsistent.
      
      I tested with 5 vectors with lengths like this: 4, 0, 64, 20, 8388608,
      and got short writes with:
      - ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244
      - ret=-EAGAIN => io_uring_poll_arm
      - ret=4911225 min_ret=5713448 => remaining 802223  sr->done_io=7586469
      - ret=-EAGAIN => io_uring_poll_arm
      - ret=802223  min_ret=802223  => res=8388692
      
      While this was easily triggered with SENDMSG_ZC (queued for 6.1),
      it was a potential problem starting with 7ba89d2a
      in 5.18 for IORING_OP_RECVMSG.
      And also with 4c3c0943 in 5.19
      for IORING_OP_SENDMSG.
      
      However 257e84a5 introduced the critical
      code into io_setup_async_msg() in 5.11.
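
      A sketch of the kind of fix needed in io_setup_async_msg() (field names
      as in the 6.0-era io_uring code; treat the exact surrounding logic as an
      assumption rather than the literal patch):

          /*
           * The iter may have advanced within kmsg->fast_iov, so msg_iter.iov
           * is not necessarily &kmsg->fast_iov[0] anymore.  Point the copied
           * iter at the matching element of async_msg->fast_iov instead of
           * rewinding it to the start, which corrupts the iterator state.
           */
          if (!kmsg->free_iov) {
                  size_t fast_idx = kmsg->msg.msg_iter.iov - kmsg->fast_iov;

                  async_msg->msg.msg_iter.iov = &async_msg->fast_iov[fast_idx];
          }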
      
      Fixes: 7ba89d2a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly")
      Fixes: 257e84a5 ("io_uring: refactor sendmsg/recvmsg iov managing")
      Cc: stable@vger.kernel.org
      Signed-off-by: Stefan Metzmacher <metze@samba.org>
      Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: fix non-zc send with address · 04360d3e
      Pavel Begunkov authored
      We're currently ignoring the dest address with non-zerocopy send: even
      though we copy it from userspace, ->msg_name gets zeroed shortly
      afterwards. Move the msghdr init earlier.
      
      Fixes: 516e82f0 ("io_uring/net: support non-zerocopy sendto")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 28 Sep, 2022 3 commits
    • Merge tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme into for-6.1/block · dfdcbf1f
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "nvme updates for Linux 6.1
      
       - handle effects after freeing the request (Keith Busch)
       - copy firmware_rev on each init (Keith Busch)
       - restrict management ioctls to admin (Keith Busch)
       - ensure subsystem reset is single threaded (Keith Busch)
       - report the actual number of tagset maps in nvme-pci (Keith Busch)
       - small fabrics authentication fixups (Christoph Hellwig)
       - add common code for tagset allocation and freeing (Christoph Hellwig)
       - stop using the request_queue in nvmet (Christoph Hellwig)
       - set min_align_mask before calculating max_hw_sectors
         (Rishabh Bhatnagar)
       - send a rediscover uevent when a persistent discovery controller
         reconnects (Sagi Grimberg)
       - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi)"
      
      * tag 'nvme-6.1-2022-09-28' of git://git.infradead.org/nvme: (31 commits)
        nvmet: don't look at the request_queue in nvmet_bdev_set_limits
        nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all
        nvme: remove nvme_ctrl_init_connect_q
        nvme-loop: use the tagset alloc/free helpers
        nvme-loop: store the generic nvme_ctrl in set->driver_data
        nvme-loop: initialize sqsize later
        nvme-fc: use the tagset alloc/free helpers
        nvme-fc: store the generic nvme_ctrl in set->driver_data
        nvme-fc: keep ctrl->sqsize in sync with opts->queue_size
        nvme-rdma: use the tagset alloc/free helpers
        nvme-rdma: store the generic nvme_ctrl in set->driver_data
        nvme-tcp: use the tagset alloc/free helpers
        nvme-tcp: store the generic nvme_ctrl in set->driver_data
        nvme-tcp: remove the unused queue_size member in nvme_tcp_queue
        nvme: add common helpers to allocate and free tagsets
        nvme-auth: add a MAINTAINERS entry
        nvmet: add helpers to set the result field for connect commands
        nvme: improve the NVME_CONNECT_AUTHREQ* definitions
        nvmet-auth: don't try to cancel a non-initialized work_struct
        nvmet-tcp: remove nvmet_tcp_finish_cmd
        ...
    • s390/dasd: use blk_mq_alloc_disk · c68f4f4e
      Christoph Hellwig authored
      As far as I can tell there is no need for the staged setup in
      dasd, so allocate the tagset and the disk with the queue in
      dasd_gendisk_alloc.
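
      Roughly the pattern being adopted (a hedged sketch of the combined
      allocation; the block->tag_set and block->gdp names are illustrative,
      and error unwinding is trimmed):

          rc = blk_mq_alloc_tag_set(&block->tag_set);
          if (rc)
                  return rc;

          /* Allocate gendisk and request_queue in one step off the tagset. */
          gdp = blk_mq_alloc_disk(&block->tag_set, block);
          if (IS_ERR(gdp)) {
                  blk_mq_free_tag_set(&block->tag_set);
                  return PTR_ERR(gdp);
          }
          block->gdp = gdp;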
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
      Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220928143945.1687114-2-sth@linux.ibm.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • io_uring/net: don't skip notifs for failed requests · 6ae91ac9
      Pavel Begunkov authored
      We currently only add a notification CQE when the send succeeded, i.e.
      cqe.res >= 0. However, it'd be more robust to do buffer notifications
      for failed requests as well, in case drivers decide to do something funky.
      
      Always return a buffer notification after initial prep, don't hide it.
      This behaviour is better aligned with documentation and the patch also
      helps userspace to respect it.
      
      Cc: stable@vger.kernel.org # 6.0
      Suggested-by: Stefan Metzmacher <metze@samba.org>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 27 Sep, 2022 25 commits