1. 09 Nov, 2022 3 commits
  2. 08 Nov, 2022 2 commits
    • block: sed-opal: kmalloc the cmd/resp buffers · f829230d
      Serge Semin authored
      In accordance with [1], DMA-able memory buffers must be
      cacheline-aligned; otherwise the cache write-back and invalidation
      performed during the mapping may cause adjacent data to be lost. This is
      specifically required for DMA-noncoherent platforms [2]. Since the
      opal_dev.{cmd,resp} buffers are implicitly used for DMA in the NVME and
      SCSI/SD drivers, via the nvme_sec_submit() and sd_sec_submit() methods
      respectively, they must be cacheline-aligned to prevent this problem.
      One option to guarantee that is to kmalloc the buffers [2]. Let's
      explicitly allocate them instead of embedding them into the opal_dev
      structure instance.
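
A minimal userspace sketch of the idea, not the kernel patch itself: the command and response buffers become separately allocated, cacheline-aligned blocks instead of arrays embedded next to other fields. Here aligned_alloc() stands in for kmalloc(), and struct opal_dev_model, CACHELINE_SIZE, the 2048-byte IO_BUFFER_LENGTH value and the helper names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHELINE_SIZE   64u    /* assumed cache line size */
#define IO_BUFFER_LENGTH 2048u  /* illustrative size, a multiple of CACHELINE_SIZE */

/* Hypothetical stand-in for struct opal_dev: buffers are pointers, not arrays. */
struct opal_dev_model {
	int state;      /* unrelated data that must not share a cache line with a buffer */
	uint8_t *cmd;   /* command buffer handed to the transport for DMA */
	uint8_t *resp;  /* response buffer handed to the transport for DMA */
};

/* Returns non-zero when p starts on a cache line boundary. */
static int is_cacheline_aligned(const void *p)
{
	return ((uintptr_t)p % CACHELINE_SIZE) == 0;
}

/* Release both buffers and clear the pointers. */
static void opal_model_free(struct opal_dev_model *dev)
{
	free(dev->cmd);
	free(dev->resp);
	dev->cmd = NULL;
	dev->resp = NULL;
}

/* Allocate each buffer on its own cache lines; returns 0 on success, -1 on failure. */
static int opal_model_init(struct opal_dev_model *dev)
{
	dev->state = 0;
	dev->cmd = aligned_alloc(CACHELINE_SIZE, IO_BUFFER_LENGTH);
	dev->resp = aligned_alloc(CACHELINE_SIZE, IO_BUFFER_LENGTH);
	if (!dev->cmd || !dev->resp) {
		opal_model_free(dev);
		return -1;
	}
	return 0;
}
```

Because each buffer occupies whole cache lines of its own, invalidating or writing back those lines cannot touch neighbouring fields such as state.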
      
      Note this fix was inspired by the commit c94b7f9b ("nvme-hwmon:
      kmalloc the NVME SMART log buffer").
      
      [1] Documentation/core-api/dma-api.rst
      [2] Documentation/core-api/dma-api-howto.rst
      
      Fixes: 455a7b23 ("block: Add Sed-opal library")
      Signed-off-by: Serge Semin <Sergey.Semin@baikalelectronics.ru>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221107203944.31686-1-Sergey.Semin@baikalelectronics.ru
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: fix null pointer dereference in bfq_bio_bfqg() · f02be900
      Yu Kuai authored
      Our test found the following problem in kernel 5.10, and the same problem
      should exist in mainline:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000094
      PGD 0 P4D 0
      Oops: 0000 [#1] SMP
      CPU: 7 PID: 155 Comm: kworker/7:1 Not tainted 5.10.0-01932-g19e0ace2ca1d-dirty 4
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-b4
      Workqueue: kthrotld blk_throtl_dispatch_work_fn
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       bfq_bic_update_cgroup+0x3c/0x350
       ? ioc_create_icq+0x42/0x270
       bfq_init_rq+0xfd/0x1060
       bfq_insert_requests+0x20f/0x1cc0
       ? ioc_create_icq+0x122/0x270
       blk_mq_sched_insert_requests+0x86/0x1d0
       blk_mq_flush_plug_list+0x193/0x2a0
       blk_flush_plug_list+0x127/0x170
       blk_finish_plug+0x31/0x50
       blk_throtl_dispatch_work_fn+0x151/0x190
       process_one_work+0x27c/0x5f0
       worker_thread+0x28b/0x6b0
       ? rescuer_thread+0x590/0x590
       kthread+0x153/0x1b0
       ? kthread_flush_work+0x170/0x170
       ret_from_fork+0x1f/0x30
      Modules linked in:
      CR2: 0000000000000094
      ---[ end trace e2e59ac014314547 ]---
      RIP: 0010:bfq_bio_bfqg+0x52/0xc0
      Code: 94 00 00 00 00 75 2e 48 8b 40 30 48 83 05 35 06 c8 0b 01 48 85 c0 74 3d 4b
      RSP: 0018:ffffc90001a1fba0 EFLAGS: 00010002
      RAX: ffff888100d60400 RBX: ffff8881132e7000 RCX: 0000000000000000
      RDX: 0000000000000017 RSI: ffff888103580a18 RDI: ffff888103580a18
      RBP: ffff8881132e7000 R08: 0000000000000000 R09: ffffc90001a1fe10
      R10: 0000000000000a20 R11: 0000000000034320 R12: 0000000000000000
      R13: ffff888103580a18 R14: ffff888114447000 R15: 0000000000000000
      FS:  0000000000000000(0000) GS:ffff88881fdc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000094 CR3: 0000000100cdb000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Root cause is quite complex:
      
      1) use bfq elevator for the test device.
      2) create a cgroup CG
      3) config blk throtl in CG
      
         blkg_conf_prep
          blkg_create
      
      4) create a thread T1 and issue async io in CG:
      
         bio_init
          bio_associate_blkg
         ...
         submit_bio
          submit_bio_noacct
           blk_throtl_bio -> io is throttled
           // io submit is done
      
      5) switch elevator:
      
         bfq_exit_queue
          blkcg_deactivate_policy
           list_for_each_entry(blkg, &q->blkg_list, q_node)
            blkg->pd[] = NULL
            // bfq policy is removed
      
      6) thread T1 exits, then remove the cgroup CG:

         blkcg_unpin_online
          blkcg_destroy_blkgs
           blkg_destroy
            list_del_init(&blkg->q_node)
            // blkg is removed from queue list

      7) switch elevator back to bfq:

         bfq_init_queue
          bfq_create_group_hierarchy
           blkcg_activate_policy
            list_for_each_entry_reverse(blkg, &q->blkg_list)
             // blkg is removed from list, hence bfq policy is still NULL

      8) throttled io is dispatched to bfq:

         bfq_insert_requests
          bfq_init_rq
           bfq_bic_update_cgroup
            bfq_bio_bfqg
             bfqg = blkg_to_bfqg(blkg)
             // bfqg is NULL because bfq policy is NULL
      
      The problem is only possible in bfq because only the bfq policy can be
      deactivated and activated while the queue is online; other policies can
      only be deactivated when the device is removed.
      
      Fix the problem in bfq by checking if blkg is online before calling
      blkg_to_bfqg().
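
The check described above can be sketched in plain C as follows. This is a simplified userspace model under stated assumptions, not the kernel code: struct blkg_model, struct bfq_group_model, BFQ_POLICY and bio_bfqg_model() are hypothetical names, and the fallback to a root group is part of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_POLICIES 4
#define BFQ_POLICY   0  /* illustrative policy slot index */

struct bfq_group_model {
	int id;
};

/* Hypothetical stand-in for a blkg: per-policy data plus an online flag. */
struct blkg_model {
	bool online;                /* false once the blkg has been destroyed */
	void *pd[MAX_POLICIES];     /* policy data, NULL when the policy is inactive */
	struct blkg_model *parent;
};

/* Look up the bfq policy data of a blkg; may be NULL. */
static struct bfq_group_model *blkg_to_bfqg_model(struct blkg_model *blkg)
{
	return blkg->pd[BFQ_POLICY];
}

/*
 * Walk up the hierarchy, skipping offline blkgs before touching their
 * policy data, and fall back to the root group if nothing usable is found.
 */
static struct bfq_group_model *bio_bfqg_model(struct blkg_model *blkg,
					      struct bfq_group_model *root)
{
	while (blkg) {
		if (!blkg->online) {
			blkg = blkg->parent;
			continue;
		}
		struct bfq_group_model *bfqg = blkg_to_bfqg_model(blkg);
		if (bfqg)
			return bfqg;
		blkg = blkg->parent;
	}
	return root;
}
```

In this model an offline blkg whose policy data was never re-created is simply skipped, so the caller never dereferences a NULL group.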
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20221108103434.2853269-1-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 01 Nov, 2022 1 commit
  4. 31 Oct, 2022 6 commits
  5. 28 Oct, 2022 1 commit
  6. 27 Oct, 2022 3 commits
    • Merge tag 'nvme-6.1-2022-10-27' of git://git.infradead.org/nvme into block-6.1 · dea31328
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.1
      
       - make the multipath dma alignment match the non-multipath one
         (Keith Busch)
       - fix a bogus use of sg_init_marker() (Nam Cao)
       - fix circular locking in nvme-tcp (Sagi Grimberg)"
      
      * tag 'nvme-6.1-2022-10-27' of git://git.infradead.org/nvme:
        nvme-multipath: set queue dma alignment to 3
        nvme-tcp: fix possible circular locking when deleting a controller under memory pressure
        nvme-tcp: replace sg_init_marker() with sg_init_table()
    • blk-mq: don't add non-pt request with ->end_io to batch · 2d87d455
      Ming Lei authored
      dm-rq implements the ->end_io callback for requests issued to the
      underlying queue, and these are not passthrough requests.
      
      Commit ab3e1d3b ("block: allow end_io based requests in the completion
      batch handling") doesn't clear rq->bio and rq->__data_len for requests
      with ->end_io in blk_mq_end_request_batch(). This is actually dangerous,
      but so far it has only applied to nvme passthrough requests.
      
      dm-rq needs to clean up the remaining bios in case of partial completion,
      which requires req->bio; a use-after-free is then triggered, so the
      underlying clone request can't be completed in blk_mq_end_request_batch().
      
      Fix the panic by not adding such requests to the batch list. The issue
      can be triggered simply by exposing nvme pci to dm-mpath.
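
The rule stated in the subject line can be modelled as a small predicate. This is a sketch under assumptions: struct request_model and can_add_to_batch() are hypothetical names, and the real batching check involves further conditions that are left out here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of a request: only the two properties that matter here. */
struct request_model {
	bool has_end_io;   /* the request carries an ->end_io callback */
	bool passthrough;  /* the request is a passthrough request */
};

/*
 * A request with ->end_io may join the completion batch only if it is a
 * passthrough request; requests without ->end_io are unaffected by this rule.
 */
static bool can_add_to_batch(const struct request_model *rq)
{
	if (rq->has_end_io && !rq->passthrough)
		return false;
	return true;
}
```

A dm-rq clone request would have has_end_io set and passthrough clear in this model, so it is completed individually rather than through the batch path.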
      
      Fixes: ab3e1d3b ("block: allow end_io based requests in the completion batch handling")
      Cc: dm-devel@redhat.com
      Cc: Mike Snitzer <snitzer@kernel.org>
      Reported-by: Changhui Zhong <czhong@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20221027085709.513175-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • rbd: fix possible memory leak in rbd_sysfs_init() · 7f21735f
      Yang Yingliang authored
      If device_register() returns an error in rbd_sysfs_init(), the kobject
      name allocated by dev_set_name(), which is called from device_add(), is
      leaked.
      
      As the comment on device_add() says, the caller should call put_device()
      on failure to drop the reference count that was set in
      device_initialize(), so that the name can be freed in kobject_cleanup().
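
The reference-counting pattern behind this fix can be illustrated with a small userspace model. It is only a sketch: struct device_model and the *_model helpers are hypothetical stand-ins for the driver core, and the inject_failure flag imitates a fault injected after the name has been allocated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct device: a refcount and an allocated name. */
struct device_model {
	int refcount;
	char *name;
};

/* Models device_initialize(): the device starts with one reference. */
static void device_initialize_model(struct device_model *dev)
{
	dev->refcount = 1;
	dev->name = NULL;
}

/* Models put_device(): dropping the last reference frees the name. */
static void put_device_model(struct device_model *dev)
{
	if (--dev->refcount == 0) {
		free(dev->name);
		dev->name = NULL;
	}
}

/* Models device_add(): the name is allocated first, then a later step may fail. */
static int device_add_model(struct device_model *dev, const char *name,
			    bool inject_failure)
{
	size_t len = strlen(name) + 1;

	dev->name = malloc(len);
	if (!dev->name)
		return -1;
	memcpy(dev->name, name, len);
	return inject_failure ? -1 : 0;
}

/* Register the device and drop the initial reference if registration fails. */
static int device_register_checked(struct device_model *dev, const char *name,
				   bool inject_failure)
{
	int ret;

	device_initialize_model(dev);
	ret = device_add_model(dev, name, inject_failure);
	if (ret)
		put_device_model(dev);  /* without this, dev->name would leak */
	return ret;
}
```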
      
      A fault injection test can trigger this problem:
      
      unreferenced object 0xffff88810173aa78 (size 8):
        comm "modprobe", pid 247, jiffies 4294714278 (age 31.789s)
        hex dump (first 8 bytes):
          72 62 64 00 81 88 ff ff                          rbd.....
        backtrace:
          [<00000000f58fae56>] __kmalloc_node_track_caller+0x44/0x1b0
          [<00000000bdd44fe7>] kstrdup+0x3a/0x70
          [<00000000f7844d0b>] kstrdup_const+0x63/0x80
          [<000000001b0a0eeb>] kvasprintf_const+0x10b/0x190
          [<00000000a47bd894>] kobject_set_name_vargs+0x56/0x150
          [<00000000d5edbf18>] dev_set_name+0xab/0xe0
          [<00000000f5153e80>] device_add+0x106/0x1f20
      
      Fixes: dfc5606d ("rbd: replace the rbd sysfs interface")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Alex Elder <elder@linaro.org>
      Link: https://lore.kernel.org/r/20221027091918.2294132-1-yangyingliang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 25 Oct, 2022 3 commits
    • nvme-multipath: set queue dma alignment to 3 · fe8714b0
      Keith Busch authored
      NVMe spec requires all transports support dword aligned addresses, which
      is already set in the namespace request_queue. Set the same limit in the
      multipath device's request_queue as well.
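
To make the value 3 concrete: it acts as a bit mask, so an address and length pass when their two lowest bits are clear, that is, when they are dword (4-byte) aligned. The sketch below assumes that mask interpretation; dma_aligned() and NVME_DMA_ALIGN_MASK are illustrative names, not kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NVME_DMA_ALIGN_MASK 3u  /* illustrative name for the mask value from the subject line */

/* True when both the start address and the length have no bits set in mask. */
static bool dma_aligned(uintptr_t addr, size_t len, unsigned int mask)
{
	return ((addr | len) & mask) == 0;
}
```

For example, dma_aligned(0x1004, 8, NVME_DMA_ALIGN_MASK) holds, while the same buffer would be rejected under a stricter mask such as 511.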
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: fix possible circular locking when deleting a controller under memory pressure · 83e1226b
      Sagi Grimberg authored
      When destroying a queue, the call to sock_release may require the
      network stack to allocate an skb to send a FIN/RST. When that happens
      during memory pressure, memory needs to be reclaimed, which in turn may
      ask the nvme-tcp device to write out dirty pages. However, this is not
      possible because a ctrl teardown is in progress.
      
      Set PF_MEMALLOC on the task that releases the socket to grant it access
      to PF_MEMALLOC reserves. In addition, do the same for the nvme-tcp
      thread, as this may also originate from the swap itself and should
      be more resilient to memory pressure situations.
      
      This fixes the following lockdep complaint:
      --
      ======================================================
       WARNING: possible circular locking dependency detected
       6.0.0-rc2+ #25 Tainted: G        W
       ------------------------------------------------------
       kswapd0/92 is trying to acquire lock:
       ffff888114003240 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_sendpage+0x23/0xa0
      
       but task is already holding lock:
       ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (fs_reclaim){+.+.}-{0:0}:
              fs_reclaim_acquire+0x11e/0x160
              kmem_cache_alloc_node+0x44/0x530
              __alloc_skb+0x158/0x230
              tcp_send_active_reset+0x7e/0x730
              tcp_disconnect+0x1272/0x1ae0
              __tcp_close+0x707/0xd90
              tcp_close+0x26/0x80
              inet_release+0xfa/0x220
              sock_release+0x85/0x1a0
              nvme_tcp_free_queue+0x1fd/0x470 [nvme_tcp]
              nvme_do_delete_ctrl+0x130/0x13d [nvme_core]
              nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
              kernfs_fop_write_iter+0x356/0x530
              vfs_write+0x4e8/0xce0
              ksys_write+0xfd/0x1d0
              do_syscall_64+0x58/0x80
              entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
       -> #0 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
              __lock_acquire+0x2a0c/0x5690
              lock_acquire+0x18e/0x4f0
              lock_sock_nested+0x37/0xc0
              tcp_sendpage+0x23/0xa0
              inet_sendpage+0xad/0x120
              kernel_sendpage+0x156/0x440
              nvme_tcp_try_send+0x48a/0x2630 [nvme_tcp]
              nvme_tcp_queue_rq+0xefb/0x17e0 [nvme_tcp]
              __blk_mq_try_issue_directly+0x452/0x660
              blk_mq_plug_issue_direct.constprop.0+0x207/0x700
              blk_mq_flush_plug_list+0x6f5/0xc70
              __blk_flush_plug+0x264/0x410
              blk_finish_plug+0x4b/0xa0
              shrink_lruvec+0x1263/0x1ea0
              shrink_node+0x736/0x1a80
              balance_pgdat+0x740/0x10d0
              kswapd+0x5f2/0xaf0
              kthread+0x256/0x2f0
              ret_from_fork+0x1f/0x30
      
      other info that might help us debug this:
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(fs_reclaim);
                                     lock(sk_lock-AF_INET-NVME);
                                     lock(fs_reclaim);
        lock(sk_lock-AF_INET-NVME);
      
       *** DEADLOCK ***
      
      3 locks held by kswapd0/92:
       #0: ffffffff97e95ca0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0x987/0x10d0
       #1: ffff88811f21b0b0 (q->srcu){....}-{0:0}, at: blk_mq_flush_plug_list+0x6b3/0xc70
       #2: ffff888170b11470 (&queue->send_mutex){+.+.}-{3:3}, at: nvme_tcp_queue_rq+0xeb9/0x17e0 [nvme_tcp]
      
      Fixes: 3f2304f8 ("nvme-tcp: add NVMe over TCP host driver")
      Reported-by: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-tcp: replace sg_init_marker() with sg_init_table() · 5fa9add6
      Nam Cao authored
      In nvme_tcp_ddgst_update(), sg_init_marker() is called with an
      uninitialized scatterlist. This is probably fine, but gcc complains:
      
        CC [M]  drivers/nvme/host/tcp.o
      In file included from ./include/linux/dma-mapping.h:10,
                       from ./include/linux/skbuff.h:31,
                       from ./include/net/net_namespace.h:43,
                       from ./include/linux/netdevice.h:38,
                       from ./include/net/sock.h:46,
                       from drivers/nvme/host/tcp.c:12:
      In function ‘sg_mark_end’,
          inlined from ‘sg_init_marker’ at ./include/linux/scatterlist.h:356:2,
          inlined from ‘nvme_tcp_ddgst_update’ at drivers/nvme/host/tcp.c:390:2:
      ./include/linux/scatterlist.h:234:11: error: ‘sg.page_link’ is used uninitialized [-Werror=uninitialized]
        234 |         sg->page_link |= SG_END;
            |         ~~^~~~~~~~~~~
      drivers/nvme/host/tcp.c: In function ‘nvme_tcp_ddgst_update’:
      drivers/nvme/host/tcp.c:388:28: note: ‘sg’ declared here
        388 |         struct scatterlist sg;
            |                            ^~
      cc1: all warnings being treated as errors
      
      Use sg_init_table() instead, which basically memsets the scatterlist to
      zero before calling sg_init_marker().
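
The difference between the two helpers can be shown with a reduced userspace model. This is a sketch, not the kernel header: struct sg_model keeps only three fields, and the *_MODEL flag values and *_model function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <string.h>

#define SG_CHAIN_MODEL 0x01UL  /* illustrative flag: entry chains to another list */
#define SG_END_MODEL   0x02UL  /* illustrative flag: entry is the last one */

/* Hypothetical, reduced scatterlist entry. */
struct sg_model {
	unsigned long page_link;
	unsigned int offset;
	unsigned int length;
};

/* Marks the last entry only; reads page_link, so the entry must already be initialized. */
static void sg_init_marker_model(struct sg_model *sgl, unsigned int nents)
{
	sgl[nents - 1].page_link |= SG_END_MODEL;
	sgl[nents - 1].page_link &= ~SG_CHAIN_MODEL;
}

/* Zeroes every entry first, then marks the end, so no stale bits survive. */
static void sg_init_table_model(struct sg_model *sgl, unsigned int nents)
{
	memset(sgl, 0, sizeof(*sgl) * nents);
	sg_init_marker_model(sgl, nents);
}
```

With an on-stack entry, calling the marker helper alone would OR the end flag into whatever bits the stack happened to hold, which is the read the compiler warns about.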
      Signed-off-by: Nam Cao <namcaov@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  8. 22 Oct, 2022 1 commit
  9. 20 Oct, 2022 8 commits
  10. 19 Oct, 2022 8 commits
  11. 18 Oct, 2022 1 commit
  12. 16 Oct, 2022 1 commit
  13. 12 Oct, 2022 2 commits
    • Merge tag 'nvme-6.1-2022-10-12' of git://git.infradead.org/nvme into block-6.1 · 3bc429c1
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.1
      
       - add NVME_QUIRK_BOGUS_NID for Lexar NM760 (Abhijit)
       - avoid the deepest sleep state on ZHITAI TiPro5000 SSDs (Xi Ruoyao)
       - fix possible hang caused during ctrl deletion (Sagi Grimberg)
       - fix possible hang in live ns resize with ANA access (Sagi Grimberg)"
      
      * tag 'nvme-6.1-2022-10-12' of git://git.infradead.org/nvme:
        nvme-multipath: fix possible hang in live ns resize with ANA access
        nvme-pci: avoid the deepest sleep state on ZHITAI TiPro5000 SSDs
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM760
        nvme-tcp: fix possible hang caused during ctrl deletion
        nvme-rdma: fix possible hang caused during ctrl deletion
    • nvme-multipath: fix possible hang in live ns resize with ANA access · 72e3b888
      Sagi Grimberg authored
      When we revalidate paths as part of ns size change (as of commit
      e7d65803), it is possible that during the path revalidation, the
      only paths that are IO capable (i.e. optimized/non-optimized) are the
      ones whose ns resize has not yet been reported to the host, which will cause
      inflight requests to be requeued (as we have available paths but none
      are IO capable). These requests on the requeue list are waiting for
      someone to resubmit them at some point.
      
      The IO capable paths will eventually notify the ns resize change to the
      host, but there is nothing that will kick the requeue list to resubmit
      the queued requests.
      
      Fix this by always kicking the requeue list, and if no IO capable path
      exists, these requests will be queued again.
      
      A typical log that indicates that IOs are requeued:
      --
      nvme nvme1: creating 4 I/O queues.
      nvme nvme1: new ctrl: "testnqn1"
      nvme nvme2: creating 4 I/O queues.
      nvme nvme2: mapped 4/0/0 default/read/poll queues.
      nvme nvme2: new ctrl: NQN "testnqn1", addr 127.0.0.1:8009
      nvme nvme1: rescanning namespaces.
      nvme1n1: detected capacity change from 2097152 to 4194304
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      block nvme1n1: no usable path - requeuing I/O
      nvme nvme2: rescanning namespaces.
      --
      Reported-by: Yogev Cohen <yogev@lightbitslabs.com>
      Fixes: e7d65803 ("nvme-multipath: revalidate paths during rescan")
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Cc: <stable@vger.kernel.org> # v5.15+
      Signed-off-by: Christoph Hellwig <hch@lst.de>