1. 07 Apr, 2023 1 commit
    • block: don't set GD_NEED_PART_SCAN if scan partition failed · 3723091e
      Yu Kuai authored
      Currently, if disk_scan_partitions() fails, GD_NEED_PART_SCAN will still be
      set, and the partition scan will proceed again when blkdev_get_by_dev()
      is called. However, this causes a problem: re-assembling a partitioned
      raid device will create partitions for the underlying disk.
      
      Test procedure:
      
      mdadm -CR /dev/md0 -l 1 -n 2 /dev/sda /dev/sdb -e 1.0
      sgdisk -n 0:0:+100MiB /dev/md0
      blockdev --rereadpt /dev/sda
      blockdev --rereadpt /dev/sdb
      mdadm -S /dev/md0
      mdadm -A /dev/md0 /dev/sda /dev/sdb
      
      Test result: underlying disk partition and raid partition can be
      observed at the same time
      
      Note that this can still happen in some corner cases where
      GD_NEED_PART_SCAN is set for the underlying disk while the raid device
      is being re-assembled.
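
The effect of a stale flag can be illustrated with a small userspace model. This is only a sketch: `toy_disk`, `toy_open`, `toy_scan_partitions` and `TOY_NEED_PART_SCAN` are hypothetical stand-ins, not kernel code, and the `clear_on_failure` switch exists only to compare the two behaviours side by side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the GD_NEED_PART_SCAN bit in disk->state. */
#define TOY_NEED_PART_SCAN (1UL << 0)

struct toy_disk {
	unsigned long state; /* flag word */
	int scans_done;      /* how many partition scans actually ran */
};

/* Model of a later open: it rescans only if the flag is still pending. */
static void toy_open(struct toy_disk *d)
{
	assert(d != NULL);
	if (d->state & TOY_NEED_PART_SCAN) {
		d->state &= ~TOY_NEED_PART_SCAN;
		d->scans_done++;
	}
}

/*
 * Model of an explicit rescan request. open_ok simulates whether the
 * open that performs the scan succeeds. With clear_on_failure set, a
 * failed scan does not leave the flag behind.
 */
static int toy_scan_partitions(struct toy_disk *d, bool open_ok,
			       bool clear_on_failure)
{
	assert(d != NULL);
	d->state |= TOY_NEED_PART_SCAN;
	if (open_ok) {
		toy_open(d);
		return 0;
	}
	if (clear_on_failure)
		d->state &= ~TOY_NEED_PART_SCAN;
	return -1;
}
```

In this model, a failed rescan followed by an unrelated `toy_open()` triggers a scan only when the flag was left set, which mirrors the unexpected partitions on the underlying disk described above.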
      
      Fixes: e5cfefa9 ("block: fix scan partition for exclusively open device again")
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 06 Apr, 2023 3 commits
  3. 05 Apr, 2023 1 commit
  4. 04 Apr, 2023 1 commit
  5. 31 Mar, 2023 1 commit
  6. 30 Mar, 2023 2 commits
    • Merge tag 'nvme-6.3-2023-03-31' of git://git.infradead.org/nvme into block-6.3 · 1a06ed2d
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.3
      
       - mark Lexar NM760 as IGNORE_DEV_SUBNQN (Juraj Pecigos)
       - fix a possible UAF when failing to allocate a TCP io queue
         (Sagi Grimberg)"
      
      * tag 'nvme-6.3-2023-03-31' of git://git.infradead.org/nvme:
        nvme-tcp: fix a possible UAF when failing to allocate an io queue
        nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN
    • nvme-tcp: fix a possible UAF when failing to allocate an io queue · 88eaba80
      Sagi Grimberg authored
      When we allocate a nvme-tcp queue, we set the data_ready callback before
      we actually need to use it. This creates the potential that if a stray
      controller sends us data on the socket before we connect, we can trigger
      the io_work and start consuming the socket.
      
      In the reported case, we failed to allocate one of the io queues, and
      as we started releasing the queues that we had already allocated, we got
      a UAF [1] from the io_work, which was running before it should have been.
      
      Fix this by setting the socket ops callbacks only right before we start
      the queue, so that we can't accidentally schedule the io_work in the
      initialization phase before the queue has started. While we are at it,
      rename nvme_tcp_restore_sock_calls to pair with nvme_tcp_setup_sock_ops.
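
The ordering problem can be sketched with a userspace model. All names here (`toy_queue`, `toy_socket_rx`, `toy_alloc_queue`, `toy_start_queue`) are hypothetical and only mimic the roles of the socket callback and the io_work; the `install_early` switch exists purely to contrast the two orderings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_queue;
typedef void (*toy_data_ready_fn)(struct toy_queue *q);

struct toy_queue {
	toy_data_ready_fn data_ready; /* socket callback, NULL until installed */
	bool started;                 /* queue is live and may run io_work */
	int early_work;               /* io_work runs observed before start */
};

/* Stands in for the callback that schedules io_work. */
static void toy_data_ready(struct toy_queue *q)
{
	if (!q->started)
		q->early_work++; /* work ran against a queue that is not live */
}

/* Data arriving on the socket reaches the queue only via the callback. */
static void toy_socket_rx(struct toy_queue *q)
{
	assert(q != NULL);
	if (q->data_ready)
		q->data_ready(q);
}

/* install_early models hooking the callback at allocation time. */
static void toy_alloc_queue(struct toy_queue *q, bool install_early)
{
	assert(q != NULL);
	q->data_ready = install_early ? toy_data_ready : NULL;
	q->started = false;
	q->early_work = 0;
}

/* Hook the callback right before the queue goes live. */
static void toy_start_queue(struct toy_queue *q)
{
	assert(q != NULL);
	q->data_ready = toy_data_ready;
	q->started = true;
}
```

With the callback installed at allocation time, stray data during setup increments `early_work`; with the callback installed in `toy_start_queue()`, the same data is simply not delivered to the queue yet.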
      
      [1]:
      [16802.107284] nvme nvme4: starting error recovery
      [16802.109166] nvme nvme4: Reconnecting in 10 seconds...
      [16812.173535] nvme nvme4: failed to connect socket: -111
      [16812.173745] nvme nvme4: Failed reconnect attempt 1
      [16812.173747] nvme nvme4: Reconnecting in 10 seconds...
      [16822.413555] nvme nvme4: failed to connect socket: -111
      [16822.413762] nvme nvme4: Failed reconnect attempt 2
      [16822.413765] nvme nvme4: Reconnecting in 10 seconds...
      [16832.661274] nvme nvme4: creating 32 I/O queues.
      [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
      [16833.920068] nvme nvme4: Failed reconnect attempt 3
      [16833.920094] #PF: supervisor write access in kernel mode
      [16833.920261] nvme nvme4: Reconnecting in 10 seconds...
      [16833.920368] #PF: error_code(0x0002) - not-present page
      [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
      [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
      ...
      [16833.923138] Call Trace:
      [16833.923271]  <TASK>
      [16833.923402]  lock_sock_nested+0x1e/0x50
      [16833.923545]  nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
      [16833.923685]  nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
      [16833.923824]  process_one_work+0x1e8/0x390
      [16833.923969]  worker_thread+0x53/0x3d0
      [16833.924104]  ? process_one_work+0x390/0x390
      [16833.924240]  kthread+0x124/0x150
      [16833.924376]  ? set_kthread_struct+0x50/0x50
      [16833.924518]  ret_from_fork+0x1f/0x30
      [16833.924655]  </TASK>
      Reported-by: Yanjun Zhang <zhangyanjun@cestc.cn>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Yanjun Zhang <zhangyanjun@cestc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  7. 29 Mar, 2023 1 commit
    • md: fix regression for null-ptr-deference in __md_stop() · 433279be
      Yu Kuai authored
      Commit 3e453522 ("md: Free resources in __md_stop") tried to fix a
      null-ptr-dereference for 'active_io' by moving percpu_ref_exit() to
      __md_stop(). However, the commit also moved the exit of 'writes_pending'
      to __md_stop(), and this breaks the mdadm tests:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37
      RIP: 0010:free_percpu+0x465/0x670
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x48/0x70
       percpu_ref_exit+0x1a/0x90
       __md_stop+0xe9/0x170
       do_md_stop+0x1e1/0x7b0
       md_ioctl+0x90c/0x1aa0
       blkdev_ioctl+0x19b/0x400
       vfs_ioctl+0x20/0x50
       __x64_sys_ioctl+0xba/0xe0
       do_syscall_64+0x6c/0xe0
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      And the problem can be reproduced 100% of the time with the following test:
      
      mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force
      echo inactive > /sys/block/md0/md/array_state
      echo read-auto  > /sys/block/md0/md/array_state
      echo inactive > /sys/block/md0/md/array_state
      
      Root cause:
      
      // start raid
      raid1_run
       mddev_init_writes_pending
        percpu_ref_init
      
      // inactive raid
      array_state_store
       do_md_stop
        __md_stop
         percpu_ref_exit
      
      // start raid again
      array_state_store
       do_md_run
        raid1_run
         mddev_init_writes_pending
          if (mddev->writes_pending.percpu_count_ptr)
          // won't reinit
      
      // inactive raid again
      ...
      percpu_ref_exit
      -> null-ptr-dereference
      
      Before the commit, 'writes_pending' was exited when the mddev was freed,
      and it was safe to restart the raid because mddev_init_writes_pending()
      already makes sure that 'writes_pending' is only initialized once.
      
      Fix the problem by moving the exit of 'writes_pending' back. This makes it
      a little harder to see how memory allocation and freeing pair up, but the
      code change is much smaller and we have lived with this for a long time
      already.
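
The interaction between an init-once guard and an early exit can be shown with a simplified userspace model. This is a sketch under stated assumptions, not the kernel's percpu_ref implementation: `toy_ref`, `toy_init_once` and `toy_exit` are hypothetical, and the model assumes the guard field stays non-zero after exit, which is what the "won't reinit" step in the root-cause trace relies on.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_DEAD 1UL /* assumed non-zero marker left behind by exit */
#define TOY_LIVE 2UL /* any non-zero "initialised" value */

/* 'marker' plays the role of the field tested by the init-once guard. */
struct toy_ref {
	unsigned long marker;
	long *data; /* backing allocation, NULL after exit */
};

/* Initialise only if the guard field is zero. */
static int toy_init_once(struct toy_ref *r)
{
	assert(r != NULL);
	if (r->marker)
		return 0; /* "won't reinit" */
	r->data = calloc(1, sizeof(*r->data));
	if (!r->data)
		return -1;
	r->marker = TOY_LIVE;
	return 0;
}

/* Returns -1 where real code would dereference a NULL pointer. */
static int toy_exit(struct toy_ref *r)
{
	assert(r != NULL);
	if (!r->marker)
		return 0;  /* never initialised: nothing to do */
	if (!r->data)
		return -1; /* guard says "initialised" but the data is gone */
	free(r->data);
	r->data = NULL;
	r->marker = TOY_DEAD;
	return 0;
}
```

Running init, exit, init, exit on one `toy_ref` reproduces the sequence in the root-cause trace: the second init is skipped by the guard, so the second exit finds no backing data. Calling exit only once, when the object is finally freed, avoids the problem in this model.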
      
      Fixes: 3e453522 ("md: Free resources in __md_stop")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com
  8. 28 Mar, 2023 1 commit
    • nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN · 1231363a
      Juraj Pecigos authored
      A system with more than one of these SSDs will only have one usable.
      The kernel fails to detect more than one nvme device due to duplicate
      cntlids.
      
      before:
      [    9.395229] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    9.395262] nvme nvme0: pci function 0000:01:00.0
      [    9.395282] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    9.395305] nvme nvme1: pci function 0000:03:00.0
      [    9.409873] nvme nvme0: Duplicate cntlid 1 with nvme1, subsys nqn.2022-07.com.siliconmotion:nvm-subsystem-sn-                    , rejecting
      [    9.409982] nvme nvme0: Removing after probe failure status: -22
      [    9.427487] nvme nvme1: allocated 64 MiB host memory buffer.
      [    9.445088] nvme nvme1: 16/0/0 default/read/poll queues
      [    9.449898] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      after:
      [    1.161890] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    1.162660] nvme nvme0: pci function 0000:01:00.0
      [    1.162684] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    1.162707] nvme nvme1: pci function 0000:03:00.0
      [    1.191354] nvme nvme0: allocated 64 MiB host memory buffer.
      [    1.193378] nvme nvme1: allocated 64 MiB host memory buffer.
      [    1.211044] nvme nvme1: 16/0/0 default/read/poll queues
      [    1.211080] nvme nvme0: 16/0/0 default/read/poll queues
      [    1.216145] nvme nvme0: Ignoring bogus Namespace Identifiers
      [    1.216261] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk resolves the issue.
      Signed-off-by: Juraj Pecigos <kernel@juraj.dev>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  9. 27 Mar, 2023 1 commit
    • loop: LOOP_CONFIGURE: send uevents for partitions · bb430b69
      Alyssa Ross authored
      LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to
      combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall.  When
      using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for
      each partition found on the loop device after the second ioctl(), but
      when using LOOP_CONFIGURE, no such uevent was being sent.
      
      In the old setup, uevents are disabled for LOOP_SET_FD, but not for
      LOOP_SET_STATUS64.  This makes sense, as it prevents uevents being
      sent for a partially configured device during LOOP_SET_FD - they're
      only sent at the end of LOOP_SET_STATUS64.  But for LOOP_CONFIGURE,
      uevents were disabled for the entire operation, so that final
      notification was never issued.  To fix this, shrink the critical
      section so that the loop_reread_partitions() call, which causes the
      uevents to be issued, runs after uevents are re-enabled, matching
      the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination.
      
      I noticed this because Busybox's losetup program recently changed from
      using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke
      my setup, for which I want a notification from the kernel any time a
      new partition becomes available.
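
The difference between the two orderings can be sketched in a few lines of userspace C. The names (`toy_loop`, `toy_reread_partitions`, `toy_configure_old`, `toy_configure_fixed`) are hypothetical and only model the suppression flag and the rescan described above; they are not the loop driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_loop {
	bool uevent_suppress; /* stands in for the uevent suppression state */
	int nr_partitions;    /* partitions a rescan would discover */
	int uevents_sent;     /* events that actually reached userspace */
};

/* Rescan: one event per partition, unless events are suppressed. */
static void toy_reread_partitions(struct toy_loop *lo)
{
	assert(lo != NULL);
	if (!lo->uevent_suppress)
		lo->uevents_sent += lo->nr_partitions;
}

/* Ordering described for the broken path: rescan while suppressed. */
static void toy_configure_old(struct toy_loop *lo)
{
	lo->uevent_suppress = true;
	/* ... bind the backing file and apply the status here ... */
	toy_reread_partitions(lo);
	lo->uevent_suppress = false;
}

/* Ordering described for the fix: re-enable events, then rescan. */
static void toy_configure_fixed(struct toy_loop *lo)
{
	lo->uevent_suppress = true;
	/* ... bind the backing file and apply the status here ... */
	lo->uevent_suppress = false;
	toy_reread_partitions(lo);
}
```

In this model the old ordering sends no events at all for a device with partitions, while the fixed ordering sends one per partition, which is the behaviour the two-ioctl sequence had.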
      Signed-off-by: Alyssa Ross <hi@alyssa.is>
      [hch: reduced the critical section]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: 3448914e ("loop: Add LOOP_CONFIGURE ioctl")
      Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 23 Mar, 2023 1 commit
  11. 22 Mar, 2023 2 commits
  12. 21 Mar, 2023 1 commit
  13. 18 Mar, 2023 1 commit
  14. 16 Mar, 2023 2 commits
    • block: remove obsolete config BLOCK_COMPAT · 8f0d196e
      Lukas Bulwahn authored
      Before commit bdc1ddad ("compat_ioctl: block: move
      blkdev_compat_ioctl() into ioctl.c"), the config BLOCK_COMPAT was used to
      include compat_ioctl.c into the kernel build. With this commit, the code
      was moved into ioctl.c and is included with the config COMPAT. Since then,
      the config BLOCK_COMPAT has had no effect and serves no further purpose.
      
      Remove this obsolete config BLOCK_COMPAT.
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20230316111630.4897-1-lukas.bulwahn@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme into block-6.3 · 890a2fb0
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.3
      
       - avoid potential UAF in nvmet_req_complete (Damien Le Moal)
       - more quirks (Elmer Miroslav Mosher Golovin, Philipp Geulen)
       - fix a memory leak in the nvme-pci probe teardown path (Irvin Cote)
       - repair the MAINTAINERS entry (Lukas Bulwahn)
       - fix handling single range discard request (Ming Lei)
       - show more opcode names in trace events (Minwoo Im)
       - fix nvme-tcp timeout reporting (Sagi Grimberg)"
      
      * tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme:
        nvmet: avoid potential UAF in nvmet_req_complete()
        nvme-trace: show more opcode names
        nvme-tcp: add nvme-tcp pdu size build protection
        nvme-tcp: fix opcode reporting in the timeout handler
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM620
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV3000
        nvme-pci: fixing memory leak in probe teardown path
        nvme: fix handling single range discard request
        MAINTAINERS: repair malformed T: entries in NVM EXPRESS DRIVERS
  15. 15 Mar, 2023 17 commits
  16. 14 Mar, 2023 1 commit
    • block: do not reverse request order when flushing plug list · 34e0a279
      Jan Kara authored
      Commit 26fed4ac ("block: flush plug based on hardware and software
      queue order") changed flushing of plug list to submit requests one
      device at a time. However while doing that it also started using
      list_add_tail() instead of list_add() used previously thus effectively
      submitting requests in reverse order. Also when forming a rq_list with
      remaining requests (in case two or more devices are used), we
      effectively reverse the ordering of the plug list for each device we
      process. Submitting requests in reverse order has a negative impact on
      performance for rotational disks (when BFQ is not in use). We observe a
      10-25% regression in random 4k write throughput, as well as a ~20%
      regression in the MariaDB OLTP benchmark on rotational storage with a
      btrfs filesystem.
      
      Fix the problem by preserving ordering of the plug list when inserting
      requests into the queuelist as well as by appending to requeue_list
      instead of prepending to it.
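
Why the choice between prepending and appending flips the submission order can be seen in a toy model. It assumes, as the description implies but does not state outright, that the plug list is walked newest-first, so draining it by prepending restores submission order while draining it by appending keeps the reversed order. The array-backed `toy_list` and its helpers are hypothetical and are not the kernel's list API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX 8

/* Tiny array-backed list; item[0] is the head. */
struct toy_list {
	int item[TOY_MAX];
	size_t len;
};

static void toy_add_head(struct toy_list *l, int v)
{
	assert(l->len < TOY_MAX);
	memmove(&l->item[1], &l->item[0], l->len * sizeof(l->item[0]));
	l->item[0] = v;
	l->len++;
}

static void toy_add_tail(struct toy_list *l, int v)
{
	assert(l->len < TOY_MAX);
	l->item[l->len++] = v;
}

static int toy_pop_head(struct toy_list *l)
{
	int v;

	assert(l->len > 0);
	v = l->item[0];
	l->len--;
	memmove(&l->item[0], &l->item[1], l->len * sizeof(l->item[0]));
	return v;
}

/* Drain 'plug' into 'out', prepending or appending each request. */
static void toy_drain(struct toy_list *plug, struct toy_list *out, int prepend)
{
	while (plug->len) {
		int rq = toy_pop_head(plug);

		if (prepend)
			toy_add_head(out, rq);
		else
			toy_add_tail(out, rq);
	}
}
```

If requests 1 to 4 are plugged with `toy_add_head()`, the plug holds them newest-first. Draining with `prepend` set yields them in submission order, and draining with it clear yields them reversed, which is the pattern that hurts rotational disks.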
      
      Fixes: 26fed4ac ("block: flush plug based on hardware and software queue order")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  17. 13 Mar, 2023 2 commits
    • md: avoid signed overflow in slot_store() · 3bc57292
      NeilBrown authored
      slot_store() uses kstrtouint() to get a slot number, but stores the
      result in an "int" variable (by casting a pointer).
      This can result in a negative slot number if the unsigned int value is
      very large.
      
      A negative number means that the slot is empty, but setting a negative
      slot number this way will not remove the device from the array.  I don't
      think this is a serious problem, but it could cause confusion and it is
      best to fix it.
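
One way to guard against this class of problem is to parse into an unsigned type and reject anything that does not fit in a non-negative int. The sketch below is a userspace illustration only: `toy_parse_slot` is a hypothetical helper built on `strtoul()`, and the error codes it returns are an assumption rather than what the actual patch uses.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Parse a decimal slot number into *slot. Values that cannot be
 * represented as a non-negative int are rejected instead of being
 * reinterpreted as negative numbers.
 * Returns 0 on success or a negative errno value.
 */
static int toy_parse_slot(const char *buf, int *slot)
{
	char *end;
	unsigned long v;

	assert(buf != NULL && slot != NULL);
	errno = 0;
	v = strtoul(buf, &end, 10);
	if (end == buf || (*end != '\0' && *end != '\n'))
		return -EINVAL; /* no digits, or trailing garbage */
	if (errno == ERANGE || v > INT_MAX)
		return -ERANGE; /* would overflow a signed int */
	*slot = (int)v;
	return 0;
}
```

A caller would keep treating a dedicated keyword or an explicit negative value as "no slot" on a separate path, so that a very large unsigned input can never alias that meaning.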
      Reported-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: Free resources in __md_stop · 3e453522
      Xiao Ni authored
      If md_run() fails after ->active_io is initialized, percpu_ref_exit()
      is called in the error path. However, md_free_disk() later calls
      percpu_ref_exit() again, which leads to a panic because of a null pointer
      dereference. The same bug can be triggered by any resource that is
      initialized, freed in the error path, and then freed again in
      md_free_disk().
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      Workqueue: md_misc mddev_delayed_delete
      RIP: 0010:free_percpu+0x110/0x630
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x44/0x70
       percpu_ref_exit+0x16/0x90
       md_free_disk+0x2f/0x80
       disk_release+0x101/0x180
       device_release+0x84/0x110
       kobject_put+0x12a/0x380
       kobject_put+0x160/0x380
       mddev_delayed_delete+0x19/0x30
       process_one_work+0x269/0x680
       worker_thread+0x266/0x640
       kthread+0x151/0x1b0
       ret_from_fork+0x1f/0x30
      
      To create a raid device, md raid calls do_md_run->md_run and dm raid calls
      md_run. That memory is allocated in md_run. To stop a raid device, md raid
      calls do_md_stop->__md_stop and dm raid calls md_stop->__md_stop. So we
      can free those memory resources in __md_stop.
      
      Fixes: 72adae23 ("md: Change active_io to percpu")
      Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
  18. 08 Mar, 2023 1 commit