  1. 31 Mar, 2023 1 commit
  2. 30 Mar, 2023 2 commits
    • Merge tag 'nvme-6.3-2023-03-31' of git://git.infradead.org/nvme into block-6.3 · 1a06ed2d
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.3
      
       - mark Lexar NM760 as IGNORE_DEV_SUBNQN (Juraj Pecigos)
       - fix a possible UAF when failing to allocate a TCP io queue
         (Sagi Grimberg)"
      
      * tag 'nvme-6.3-2023-03-31' of git://git.infradead.org/nvme:
        nvme-tcp: fix a possible UAF when failing to allocate an io queue
        nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN
    • nvme-tcp: fix a possible UAF when failing to allocate an io queue · 88eaba80
      Sagi Grimberg authored
      When we allocate an nvme-tcp queue, we set the data_ready callback before
      we actually need to use it. This creates the potential that if a stray
      controller sends us data on the socket before we connect, we can trigger
      the io_work and start consuming the socket.

      In the reported case, we failed to allocate one of the io queues, and
      as we started releasing the queues that we had already allocated, we got
      a UAF [1] from the io_work, which was running before it should have been.

      Fix this by setting the socket ops callbacks only right before we start
      the queue, so that we can't accidentally schedule the io_work in the
      initialization phase before the queue has started. While we are at it,
      rename nvme_tcp_restore_sock_calls to pair with nvme_tcp_setup_sock_ops.
      
      [1]:
      [16802.107284] nvme nvme4: starting error recovery
      [16802.109166] nvme nvme4: Reconnecting in 10 seconds...
      [16812.173535] nvme nvme4: failed to connect socket: -111
      [16812.173745] nvme nvme4: Failed reconnect attempt 1
      [16812.173747] nvme nvme4: Reconnecting in 10 seconds...
      [16822.413555] nvme nvme4: failed to connect socket: -111
      [16822.413762] nvme nvme4: Failed reconnect attempt 2
      [16822.413765] nvme nvme4: Reconnecting in 10 seconds...
      [16832.661274] nvme nvme4: creating 32 I/O queues.
      [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
      [16833.920068] nvme nvme4: Failed reconnect attempt 3
      [16833.920094] #PF: supervisor write access in kernel mode
      [16833.920261] nvme nvme4: Reconnecting in 10 seconds...
      [16833.920368] #PF: error_code(0x0002) - not-present page
      [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
      [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
      ...
      [16833.923138] Call Trace:
      [16833.923271]  <TASK>
      [16833.923402]  lock_sock_nested+0x1e/0x50
      [16833.923545]  nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
      [16833.923685]  nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
      [16833.923824]  process_one_work+0x1e8/0x390
      [16833.923969]  worker_thread+0x53/0x3d0
      [16833.924104]  ? process_one_work+0x390/0x390
      [16833.924240]  kthread+0x124/0x150
      [16833.924376]  ? set_kthread_struct+0x50/0x50
      [16833.924518]  ret_from_fork+0x1f/0x30
      [16833.924655]  </TASK>
      Reported-by: Yanjun Zhang <zhangyanjun@cestc.cn>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Tested-by: Yanjun Zhang <zhangyanjun@cestc.cn>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
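The ordering problem described in the nvme-tcp commit above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `Queue`, `alloc_queue`, `start_queue`, `io_work` and `simulate` are hypothetical names, and the "workqueue" is a plain Python list, not the kernel's nvme_tcp_wq.

```python
class Queue:
    """Toy stand-in for an nvme-tcp queue; not the kernel structure."""

    def __init__(self):
        self.data_ready = None   # socket callback, not installed yet
        self.freed = False


def io_work(queue, log):
    # The real io_work would lock and read the socket at this point.
    log.append("use-after-free" if queue.freed else "ok")


def alloc_queue(install_early, pending):
    q = Queue()
    if install_early:
        # Pre-fix ordering: the callback is armed during allocation.
        q.data_ready = lambda: pending.append(q)
    return q


def start_queue(q, pending):
    # Post-fix ordering: the callback is armed only when the queue starts.
    q.data_ready = lambda: pending.append(q)


def simulate(install_early):
    """Stray data arrives, then a later allocation fails and q is released."""
    pending, log = [], []
    q = alloc_queue(install_early, pending)
    if q.data_ready is not None:
        q.data_ready()           # stray data schedules io_work
    q.freed = True               # error path: start_queue() is never reached
    for queued in pending:       # the scheduled work runs afterwards
        io_work(queued, log)
    return log
```

In this model `simulate(True)` returns `["use-after-free"]`, while `simulate(False)` returns an empty list because no work was ever scheduled for the unstarted queue.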
  3. 29 Mar, 2023 1 commit
    • md: fix regression for null-ptr-deference in __md_stop() · 433279be
      Yu Kuai authored
      Commit 3e453522 ("md: Free resources in __md_stop") tried to fix a
      null-ptr-dereference for 'active_io' by moving percpu_ref_exit() to
      __md_stop(). However, the commit also moved the teardown of
      'writes_pending' to __md_stop(), which breaks the mdadm tests:
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37
      RIP: 0010:free_percpu+0x465/0x670
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x48/0x70
       percpu_ref_exit+0x1a/0x90
       __md_stop+0xe9/0x170
       do_md_stop+0x1e1/0x7b0
       md_ioctl+0x90c/0x1aa0
       blkdev_ioctl+0x19b/0x400
       vfs_ioctl+0x20/0x50
       __x64_sys_ioctl+0xba/0xe0
       do_syscall_64+0x6c/0xe0
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      And the problem can be reproduced 100% of the time with the following test:
      
      mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force
      echo inactive > /sys/block/md0/md/array_state
      echo read-auto  > /sys/block/md0/md/array_state
      echo inactive > /sys/block/md0/md/array_state
      
      Root cause:
      
      // start raid
      raid1_run
       mddev_init_writes_pending
        percpu_ref_init
      
      // inactive raid
      array_state_store
       do_md_stop
        __md_stop
         percpu_ref_exit
      
      // start raid again
      array_state_store
       do_md_run
        raid1_run
         mddev_init_writes_pending
          if (mddev->writes_pending.percpu_count_ptr)
          // won't reinit
      
      // inactive raid again
      ...
      percpu_ref_exit
      -> null-ptr-dereference
      
      Before the commit, 'writes_pending' was exited when the mddev was freed,
      and it was safe to restart the raid because mddev_init_writes_pending()
      already makes sure that 'writes_pending' is only initialized once.

      Fix the problem by moving 'writes_pending' back. This makes it a little
      harder to see the relationship between allocating and freeing memory;
      however, the code change is much smaller and we have lived with this for
      a long time already.
      
      Fixes: 3e453522 ("md: Free resources in __md_stop")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com
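The root-cause call tree in the md regression commit above can be replayed with a toy model. This is a sketch, not kernel code: `PercpuRef`, `ref_init`, `ref_exit`, `init_writes_pending` and `cycle` are hypothetical names, and the model assumes that exiting the ref frees its data while leaving `percpu_count_ptr` non-zero, which is what makes the "initialise only once" guard skip the second init.

```python
class PercpuRef:
    """Toy model of the two fields that matter here; not the kernel layout."""

    def __init__(self):
        self.percpu_count_ptr = 0   # zero means "never initialised"
        self.data = None


def ref_init(ref):
    ref.percpu_count_ptr = 1
    ref.data = {"count": 1}


def ref_exit(ref):
    if ref.data is None:
        raise RuntimeError("NULL pointer dereference")
    ref.data = None                 # freed; the pointer field stays non-zero


def init_writes_pending(ref):
    # Mirrors the guard quoted in the call tree: initialise only once.
    if ref.percpu_count_ptr:
        return
    ref_init(ref)


def cycle(exit_in_stop, rounds=2):
    """Start and stop the array `rounds` times on the same mddev."""
    ref = PercpuRef()
    for _ in range(rounds):
        init_writes_pending(ref)    # start raid
        if exit_in_stop:
            ref_exit(ref)           # regression: torn down on every stop
    if not exit_in_stop:
        ref_exit(ref)               # torn down once, when the mddev is freed
    return "ok"
```

With `exit_in_stop=True` the second stop raises, which corresponds to the oops in the report; with `exit_in_stop=False` the same start/stop sequence completes.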
  4. 28 Mar, 2023 1 commit
    • nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN · 1231363a
      Juraj Pecigos authored
      A system with more than one of these SSDs will only have one usable.
      The kernel fails to detect more than one nvme device due to duplicate
      cntlids.
      
      before:
      [    9.395229] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    9.395262] nvme nvme0: pci function 0000:01:00.0
      [    9.395282] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    9.395305] nvme nvme1: pci function 0000:03:00.0
      [    9.409873] nvme nvme0: Duplicate cntlid 1 with nvme1, subsys nqn.2022-07.com.siliconmotion:nvm-subsystem-sn-                    , rejecting
      [    9.409982] nvme nvme0: Removing after probe failure status: -22
      [    9.427487] nvme nvme1: allocated 64 MiB host memory buffer.
      [    9.445088] nvme nvme1: 16/0/0 default/read/poll queues
      [    9.449898] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      after:
      [    1.161890] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    1.162660] nvme nvme0: pci function 0000:01:00.0
      [    1.162684] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    1.162707] nvme nvme1: pci function 0000:03:00.0
      [    1.191354] nvme nvme0: allocated 64 MiB host memory buffer.
      [    1.193378] nvme nvme1: allocated 64 MiB host memory buffer.
      [    1.211044] nvme nvme1: 16/0/0 default/read/poll queues
      [    1.211080] nvme nvme0: 16/0/0 default/read/poll queues
      [    1.216145] nvme nvme0: Ignoring bogus Namespace Identifiers
      [    1.216261] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk resolves the issue.
      Signed-off-by: Juraj Pecigos <kernel@juraj.dev>
      Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
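A rough way to see why the quirk above helps: controllers that report the same subsystem NQN are grouped into one subsystem, and a second controller with the same cntlid in that subsystem is rejected. The sketch below assumes that ignoring the device-reported NQN makes the host derive a name from other identity fields; `register`, the `generated:` name format and all drive values are made up for illustration and do not reproduce the kernel's real naming scheme.

```python
def register(controllers, ignore_dev_subnqn):
    """Group controllers into subsystems keyed by NQN; reject cntlid clashes."""
    subsystems = {}
    accepted = []
    for c in controllers:
        if ignore_dev_subnqn:
            # Hypothetical synthesised name built from per-device identity.
            nqn = f"generated:{c['vid']:04x}:{c['serial']}:{c['model']}"
        else:
            nqn = c["subnqn"]
        cntlids = subsystems.setdefault(nqn, set())
        if c["cntlid"] in cntlids:
            continue                # "Duplicate cntlid ... rejecting"
        cntlids.add(c["cntlid"])
        accepted.append(c["name"])
    return accepted


# Two drives that report an identical subsystem NQN (illustrative values).
drives = [
    {"name": "nvme0", "vid": 0x1234, "serial": "SN-A", "model": "NM760",
     "subnqn": "nqn.example:shared", "cntlid": 1},
    {"name": "nvme1", "vid": 0x1234, "serial": "SN-B", "model": "NM760",
     "subnqn": "nqn.example:shared", "cntlid": 1},
]
```

Here `register(drives, False)` accepts only one drive, and `register(drives, True)` accepts both. Which drive survives in the unquirked case depends on probe order; in the log above it was nvme1.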
  5. 27 Mar, 2023 1 commit
    • loop: LOOP_CONFIGURE: send uevents for partitions · bb430b69
      Alyssa Ross authored
      LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to
      combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall.  When
      using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for
      each partition found on the loop device after the second ioctl(), but
      when using LOOP_CONFIGURE, no such uevent was being sent.
      
      In the old setup, uevents are disabled for LOOP_SET_FD, but not for
      LOOP_SET_STATUS64.  This makes sense, as it prevents uevents being
      sent for a partially configured device during LOOP_SET_FD - they're
      only sent at the end of LOOP_SET_STATUS64.  But for LOOP_CONFIGURE,
      uevents were disabled for the entire operation, so that final
      notification was never issued.  To fix this, reduce the critical
      section so that the loop_reread_partitions() call, which causes the
      uevents to be issued, runs after uevents are re-enabled, matching
      the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination.
      
      I noticed this because Busybox's losetup program recently changed from
      using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke
      my setup, for which I want a notification from the kernel any time a
      new partition becomes available.
      Signed-off-by: Alyssa Ross <hi@alyssa.is>
      [hch: reduced the critical section]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Fixes: 3448914e ("loop: Add LOOP_CONFIGURE ioctl")
      Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.de
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
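The effect of shrinking the critical section in the loop commit above can be illustrated with a small model. It is a sketch only: `LoopDevice`, `reread_partitions` and `configure` are hypothetical Python stand-ins, and the suppression flag is a simplification of how the kernel silences uevents.

```python
class LoopDevice:
    """Toy loop device that records the uevents it would emit."""

    def __init__(self):
        self.suppress_uevents = False
        self.events = []

    def reread_partitions(self, partitions):
        # The partition scan announces each partition unless suppressed.
        for p in partitions:
            if not self.suppress_uevents:
                self.events.append(f"add {p}")


def configure(dev, partitions, rescan_inside_critical_section):
    dev.suppress_uevents = True      # hide the half-configured device
    # ... bind the backing file and apply the status fields here ...
    if rescan_inside_critical_section:
        dev.reread_partitions(partitions)   # pre-fix: events are swallowed
    dev.suppress_uevents = False
    if not rescan_inside_critical_section:
        dev.reread_partitions(partitions)   # post-fix: events are delivered
    return dev.events
```

Calling `configure(LoopDevice(), ["loop0p1", "loop0p2"], True)` yields no events, while passing `False` yields one `add` event per partition, matching the two-ioctl behaviour described above.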
  6. 23 Mar, 2023 1 commit
  7. 22 Mar, 2023 2 commits
  8. 21 Mar, 2023 1 commit
  9. 18 Mar, 2023 1 commit
  10. 16 Mar, 2023 2 commits
    • block: remove obsolete config BLOCK_COMPAT · 8f0d196e
      Lukas Bulwahn authored
      Before commit bdc1ddad ("compat_ioctl: block: move
      blkdev_compat_ioctl() into ioctl.c"), the config BLOCK_COMPAT was used to
      include compat_ioctl.c into the kernel build. With that commit, the code
      was moved into ioctl.c and is included with the config COMPAT. So, since
      then, the config BLOCK_COMPAT has had no effect and no further purpose.
      
      Remove this obsolete config BLOCK_COMPAT.
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20230316111630.4897-1-lukas.bulwahn@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme into block-6.3 · 890a2fb0
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.3
      
       - avoid potential UAF in nvmet_req_complete (Damien Le Moal)
       - more quirks (Elmer Miroslav Mosher Golovin, Philipp Geulen)
       - fix a memory leak in the nvme-pci probe teardown path (Irvin Cote)
       - repair the MAINTAINERS entry (Lukas Bulwahn)
       - fix handling single range discard request (Ming Lei)
       - show more opcode names in trace events (Minwoo Im)
       - fix nvme-tcp timeout reporting (Sagi Grimberg)"
      
      * tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme:
        nvmet: avoid potential UAF in nvmet_req_complete()
        nvme-trace: show more opcode names
        nvme-tcp: add nvme-tcp pdu size build protection
        nvme-tcp: fix opcode reporting in the timeout handler
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM620
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV3000
        nvme-pci: fixing memory leak in probe teardown path
        nvme: fix handling single range discard request
        MAINTAINERS: repair malformed T: entries in NVM EXPRESS DRIVERS
  11. 15 Mar, 2023 17 commits
  12. 14 Mar, 2023 1 commit
    • block: do not reverse request order when flushing plug list · 34e0a279
      Jan Kara authored
      Commit 26fed4ac ("block: flush plug based on hardware and software
      queue order") changed flushing of the plug list to submit requests one
      device at a time. However, while doing that it also started using
      list_add_tail() instead of the previously used list_add(), thus
      effectively submitting requests in reverse order. Also, when forming a
      rq_list with the remaining requests (in case two or more devices are
      used), we effectively reverse the ordering of the plug list for each
      device we process. Submitting requests in reverse order has a negative
      impact on performance for rotational disks (when BFQ is not in use). We
      observe a 10-25% regression in random 4k write throughput, as well as a
      ~20% regression in the MariaDB OLTP benchmark on rotational storage with
      a btrfs filesystem.
      
      Fix the problem by preserving ordering of the plug list when inserting
      requests into the queuelist as well as by appending to requeue_list
      instead of prepending to it.
      
      Fixes: 26fed4ac ("block: flush plug based on hardware and software queue order")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
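The two reversals described in the plug-list commit above can be traced with a small simulation. This is a user-space sketch, not the block layer: `build_plug` and `flush` are hypothetical names, requests are `(device, sequence)` tuples, and the model assumes the plug list is built by prepending, so the newest request sits at the head.

```python
from collections import deque


def build_plug(submitted):
    """Build the plug list by prepending, newest request first."""
    plug = deque()
    for rq in submitted:
        plug.appendleft(rq)
    return plug


def flush(plug, fixed):
    """Dispatch one device at a time; return the per-device dispatch order."""
    dispatched = {}
    while plug:
        dev = plug[0][0]             # device of the request at the head
        batch, requeue = deque(), deque()
        while plug:
            rq = plug.popleft()
            if rq[0] == dev:
                if fixed:
                    batch.appendleft(rq)   # prepend: undoes the plug reversal
                else:
                    batch.append(rq)       # append: keeps the reversed order
            elif fixed:
                requeue.append(rq)         # append: preserves plug ordering
            else:
                requeue.appendleft(rq)     # prepend: reverses the remainder
        dispatched[dev] = list(batch)
        plug = requeue
    return dispatched
```

For requests submitted as `A1, B1, A2, B2`, the unfixed variant dispatches device B as `B2, B1`, while the fixed variant dispatches every device in submission order. With a single device the unfixed variant simply reverses the whole batch.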
  13. 13 Mar, 2023 2 commits
    • md: avoid signed overflow in slot_store() · 3bc57292
      NeilBrown authored
      slot_store() uses kstrtouint() to get a slot number, but stores the
      result in an "int" variable (by casting a pointer).
      This can result in a negative slot number if the unsigned int value is
      very large.
      
      A negative number means that the slot is empty, but setting a negative
      slot number this way will not remove the device from the array.  I don't
      think this is a serious problem, but it could cause confusion and it is
      best to fix it.
      Reported-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Song Liu <song@kernel.org>
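The overflow described in the slot_store() commit above comes from reading an unsigned 32-bit value through a signed 32-bit variable. The sketch below reproduces that reinterpretation in Python; `as_signed_int`, `parse_slot` and `parse_slot_checked` are hypothetical helpers, and the range check is one generic way to guard against it, not necessarily the exact check the patch adds.

```python
import struct

INT_MAX = 2**31 - 1


def as_signed_int(value):
    """Reinterpret a 32-bit unsigned value as a 32-bit signed int."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


def parse_slot(text):
    """Parse as unsigned, then store into a signed int (the old behaviour)."""
    value = int(text, 10)
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("out of range for unsigned int")
    return as_signed_int(value)


def parse_slot_checked(text):
    """Reject any value that does not fit in a non-negative signed int."""
    value = int(text, 10)
    if not 0 <= value <= INT_MAX:
        raise ValueError("slot number does not fit in an int")
    return value
```

With this model `parse_slot("4294967295")` returns `-1`, a value that looks like "slot is empty", whereas `parse_slot_checked("4294967295")` raises instead of silently producing a negative slot.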
    • md: Free resources in __md_stop · 3e453522
      Xiao Ni authored
      If md_run() fails after ->active_io is initialized, percpu_ref_exit is
      called in the error path. However, md_free_disk later calls
      percpu_ref_exit again, which leads to a panic because of a null pointer
      dereference. The same bug can be triggered by other resources that are
      initialized and then freed in the error path, because md_free_disk
      frees them a second time.
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      Workqueue: md_misc mddev_delayed_delete
      RIP: 0010:free_percpu+0x110/0x630
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x44/0x70
       percpu_ref_exit+0x16/0x90
       md_free_disk+0x2f/0x80
       disk_release+0x101/0x180
       device_release+0x84/0x110
       kobject_put+0x12a/0x380
       kobject_put+0x160/0x380
       mddev_delayed_delete+0x19/0x30
       process_one_work+0x269/0x680
       worker_thread+0x266/0x640
       kthread+0x151/0x1b0
       ret_from_fork+0x1f/0x30
      
      To create a raid device, md raid calls do_md_run->md_run and dm raid
      calls md_run. That memory is allocated in md_run. To stop a raid device,
      md raid calls do_md_stop->__md_stop and dm raid calls md_stop->__md_stop.
      So we can free those memory resources in __md_stop.
      
      Fixes: 72adae23 ("md: Change active_io to percpu")
      Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
  14. 08 Mar, 2023 1 commit
  15. 07 Mar, 2023 2 commits
  16. 05 Mar, 2023 4 commits
    • Linux 6.3-rc1 · fe15c26e
      Linus Torvalds authored
    • cpumask: re-introduce constant-sized cpumask optimizations · 596ff4a0
      Linus Torvalds authored
      Commit aa47a7c2 ("lib/cpumask: deprecate nr_cpumask_bits") resulted
      in the cpumask operations potentially becoming hugely less efficient,
      because suddenly the cpumask was always considered to be variable-sized.
      
      The optimization was then later added back in a limited form by commit
      6f9c07be ("lib/cpumask: add FORCE_NR_CPUS config option"), but that
      FORCE_NR_CPUS option is not useful in a generic kernel and more of a
      special case for embedded situations with fixed hardware.
      
      Instead, just re-introduce the optimization, with some changes.
      
      Instead of depending on CPUMASK_OFFSTACK being false, and then always
      using the full constant cpumask width, this introduces three different
      cpumask "sizes":
      
       - the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids.
      
         This is used for situations where we should use the exact size.
      
       - the "small" size (small_cpumask_bits) is the NR_CPUS constant if it
         fits in a single word and the bitmap operations thus end up able
         to trigger the "small_const_nbits()" optimizations.
      
         This is used for the operations that have optimized single-word
         cases that get inlined, notably the bit find and scanning functions.
      
       - the "large" size (large_cpumask_bits) is the NR_CPUS constant if it
         is a sufficiently small constant that makes simple "copy" and
         "clear" operations more efficient.
      
         This is arbitrarily set at four words or less.
      
      As an example of this situation, without this fixed size optimization,
      cpumask_clear() will generate code like
      
              movl    nr_cpu_ids(%rip), %edx
              addq    $63, %rdx
              shrq    $3, %rdx
              andl    $-8, %edx
              callq   memset@PLT
      
      on x86-64, because it would calculate the "exact" number of longwords
      that need to be cleared.
      
      In contrast, with this patch, using a MAX_CPU of 64 (which is quite a
      reasonable value to use), the above becomes a single
      
              movq    $0,cpumask
      
      instruction instead, because instead of caring to figure out exactly how
      many CPUs the system has, it just knows that the cpumask will be a
      single word and can just clear it all.
      
      Note that this does end up tightening the rules a bit from the original
      version in another way: operations that set bits in the cpumask are now
      limited to the actual nr_cpu_ids limit, whereas we used to do the
      nr_cpumask_bits thing almost everywhere in the cpumask code.
      
      But if you just clear bits, or scan for bits, we can use the simpler
      compile-time constants.
      
      In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()'
      which were not useful, and which fundamentally have to be limited to
      'nr_cpu_ids'.  Better remove them now than have somebody introduce use
      of them later.
      
      Of course, on x86-64 with MAXSMP there is no sane small compile-time
      constant for the cpumask sizes, and we end up using the actual CPU bits,
      and will generate the above kind of horrors regardless.  Please don't
      use MAXSMP unless you really expect to have machines with thousands of
      cores.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
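The three cpumask "sizes" described in the commit above can be summarised as a small decision function. This is a sketch based only on the rules stated in the message (single word for "small", four words or less for "large"), assuming a 64-bit build; `cpumask_sizes` and `exact_clear_bytes` are hypothetical names, and the string `"runtime"` marks a size that is not a compile-time constant.

```python
BITS_PER_LONG = 64   # assumption: a 64-bit build


def cpumask_sizes(nr_cpus, nr_cpu_ids):
    """Return (exact, small, large) sizes for a given NR_CPUS setting."""
    exact = nr_cpu_ids
    small = nr_cpus if nr_cpus <= BITS_PER_LONG else "runtime"
    large = nr_cpus if nr_cpus <= 4 * BITS_PER_LONG else "runtime"
    return exact, small, large


def exact_clear_bytes(nr_cpu_ids):
    # Same arithmetic as the addq/shrq/andl sequence in the listing:
    # round the bit count up to whole 64-bit words, expressed in bytes.
    return ((nr_cpu_ids + 63) >> 3) & ~7
```

For example, `cpumask_sizes(64, 8)` gives `(8, 64, 64)`, so clearing and scanning can use the constant, while `cpumask_sizes(8192, 8)` gives `(8, "runtime", "runtime")`, which is the MAXSMP-style case where only the runtime value is available.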
    • Merge tag 'v6.3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · f915322f
      Linus Torvalds authored
      Pull crypto fix from Herbert Xu:
       "Fix a regression in the caam driver"
      
      * tag 'v6.3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: caam - Fix edesc/iv ordering mixup
    • Merge tag 'x86-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7f9ec7d8
      Linus Torvalds authored
      Pull x86 updates from Thomas Gleixner:
       "A small set of updates for x86:
      
         - Return -EIO instead of success when the certificate buffer for SEV
           guests is not large enough
      
         - Allow STIBP to be enabled with legacy IBRS. Legacy IBRS is cleared
           on return to userspace for performance reasons, but that leaves user
           space vulnerable to cross-thread attacks which STIBP prevents.
           Update the documentation accordingly"
      
      * tag 'x86-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        virt/sev-guest: Return -EIO if certificate buffer is not large enough
        Documentation/hw-vuln: Document the interaction between IBRS and STIBP
        x86/speculation: Allow enabling STIBP with legacy IBRS