1. 30 Mar, 2023 1 commit
    • Sagi Grimberg's avatar
      nvme-tcp: fix a possible UAF when failing to allocate an io queue · 88eaba80
      Sagi Grimberg authored
      When we allocate a nvme-tcp queue, we set the data_ready callback before
      we actually need to use it. This creates the potential that if a stray
      controller sends us data on the socket before we connect, we can trigger
      the io_work and start consuming the socket.
      
      In this case reported: we failed to allocate one of the io queues, and
      as we start releasing the queues that we already allocated, we get
      a UAF [1] from the io_work which is running before it should really.
      
      Fix this by setting the socket ops callbacks only before we start the
      queue, so that we can't accidentally schedule the io_work in the
      initialization phase before the queue started. While we are at it,
      rename nvme_tcp_restore_sock_calls to pair with nvme_tcp_setup_sock_ops.
      
      [1]:
      [16802.107284] nvme nvme4: starting error recovery
      [16802.109166] nvme nvme4: Reconnecting in 10 seconds...
      [16812.173535] nvme nvme4: failed to connect socket: -111
      [16812.173745] nvme nvme4: Failed reconnect attempt 1
      [16812.173747] nvme nvme4: Reconnecting in 10 seconds...
      [16822.413555] nvme nvme4: failed to connect socket: -111
      [16822.413762] nvme nvme4: Failed reconnect attempt 2
      [16822.413765] nvme nvme4: Reconnecting in 10 seconds...
      [16832.661274] nvme nvme4: creating 32 I/O queues.
      [16833.919887] BUG: kernel NULL pointer dereference, address: 0000000000000088
      [16833.920068] nvme nvme4: Failed reconnect attempt 3
      [16833.920094] #PF: supervisor write access in kernel mode
      [16833.920261] nvme nvme4: Reconnecting in 10 seconds...
      [16833.920368] #PF: error_code(0x0002) - not-present page
      [16833.921086] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
      [16833.921191] RIP: 0010:_raw_spin_lock_bh+0x17/0x30
      ...
      [16833.923138] Call Trace:
      [16833.923271]  <TASK>
      [16833.923402]  lock_sock_nested+0x1e/0x50
      [16833.923545]  nvme_tcp_try_recv+0x40/0xa0 [nvme_tcp]
      [16833.923685]  nvme_tcp_io_work+0x68/0xa0 [nvme_tcp]
      [16833.923824]  process_one_work+0x1e8/0x390
      [16833.923969]  worker_thread+0x53/0x3d0
      [16833.924104]  ? process_one_work+0x390/0x390
      [16833.924240]  kthread+0x124/0x150
      [16833.924376]  ? set_kthread_struct+0x50/0x50
      [16833.924518]  ret_from_fork+0x1f/0x30
      [16833.924655]  </TASK>
      Reported-by: default avatarYanjun Zhang <zhangyanjun@cestc.cn>
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Tested-by: default avatarYanjun Zhang <zhangyanjun@cestc.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      88eaba80
  2. 28 Mar, 2023 1 commit
    • Juraj Pecigos's avatar
      nvme-pci: mark Lexar NM760 as IGNORE_DEV_SUBNQN · 1231363a
      Juraj Pecigos authored
      A system with more than one of these SSDs will only have one usable.
      The kernel fails to detect more than one nvme device due to duplicate
      cntlids.
      
      before:
      [    9.395229] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    9.395262] nvme nvme0: pci function 0000:01:00.0
      [    9.395282] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    9.395305] nvme nvme1: pci function 0000:03:00.0
      [    9.409873] nvme nvme0: Duplicate cntlid 1 with nvme1, subsys nqn.2022-07.com.siliconmotion:nvm-subsystem-sn-                    , rejecting
      [    9.409982] nvme nvme0: Removing after probe failure status: -22
      [    9.427487] nvme nvme1: allocated 64 MiB host memory buffer.
      [    9.445088] nvme nvme1: 16/0/0 default/read/poll queues
      [    9.449898] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      after:
      [    1.161890] nvme 0000:01:00.0: platform quirk: setting simple suspend
      [    1.162660] nvme nvme0: pci function 0000:01:00.0
      [    1.162684] nvme 0000:03:00.0: platform quirk: setting simple suspend
      [    1.162707] nvme nvme1: pci function 0000:03:00.0
      [    1.191354] nvme nvme0: allocated 64 MiB host memory buffer.
      [    1.193378] nvme nvme1: allocated 64 MiB host memory buffer.
      [    1.211044] nvme nvme1: 16/0/0 default/read/poll queues
      [    1.211080] nvme nvme0: 16/0/0 default/read/poll queues
      [    1.216145] nvme nvme0: Ignoring bogus Namespace Identifiers
      [    1.216261] nvme nvme1: Ignoring bogus Namespace Identifiers
      
      Adding the NVME_QUIRK_IGNORE_DEV_SUBNQN quirk to resolves the issue.
      Signed-off-by: default avatarJuraj Pecigos <kernel@juraj.dev>
      Reviewed-by: default avatarChaitanya Kulkarni <kch@nvidia.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      1231363a
  3. 27 Mar, 2023 1 commit
    • Alyssa Ross's avatar
      loop: LOOP_CONFIGURE: send uevents for partitions · bb430b69
      Alyssa Ross authored
      LOOP_CONFIGURE is, as far as I understand it, supposed to be a way to
      combine LOOP_SET_FD and LOOP_SET_STATUS64 into a single syscall.  When
      using LOOP_SET_FD+LOOP_SET_STATUS64, a single uevent would be sent for
      each partition found on the loop device after the second ioctl(), but
      when using LOOP_CONFIGURE, no such uevent was being sent.
      
      In the old setup, uevents are disabled for LOOP_SET_FD, but not for
      LOOP_SET_STATUS64.  This makes sense, as it prevents uevents being
      sent for a partially configured device during LOOP_SET_FD - they're
      only sent at the end of LOOP_SET_STATUS64.  But for LOOP_CONFIGURE,
      uevents were disabled for the entire operation, so that final
      notification was never issued.  To fix this, reduce the critical
      section to exclude the loop_reread_partitions() call, which causes
      the uevents to be issued, to after uevents are re-enabled, matching
      the behaviour of the LOOP_SET_FD+LOOP_SET_STATUS64 combination.
      
      I noticed this because Busybox's losetup program recently changed from
      using LOOP_SET_FD+LOOP_SET_STATUS64 to LOOP_CONFIGURE, and this broke
      my setup, for which I want a notification from the kernel any time a
      new partition becomes available.
      Signed-off-by: default avatarAlyssa Ross <hi@alyssa.is>
      [hch: reduced the critical section]
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Fixes: 3448914e ("loop: Add LOOP_CONFIGURE ioctl")
      Link: https://lore.kernel.org/r/20230320125430.55367-1-hch@lst.deSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bb430b69
  4. 23 Mar, 2023 1 commit
  5. 22 Mar, 2023 2 commits
  6. 21 Mar, 2023 1 commit
  7. 18 Mar, 2023 1 commit
  8. 16 Mar, 2023 2 commits
    • Lukas Bulwahn's avatar
      block: remove obsolete config BLOCK_COMPAT · 8f0d196e
      Lukas Bulwahn authored
      Before commit bdc1ddad ("compat_ioctl: block: move
      blkdev_compat_ioctl() into ioctl.c"), the config BLOCK_COMPAT was used to
      include compat_ioctl.c into the kernel build. With this commit, the code
      is moved into ioctl.c and included with the config COMPAT. So, since then,
      the config BLOCK_COMPAT has no effect and any further purpose.
      
      Remove this obsolete config BLOCK_COMPAT.
      Signed-off-by: default avatarLukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Acked-by: default avatarArnd Bergmann <arnd@arndb.de>
      Link: https://lore.kernel.org/r/20230316111630.4897-1-lukas.bulwahn@gmail.comSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8f0d196e
    • Jens Axboe's avatar
      Merge tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme into block-6.3 · 890a2fb0
      Jens Axboe authored
      Pull NVMe fixes from Christoph:
      
      "nvme fixes for Linux 6.3
      
       - avoid potential UAF in nvmet_req_complete (Damien Le Moal)
       - more quirks (Elmer Miroslav Mosher Golovin, Philipp Geulen)
       - fix a memory leak in the nvme-pci probe teardown path (Irvin Cote)
       - repair the MAINTAINERS entry (Lukas Bulwahn)
       - fix handling single range discard request (Ming Lei)
       - show more opcode names in trace events (Minwoo Im)
       - fix nvme-tcp timeout reporting (Sagi Grimberg)"
      
      * tag 'nvme-6.3-2022-03-16' of git://git.infradead.org/nvme:
        nvmet: avoid potential UAF in nvmet_req_complete()
        nvme-trace: show more opcode names
        nvme-tcp: add nvme-tcp pdu size build protection
        nvme-tcp: fix opcode reporting in the timeout handler
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Lexar NM620
        nvme-pci: add NVME_QUIRK_BOGUS_NID for Netac NV3000
        nvme-pci: fixing memory leak in probe teardown path
        nvme: fix handling single range discard request
        MAINTAINERS: repair malformed T: entries in NVM EXPRESS DRIVERS
      890a2fb0
  9. 15 Mar, 2023 17 commits
  10. 14 Mar, 2023 1 commit
    • Jan Kara's avatar
      block: do not reverse request order when flushing plug list · 34e0a279
      Jan Kara authored
      Commit 26fed4ac ("block: flush plug based on hardware and software
      queue order") changed flushing of plug list to submit requests one
      device at a time. However while doing that it also started using
      list_add_tail() instead of list_add() used previously thus effectively
      submitting requests in reverse order. Also when forming a rq_list with
      remaining requests (in case two or more devices are used), we
      effectively reverse the ordering of the plug list for each device we
      process. Submitting requests in reverse order has negative impact on
      performance for rotational disks (when BFQ is not in use). We observe
      10-25% regression in random 4k write throughput, as well as ~20%
      regression in MariaDB OLTP benchmark on rotational storage on btrfs
      filesystem.
      
      Fix the problem by preserving ordering of the plug list when inserting
      requests into the queuelist as well as by appending to requeue_list
      instead of prepending to it.
      
      Fixes: 26fed4ac ("block: flush plug based on hardware and software queue order")
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20230313093002.11756-1-jack@suse.czSigned-off-by: default avatarJens Axboe <axboe@kernel.dk>
      34e0a279
  11. 13 Mar, 2023 2 commits
    • NeilBrown's avatar
      md: avoid signed overflow in slot_store() · 3bc57292
      NeilBrown authored
      slot_store() uses kstrtouint() to get a slot number, but stores the
      result in an "int" variable (by casting a pointer).
      This can result in a negative slot number if the unsigned int value is
      very large.
      
      A negative number means that the slot is empty, but setting a negative
      slot number this way will not remove the device from the array.  I don't
      think this is a serious problem, but it could cause confusion and it is
      best to fix it.
      Reported-by: default avatarDan Carpenter <error27@gmail.com>
      Signed-off-by: default avatarNeilBrown <neilb@suse.de>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      3bc57292
    • Xiao Ni's avatar
      md: Free resources in __md_stop · 3e453522
      Xiao Ni authored
      If md_run() fails after ->active_io is initialized, then percpu_ref_exit
      is called in error path. However, later md_free_disk will call
      percpu_ref_exit again which leads to a panic because of null pointer
      dereference. It can also trigger this bug when resources are initialized
      but are freed in error path, then will be freed again in md_free_disk.
      
      BUG: kernel NULL pointer dereference, address: 0000000000000038
      Oops: 0000 [#1] PREEMPT SMP
      Workqueue: md_misc mddev_delayed_delete
      RIP: 0010:free_percpu+0x110/0x630
      Call Trace:
       <TASK>
       __percpu_ref_exit+0x44/0x70
       percpu_ref_exit+0x16/0x90
       md_free_disk+0x2f/0x80
       disk_release+0x101/0x180
       device_release+0x84/0x110
       kobject_put+0x12a/0x380
       kobject_put+0x160/0x380
       mddev_delayed_delete+0x19/0x30
       process_one_work+0x269/0x680
       worker_thread+0x266/0x640
       kthread+0x151/0x1b0
       ret_from_fork+0x1f/0x30
      
      For creating raid device, md raid calls do_md_run->md_run, dm raid calls
      md_run. We alloc those memory in md_run. For stopping raid device, md raid
      calls do_md_stop->__md_stop, dm raid calls md_stop->__md_stop. So we can
      free those memory resources in __md_stop.
      
      Fixes: 72adae23 ("md: Change active_io to percpu")
      Reported-and-tested-by: default avatarYu Kuai <yukuai3@huawei.com>
      Signed-off-by: default avatarXiao Ni <xni@redhat.com>
      Signed-off-by: default avatarSong Liu <song@kernel.org>
      3e453522
  12. 08 Mar, 2023 1 commit
  13. 07 Mar, 2023 2 commits
  14. 05 Mar, 2023 7 commits
    • Linus Torvalds's avatar
      Linux 6.3-rc1 · fe15c26e
      Linus Torvalds authored
      fe15c26e
    • Linus Torvalds's avatar
      cpumask: re-introduce constant-sized cpumask optimizations · 596ff4a0
      Linus Torvalds authored
      Commit aa47a7c2 ("lib/cpumask: deprecate nr_cpumask_bits") resulted
      in the cpumask operations potentially becoming hugely less efficient,
      because suddenly the cpumask was always considered to be variable-sized.
      
      The optimization was then later added back in a limited form by commit
      6f9c07be ("lib/cpumask: add FORCE_NR_CPUS config option"), but that
      FORCE_NR_CPUS option is not useful in a generic kernel and more of a
      special case for embedded situations with fixed hardware.
      
      Instead, just re-introduce the optimization, with some changes.
      
      Instead of depending on CPUMASK_OFFSTACK being false, and then always
      using the full constant cpumask width, this introduces three different
      cpumask "sizes":
      
       - the exact size (nr_cpumask_bits) remains identical to nr_cpu_ids.
      
         This is used for situations where we should use the exact size.
      
       - the "small" size (small_cpumask_bits) is the NR_CPUS constant if it
         fits in a single word and the bitmap operations thus end up able
         to trigger the "small_const_nbits()" optimizations.
      
         This is used for the operations that have optimized single-word
         cases that get inlined, notably the bit find and scanning functions.
      
       - the "large" size (large_cpumask_bits) is the NR_CPUS constant if it
         is an sufficiently small constant that makes simple "copy" and
         "clear" operations more efficient.
      
         This is arbitrarily set at four words or less.
      
      As a an example of this situation, without this fixed size optimization,
      cpumask_clear() will generate code like
      
              movl    nr_cpu_ids(%rip), %edx
              addq    $63, %rdx
              shrq    $3, %rdx
              andl    $-8, %edx
              callq   memset@PLT
      
      on x86-64, because it would calculate the "exact" number of longwords
      that need to be cleared.
      
      In contrast, with this patch, using a MAX_CPU of 64 (which is quite a
      reasonable value to use), the above becomes a single
      
      	movq $0,cpumask
      
      instruction instead, because instead of caring to figure out exactly how
      many CPU's the system has, it just knows that the cpumask will be a
      single word and can just clear it all.
      
      Note that this does end up tightening the rules a bit from the original
      version in another way: operations that set bits in the cpumask are now
      limited to the actual nr_cpu_ids limit, whereas we used to do the
      nr_cpumask_bits thing almost everywhere in the cpumask code.
      
      But if you just clear bits, or scan for bits, we can use the simpler
      compile-time constants.
      
      In the process, remove 'cpumask_complement()' and 'for_each_cpu_not()'
      which were not useful, and which fundamentally have to be limited to
      'nr_cpu_ids'.  Better remove them now than have somebody introduce use
      of them later.
      
      Of course, on x86-64 with MAXSMP there is no sane small compile-time
      constant for the cpumask sizes, and we end up using the actual CPU bits,
      and will generate the above kind of horrors regardless.  Please don't
      use MAXSMP unless you really expect to have machines with thousands of
      cores.
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      596ff4a0
    • Linus Torvalds's avatar
      Merge tag 'v6.3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · f915322f
      Linus Torvalds authored
      Pull crypto fix from Herbert Xu:
       "Fix a regression in the caam driver"
      
      * tag 'v6.3-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
        crypto: caam - Fix edesc/iv ordering mixup
      f915322f
    • Linus Torvalds's avatar
      Merge tag 'x86-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 7f9ec7d8
      Linus Torvalds authored
      Pull x86 updates from Thomas Gleixner:
       "A small set of updates for x86:
      
         - Return -EIO instead of success when the certificate buffer for SEV
           guests is not large enough
      
         - Allow STIPB to be enabled with legacy IBSR. Legacy IBRS is cleared
           on return to userspace for performance reasons, but the leaves user
           space vulnerable to cross-thread attacks which STIBP prevents.
           Update the documentation accordingly"
      
      * tag 'x86-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        virt/sev-guest: Return -EIO if certificate buffer is not large enough
        Documentation/hw-vuln: Document the interaction between IBRS and STIBP
        x86/speculation: Allow enabling STIBP with legacy IBRS
      7f9ec7d8
    • Linus Torvalds's avatar
      Merge tag 'irq-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 4e9c542c
      Linus Torvalds authored
      Pull irq updates from Thomas Gleixner:
       "A set of updates for the interrupt susbsystem:
      
         - Prevent possible NULL pointer derefences in
           irq_data_get_affinity_mask() and irq_domain_create_hierarchy()
      
         - Take the per device MSI lock before invoking code which relies on
           it being hold
      
         - Make sure that MSI descriptors are unreferenced before freeing
           them. This was overlooked when the platform MSI code was converted
           to use core infrastructure and results in a fals positive warning
      
         - Remove dead code in the MSI subsystem
      
         - Clarify the documentation for pci_msix_free_irq()
      
         - More kobj_type constification"
      
      * tag 'irq-urgent-2023-03-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        genirq/msi, platform-msi: Ensure that MSI descriptors are unreferenced
        genirq/msi: Drop dead domain name assignment
        irqdomain: Add missing NULL pointer check in irq_domain_create_hierarchy()
        genirq/irqdesc: Make kobj_type structures constant
        PCI/MSI: Clarify usage of pci_msix_free_irq()
        genirq/msi: Take the per-device MSI lock before validating the control structure
        genirq/ipi: Fix NULL pointer deref in irq_data_get_affinity_mask()
      4e9c542c
    • Linus Torvalds's avatar
      Merge tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 1a90673e
      Linus Torvalds authored
      Pull vfs update from Al Viro:
       "Adding Christian Brauner as VFS co-maintainer"
      
      * tag 'pull-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        Adding VFS co-maintainer
      1a90673e
    • Linus Torvalds's avatar
      Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 1a8d05a7
      Linus Torvalds authored
      Pull VM_FAULT_RETRY fixes from Al Viro:
       "Some of the page fault handlers do not deal with the following case
        correctly:
      
         - handle_mm_fault() has returned VM_FAULT_RETRY
      
         - there is a pending fatal signal
      
         - fault had happened in kernel mode
      
        Correct action in such case is not "return unconditionally" - fatal
        signals are handled only upon return to userland and something like
        copy_to_user() would end up retrying the faulting instruction and
        triggering the same fault again and again.
      
        What we need to do in such case is to make the caller to treat that as
        failed uaccess attempt - handle exception if there is an exception
        handler for faulting instruction or oops if there isn't one.
      
        Over the years some architectures had been fixed and now are handling
        that case properly; some still do not. This series should fix the
        remaining ones.
      
        Status:
      
         - m68k, riscv, hexagon, parisc: tested/acked by maintainers.
      
         - alpha, sparc32, sparc64: tested locally - bug has been reproduced
           on the unpatched kernel and verified to be fixed by this series.
      
         - ia64, microblaze, nios2, openrisc: build, but otherwise completely
           untested"
      
      * tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        openrisc: fix livelock in uaccess
        nios2: fix livelock in uaccess
        microblaze: fix livelock in uaccess
        ia64: fix livelock in uaccess
        sparc: fix livelock in uaccess
        alpha: fix livelock in uaccess
        parisc: fix livelock in uaccess
        hexagon: fix livelock in uaccess
        riscv: fix livelock in uaccess
        m68k: fix livelock in uaccess
      1a8d05a7