1. 06 Jan, 2022 10 commits
    • md: Move alloc/free acct bioset in to personality · 0c031fd3
      Xiao Ni authored
      The acct bioset is only needed for raid0 and raid5, so md_run only
      allocates it for those levels. However, this does not cover personality
      takeover, which may leave the bioset uninitialized. For example, the
      following repro steps:
      
        mdadm -CR /dev/md0 -l1 -n2 /dev/loop0 /dev/loop1
        mdadm --wait /dev/md0
        mkfs.xfs /dev/md0
        mdadm /dev/md0 --grow -l5
        mount /dev/md0 /mnt
      
      cause a panic like:
      
      [  225.933939] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [  225.934903] #PF: supervisor instruction fetch in kernel mode
      [  225.935639] #PF: error_code(0x0010) - not-present page
      [  225.936361] PGD 0 P4D 0
      [  225.936677] Oops: 0010 [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN PTI
      [  225.937525] CPU: 27 PID: 1133 Comm: mount Not tainted 5.16.0-rc3+ #706
      [  225.938416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.module_el8.4.0+547+a85d02ba 04/01/2014
      [  225.939922] RIP: 0010:0x0
      [  225.940289] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
      [  225.941196] RSP: 0018:ffff88815897eff0 EFLAGS: 00010246
      [  225.941897] RAX: 0000000000000000 RBX: 0000000000092800 RCX: ffffffff81370a39
      [  225.942813] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000092800
      [  225.943772] RBP: 1ffff1102b12fe04 R08: fffffbfff0b43c01 R09: fffffbfff0b43c01
      [  225.944807] R10: ffffffff85a1e007 R11: fffffbfff0b43c00 R12: ffff88810eaaaf58
      [  225.945757] R13: 0000000000000000 R14: ffff88810eaaafb8 R15: ffff88815897f040
      [  225.946709] FS:  00007ff3f2505080(0000) GS:ffff888fb5e00000(0000) knlGS:0000000000000000
      [  225.947814] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  225.948556] CR2: ffffffffffffffd6 CR3: 000000015aa5a006 CR4: 0000000000370ee0
      [  225.949537] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  225.950455] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  225.951414] Call Trace:
      [  225.951787]  <TASK>
      [  225.952120]  mempool_alloc+0xe5/0x250
      [  225.952625]  ? mempool_resize+0x370/0x370
      [  225.953187]  ? rcu_read_lock_sched_held+0xa1/0xd0
      [  225.953862]  ? rcu_read_lock_bh_held+0xb0/0xb0
      [  225.954464]  ? sched_clock_cpu+0x15/0x120
      [  225.955019]  ? find_held_lock+0xac/0xd0
      [  225.955564]  bio_alloc_bioset+0x1ed/0x2a0
      [  225.956080]  ? lock_downgrade+0x3a0/0x3a0
      [  225.956644]  ? bvec_alloc+0xc0/0xc0
      [  225.957135]  bio_clone_fast+0x19/0x80
      [  225.957651]  raid5_make_request+0x1370/0x1b70
      [  225.958286]  ? sched_clock_cpu+0x15/0x120
      [  225.958797]  ? __lock_acquire+0x8b2/0x3510
      [  225.959339]  ? raid5_get_active_stripe+0xce0/0xce0
      [  225.959986]  ? lock_is_held_type+0xd8/0x130
      [  225.960528]  ? rcu_read_lock_sched_held+0xa1/0xd0
      [  225.961135]  ? rcu_read_lock_bh_held+0xb0/0xb0
      [  225.961703]  ? sched_clock_cpu+0x15/0x120
      [  225.962232]  ? lock_release+0x27a/0x6c0
      [  225.962746]  ? do_wait_intr_irq+0x130/0x130
      [  225.963302]  ? lock_downgrade+0x3a0/0x3a0
      [  225.963815]  ? lock_release+0x6c0/0x6c0
      [  225.964348]  md_handle_request+0x342/0x530
      [  225.964888]  ? set_in_sync+0x170/0x170
      [  225.965397]  ? blk_queue_split+0x133/0x150
      [  225.965988]  ? __blk_queue_split+0x8b0/0x8b0
      [  225.966524]  ? submit_bio_checks+0x3b2/0x9d0
      [  225.967069]  md_submit_bio+0x127/0x1c0
      [...]
      
      Fix this by moving alloc/free of acct bioset to pers->run and pers->free.
      
      While we are at it, properly handle md_integrity_register() errors in
      raid0_run().
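
      The shape of the fix is roughly the following (a sketch: the helper
      names and the md_io_acct/io_acct_set fields follow md's existing io
      accounting code, and details may differ from the actual patch):

        /* md.c: helpers the personalities call from ->run()/->free() */
        int acct_bioset_init(struct mddev *mddev)
        {
                int err = 0;

                if (!bioset_initialized(&mddev->io_acct_set))
                        err = bioset_init(&mddev->io_acct_set, BIO_POOL_SIZE,
                                          offsetof(struct md_io_acct, bio_clone), 0);
                return err;
        }

        void acct_bioset_exit(struct mddev *mddev)
        {
                bioset_exit(&mddev->io_acct_set);
        }

        /* raid0.c (and similarly raid5.c): allocate early in ->run() and
           release in ->free(), so a takeover always has a valid bioset */
        static int raid0_run(struct mddev *mddev)
        {
                int ret = acct_bioset_init(mddev);

                if (ret)
                        return ret;
                /* existing raid0 setup continues, calling acct_bioset_exit()
                   on any later failure and from raid0_free() */
                return 0;
        }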
      
      Fixes: daee2024 ("md: check level before create and exit io_acct_set")
      Cc: stable@vger.kernel.org
      Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
      Signed-off-by: Xiao Ni <xni@redhat.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • lib/raid6: Use strict priority ranking for pq gen() benchmarking · 36dacddb
      Dirk Müller authored
      On x86_64, currently 3 variants of AVX512, 3 variants of AVX2 and
      3 variants of SSE2 are benchmarked on initialization, taking between
      144 and 153 jiffies. Testing across a hardware pool of various
      generations of Intel CPUs, I could not find a single case where SSE2
      won over AVX2 or AVX512. There are cases where AVX2 wins over AVX512,
      however.
      
      Change "prefer" into an integer priority field (similar to how recov
      selection works) so that more than one ranking level is available,
      which remains backwards compatible with the existing behavior.
      
      Give the AVX2/AVX512 variants higher priority than SSE2 in order to
      skip SSE testing when AVX is available. In an AVX2/x86_64/HZ=250 case
      this saves on the order of 200ms of initialization time.
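
      In sketch form (the struct follows lib/raid6's raid6_calls; the
      selection-loop condition is illustrative rather than the exact patch):

        struct raid6_calls {
                void (*gen_syndrome)(int, size_t, void **);
                void (*xor_syndrome)(int, int, int, size_t, void **);
                int  (*valid)(void);    /* returns 1 if usable on this CPU */
                const char *name;
                int  priority;          /* was the boolean-ish "prefer" flag */
        };

        /* raid6_choose_gen(): benchmark only candidates that are not
           outranked by an already-validated higher-priority candidate, so
           SSE2 is never timed once an AVX2/AVX512 variant has proven valid */
        for (algo = raid6_algos; *algo; algo++) {
                if (best && (*algo)->priority < best->priority)
                        continue;
                if ((*algo)->valid && !(*algo)->valid())
                        continue;
                /* gen_syndrome() benchmark runs here; keep the fastest */
        }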
      Signed-off-by: Dirk Müller <dmueller@suse.de>
      Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • lib/raid6: skip benchmark of non-chosen xor_syndrome functions · 38640c48
      Dirk Müller authored
      In commit fe5cbc6e ("md/raid6 algorithms: delta syndrome functions")
      an xor_syndrome() benchmark was also added to the raid6_choose_gen()
      function. However, the results of that benchmark were intentionally
      discarded and did not influence the choice: the xor_syndrome() variant
      belonging to the best-performing gen_syndrome() was picked regardless.
      
      Reduce the runtime of raid6_choose_gen() without modifying its outcome
      by benchmarking only the xor_syndrome() of the best gen_syndrome()
      variant.
      
      For an HZ=250 x86_64 system with AVX2 and without AVX512 this removes
      5 out of 6 xor() benchmarks, saving 340ms of raid6 initialization time.
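
      The winner-only benchmark then looks roughly like this (a sketch
      modeled on the existing timing loop in lib/raid6/algos.c; disks, start,
      stop and dptrs are the test buffers already set up by the caller):

        /* time xor_syndrome() once, only for the already-chosen "best" */
        if (best->xor_syndrome) {
                unsigned long j0, j1, perf = 0;

                preempt_disable();
                j0 = jiffies;
                while ((j1 = jiffies) == j0)
                        cpu_relax();
                while (time_before(jiffies,
                                   j1 + (1 << RAID6_TIME_JIFFIES_LG2))) {
                        best->xor_syndrome(disks, start, stop,
                                           PAGE_SIZE, dptrs);
                        perf++;
                }
                preempt_enable();
                /* report perf scaled to MB/s, as the existing code does */
        }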
      Signed-off-by: Dirk Müller <dmueller@suse.de>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: fix spelling of "its" · dd3dc5f4
      Randy Dunlap authored
      Use the possessive "its" instead of the contraction "it's"
      in printed messages.
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: Song Liu <song@kernel.org>
      Cc: linux-raid@vger.kernel.org
      Signed-off-by: Song Liu <song@kernel.org>
    • md: raid456 add nowait support · bf2c411b
      Vishal Verma authored
      Return EAGAIN if the raid456 driver would block waiting for a reshape.
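
      The pattern, roughly (a sketch: bio_overlaps_reshape() is a stand-in
      name for the real, more involved condition in raid5_make_request();
      bio_wouldblock_error() completes the bio with BLK_STS_AGAIN, which
      user space sees as EAGAIN):

        /* raid5_make_request(): instead of sleeping until the reshape
           window moves, fail fast for REQ_NOWAIT bios */
        if ((bi->bi_opf & REQ_NOWAIT) && bio_overlaps_reshape(conf, bi)) {
                bio_wouldblock_error(bi);
                return true;
        }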
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Vishal Verma <vverma@digitalocean.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: raid10 add nowait support · c9aa889b
      Vishal Verma authored
      This adds nowait support to the RAID10 driver. The changes are very
      similar to the raid1 driver changes. The RAID10 driver now returns
      EAGAIN in situations where it would otherwise wait, e.g.:
      
        - waiting for the barrier,
        - a reshape operation,
        - a discard operation.
      
      wait_barrier() and regular_request_wait() are modified to return bool
      so that a refused wait can be reported (see the sketch below). They
      return true if the wait completed or was not required, and false if a
      wait was required but was skipped to honor nowait.
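
      Roughly (a sketch: need_to_wait() is a stand-in for the existing
      barrier condition, and the locking detail is simplified):

        /* wait_barrier() now reports whether the caller may proceed */
        static bool wait_barrier(struct r10conf *conf, bool nowait)
        {
                bool ret = true;

                if (need_to_wait(conf)) {
                        if (nowait)
                                ret = false;    /* refuse to sleep */
                        else
                                wait_event_lock_irq(conf->wait_barrier,
                                                    !need_to_wait(conf),
                                                    conf->resync_lock);
                }
                return ret;
        }

        /* callers such as regular_request_wait() then translate a refused
           wait into an EAGAIN completion */
        if (!wait_barrier(conf, bio->bi_opf & REQ_NOWAIT)) {
                bio_wouldblock_error(bio);
                return false;
        }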
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Vishal Verma <vverma@digitalocean.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: raid1 add nowait support · 5aa70503
      Vishal Verma authored
      This adds nowait support to the RAID1 driver. It makes the RAID1
      driver return EAGAIN in situations where it would otherwise wait,
      e.g.:
      
        - waiting for the barrier.
      
      wait_barrier() is modified to return bool so that a refused wait can
      be reported, following the same pattern as the RAID10 change above. It
      returns true if the wait completed or was not required, and false if a
      wait was required but was skipped to honor nowait.
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Vishal Verma <vverma@digitalocean.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: add support for REQ_NOWAIT · f51d46d0
      Vishal Verma authored
      commit 021a2446 ("block: add QUEUE_FLAG_NOWAIT") added support for
      checking whether a given bdev supports handling of REQ_NOWAIT or not.
      Since then, commit 6abc4946 ("dm: add support for REQ_NOWAIT and enable
      it for linear target") added REQ_NOWAIT support for dm. This takes a
      similar approach to incorporate REQ_NOWAIT for md based bios.
      
      This patch was tested using the t/io_uring tool shipped with fio. An
      NVMe drive was partitioned into 2 partitions and a simple RAID0
      configuration /dev/md0 was created.
      
      md0 : active raid0 nvme4n1p1[1] nvme4n1p2[0]
            937423872 blocks super 1.2 512k chunks
      
      Before patch:
      
      $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
      
      Running top while the above runs:
      
      $ ps -eL | grep $(pidof io_uring)
      
        38396   38396 pts/2    00:00:00 io_uring
        38396   38397 pts/2    00:00:15 io_uring
        38396   38398 pts/2    00:00:13 iou-wrk-38397
      
      We can see the iou-wrk-38397 io worker thread, which io_uring creates
      when it sees that the underlying device (/dev/md0 in this case)
      doesn't support nowait.
      
      After patch:
      
      $ ./t/io_uring /dev/md0 -p 0 -a 0 -d 1 -r 100
      
      Running top while the above runs:
      
      $ ps -eL | grep $(pidof io_uring)
      
        38341   38341 pts/2    00:10:22 io_uring
        38341   38342 pts/2    00:10:37 io_uring
      
      With this patch, no io worker thread is created, which indicates that
      io_uring saw that the underlying device does support nowait. This is
      the exact behaviour observed on a dm device, which also supports
      nowait.
      
      For the other raid personalities (everything except raid0), the pieces
      involved in their make_request fn still need to be taught to correctly
      handle REQ_NOWAIT.
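
      The md core side, in sketch form (advertising nowait support and
      bailing out instead of sleeping; is_suspended() is md.c's existing
      helper, and the exact placement in the actual patch may differ):

        /* advertise REQ_NOWAIT handling on the md request queue */
        blk_queue_flag_set(QUEUE_FLAG_NOWAIT, mddev->queue);

        /* md_handle_request(): where a bio would otherwise wait for a
           suspended array to resume, fail fast for REQ_NOWAIT bios */
        if (is_suspended(mddev, bio)) {
                if (bio->bi_opf & REQ_NOWAIT) {
                        bio_wouldblock_error(bio);
                        return;
                }
                /* otherwise fall through to the existing wait on sb_wait */
        }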
      Reviewed-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Vishal Verma <vverma@digitalocean.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md: drop queue limitation for RAID1 and RAID10 · a92ce0fe
      Mariusz Tkaczyk authored
      As suggested by Neil Brown[1], this limitation appears to be obsolete.
      
      With plugging in use, writes are processed behind the raid thread and
      conf->pending_count is not increased, so the limitation only takes
      effect when the caller doesn't use plugs.
      
      It can be avoided, and often is (with plugging). There are no reports
      of the queue growing to an enormous size, so remove the queue
      limitation for non-plugged IOs too.
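
      The throttle being removed looked roughly like this in the raid1/raid10
      write paths (a sketch reconstructed from the description above):

        /* removed: stall new writes once too many are queued for the
           raid thread to submit */
        if (conf->pending_count >= max_queued_requests) {
                md_wakeup_thread(mddev->thread);
                wait_event(conf->wait_barrier,
                           conf->pending_count < max_queued_requests);
        }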
      
      [1] https://lore.kernel.org/linux-raid/162496301481.7211.18031090130574610495@noble.neil.brown.name
      
      Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
      Signed-off-by: Song Liu <song@kernel.org>
    • md/raid5: play nice with PREEMPT_RT · 770b1d21
      Davidlohr Bueso authored
      raid_run_ops() relies on implicitly disabled preemption for its percpu
      ops, although this is really about CPU locality. This breaks RT
      semantics, as the region can take regular (and thus sleeping)
      spinlocks, such as stripe_lock.
      
      Add a local_lock such that non-RT behavior does not change and it
      continues to simply map to preempt_disable/enable, but RT is happy:
      the region will use a per-CPU spinlock and thus be preemptible while
      still guaranteeing CPU locality.
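
      In sketch form, with the local_lock API (field names are illustrative;
      the lock lives in raid5's per-CPU scratch data):

        #include <linux/local_lock.h>

        struct raid5_percpu {
                local_lock_t    lock;        /* serializes users of this data */
                struct page     *spare_page;
                void            *scribble;
        };

        /* raid_run_ops(): instead of relying on implicit preemption
           disabling, take the local lock around the per-CPU section.
           On !RT this is preempt_disable()/enable(); on RT it is a
           per-CPU spinlock, so the section stays preemptible. */
        local_lock(&conf->percpu->lock);
        percpu = this_cpu_ptr(conf->percpu);
        /* ... stripe work using percpu->scribble etc. ... */
        local_unlock(&conf->percpu->lock);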
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Song Liu <songliubraving@fb.com>
  2. 05 Jan, 2022 1 commit
  3. 04 Jan, 2022 1 commit
  4. 29 Dec, 2021 1 commit
    • Merge tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme into for-5.17/drivers · 498860df
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "nvme updates for Linux 5.17
      
       - increment request genctr on completion (Keith Busch, Geliang Tang)
       - add a 'iopolicy' module parameter (Hannes Reinecke)
       - print out valid arguments when reading from /dev/nvme-fabrics
         (Hannes Reinecke)"
      
      * tag 'nvme-5.17-2021-12-29' of git://git.infradead.org/nvme:
        nvme: add 'iopolicy' module parameter
        nvme: drop unused variable ctrl in nvme_setup_cmd
        nvme: increment request genctr on completion
        nvme-fabrics: print out valid arguments when reading from /dev/nvme-fabrics
  5. 24 Dec, 2021 1 commit
  6. 23 Dec, 2021 4 commits
  7. 16 Dec, 2021 1 commit
  8. 14 Dec, 2021 4 commits
  9. 13 Dec, 2021 2 commits
  10. 10 Dec, 2021 2 commits
    • null_blk: cast command status to integer · c5eafd79
      Jens Axboe authored
      kernel test robot reports that sparse now triggers a warning on null_blk:
      
      >> drivers/block/null_blk/main.c:1577:55: sparse: sparse: incorrect type in argument 3 (different base types) @@     expected int ioerror @@     got restricted blk_status_t [usertype] error @@
         drivers/block/null_blk/main.c:1577:55: sparse:     expected int ioerror
         drivers/block/null_blk/main.c:1577:55: sparse:     got restricted blk_status_t [usertype] error
      
      because blk_mq_add_to_batch() takes an integer instead of a
      blk_status_t. Just cast this to an integer to silence it; null_blk is
      the odd one out here since its command status already is the "right"
      type. If we changed the function's argument type instead, we'd have to
      do that for other callers too (existing and future ones).
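
      The silenced call then reads roughly (a sketch; end_cmd() is null_blk's
      non-batched completion fallback):

        /* cmd->error is a blk_status_t, blk_mq_add_to_batch() wants an int */
        if (!blk_mq_add_to_batch(req, iob, (__force int) cmd->error,
                                 blk_mq_end_request_batch))
                end_cmd(cmd);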
      
      Fixes: 2385ebf3 ("block: null_blk: batched complete poll requests")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • pktdvd: stop using bdi congestion framework. · db67097a
      NeilBrown authored
      The bdi congestion framework isn't widely used and should be
      deprecated.
      
      pktdvd makes use of it to track congestion, but this can be done
      entirely internally to pktdvd, so it doesn't need to use the framework.
      
      So introduce a "congested" flag.  When waiting for bio_queue_size to
      drop, set this flag and use a var_waitqueue() to wait for it.  When
      bio_queue_size does drop and this flag is set, clear the flag and call
      wake_up_var().
      
      We don't use a wait_var_event macro for the waiting, as we need to set
      the flag and drop the spinlock before calling schedule(), and while
      that is possible with __wait_var_event(), the result is not easy to
      read.
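
      The open-coded wait then looks roughly like this (a sketch based on the
      var_waitqueue/wake_up_var API; the pktcdvd field names are
      illustrative):

        /* writer side: throttle while too many bios are queued */
        spin_lock(&pd->lock);
        if (pd->bio_queue_size > pd->write_congestion_on) {
                struct wait_bit_queue_entry wqe;

                init_wait_var_entry(&wqe, &pd->congested, 0);
                for (;;) {
                        prepare_to_wait_event(__var_waitqueue(&pd->congested),
                                              &wqe.wq_entry,
                                              TASK_UNINTERRUPTIBLE);
                        if (pd->bio_queue_size <= pd->write_congestion_off)
                                break;
                        pd->congested = true;
                        spin_unlock(&pd->lock);
                        schedule();
                        spin_lock(&pd->lock);
                }
                finish_wait(__var_waitqueue(&pd->congested), &wqe.wq_entry);
        }
        spin_unlock(&pd->lock);

        /* completion side: the queue has drained, wake the writer */
        if (pd->congested && pd->bio_queue_size <= pd->write_congestion_off) {
                pd->congested = false;
                wake_up_var(&pd->congested);
        }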
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Link: https://lore.kernel.org/r/163910843527.9928.857338663717630212@noble.neil.brown.name
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  11. 03 Dec, 2021 4 commits
    • block: null_blk: batched complete poll requests · 2385ebf3
      Ming Lei authored
      Complete poll requests via blk_mq_add_to_batch() and
      blk_mq_end_request_batch(), so that the batched completion code path
      is covered when running null_blk tests.
      
      Meanwhile, this shows a ~14% IOPS boost on 't/io_uring /dev/nullb0' in
      my test.
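
      The poll handler's completion loop then follows this pattern (a
      sketch; null_process_cmd() and end_cmd() are null_blk's existing
      per-command helpers, and a later fix casts the status to int):

        /* for each request taken off the poll queue's completion list */
        cmd = blk_mq_rq_to_pdu(req);
        cmd->error = null_process_cmd(cmd, req_op(req), blk_rq_pos(req),
                                      blk_rq_sectors(req));
        /* try to add it to the caller's completion batch; the whole batch
           is finished later via blk_mq_end_request_batch(). Fall back to a
           direct completion if batching is not possible. */
        if (!blk_mq_add_to_batch(req, iob, cmd->error,
                                 blk_mq_end_request_batch))
                end_cmd(cmd);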
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20211203081703.3506020-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • floppy: Add max size check for user space request · 545a3249
      Xiongwei Song authored
      We need to check the maximum request size coming from user space
      before allocating pages. If the request size exceeds the limit, return
      -EINVAL. This check avoids the warning below from the page allocator.
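
      The bound check, in sketch form (MAX_LEN here is an assumed name for
      the largest buffer __get_free_pages() can satisfy, i.e. MAX_ORDER
      pages):

        #define MAX_LEN (1UL << MAX_ORDER << PAGE_SHIFT)

        /* raw_cmd_copyin(): reject oversized user-supplied lengths before
           they reach the page allocator */
        if (ptr->length <= 0 || ptr->length >= MAX_LEN)
                return -EINVAL;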
      
      WARNING: CPU: 3 PID: 16525 at mm/page_alloc.c:5344 current_gfp_context include/linux/sched/mm.h:195 [inline]
      WARNING: CPU: 3 PID: 16525 at mm/page_alloc.c:5344 __alloc_pages+0x45d/0x500 mm/page_alloc.c:5356
      Modules linked in:
      CPU: 3 PID: 16525 Comm: syz-executor.3 Not tainted 5.15.0-syzkaller #0
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
      RIP: 0010:__alloc_pages+0x45d/0x500 mm/page_alloc.c:5344
      Code: be c9 00 00 00 48 c7 c7 20 4a 97 89 c6 05 62 32 a7 0b 01 e8 74 9a 42 07 e9 6a ff ff ff 0f 0b e9 a0 fd ff ff 40 80 e5 3f eb 88 <0f> 0b e9 18 ff ff ff 4c 89 ef 44 89 e6 45 31 ed e8 1e 76 ff ff e9
      RSP: 0018:ffffc90023b87850 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 1ffff92004770f0b RCX: dffffc0000000000
      RDX: 0000000000000000 RSI: 0000000000000033 RDI: 0000000000010cc1
      RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      R10: ffffffff81bb4686 R11: 0000000000000001 R12: ffffffff902c1960
      R13: 0000000000000033 R14: 0000000000000000 R15: ffff88804cf64a30
      FS:  0000000000000000(0000) GS:ffff88802cd00000(0063) knlGS:00000000f44b4b40
      CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      CR2: 000000002c921000 CR3: 000000004f507000 CR4: 0000000000150ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
       __get_free_pages+0x8/0x40 mm/page_alloc.c:5418
       raw_cmd_copyin drivers/block/floppy.c:3113 [inline]
       raw_cmd_ioctl drivers/block/floppy.c:3160 [inline]
       fd_locked_ioctl+0x12e5/0x2820 drivers/block/floppy.c:3528
       fd_ioctl drivers/block/floppy.c:3555 [inline]
       fd_compat_ioctl+0x891/0x1b60 drivers/block/floppy.c:3869
       compat_blkdev_ioctl+0x3b8/0x810 block/ioctl.c:662
       __do_compat_sys_ioctl+0x1c7/0x290 fs/ioctl.c:972
       do_syscall_32_irqs_on arch/x86/entry/common.c:112 [inline]
       __do_fast_syscall_32+0x65/0xf0 arch/x86/entry/common.c:178
       do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:203
       entry_SYSENTER_compat_after_hwframe+0x4d/0x5c
      
      Reported-by: syzbot+23a02c7df2cf2bc93fa2@syzkaller.appspotmail.com
      Link: https://lore.kernel.org/r/20211116131033.27685-1-sxwjean@me.com
      Signed-off-by: Xiongwei Song <sxwjean@gmail.com>
      Signed-off-by: Denis Efremov <efremov@linux.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • floppy: Fix hang in watchdog when disk is ejected · fb48febc
      Tasos Sahanidis authored
      When the watchdog detects a disk change, it calls cancel_activity(),
      which in turn tries to cancel the fd_timer delayed work.
      
      In the above scenario, fd_timer_fn is set to fd_watchdog(), meaning
      the watchdog is trying to cancel its own work.
      This results in a hang, as cancel_delayed_work_sync() waits for the
      watchdog (itself) to return, which never happens.
      
      This can be reproduced relatively consistently by attempting to read a
      broken floppy, and ejecting it while IO is being attempted and retried.
      
      To resolve this, call cancel_delayed_work() instead, which cancels the
      work without waiting for the watchdog to return and finish.
      
      Before this regression was introduced, the code in this section used
      del_timer(), not del_timer_sync(), to delete the watchdog timer.
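
      The change in cancel_activity(), roughly (a sketch of the one-line fix
      described above):

        static void cancel_activity(void)
        {
                do_floppy = NULL;
                cancel_delayed_work(&fd_timer); /* was cancel_delayed_work_sync() */
                cancel_work_sync(&floppy_work);
        }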
      
      Link: https://lore.kernel.org/r/399e486c-6540-db27-76aa-7a271b061f76@tasossah.com
      Fixes: 070ad7e7 ("floppy: convert to delayed work and single-thread wq")
      Signed-off-by: Tasos Sahanidis <tasos@tasossah.com>
      Signed-off-by: Denis Efremov <efremov@linux.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • null_blk: allow zero poll queues · 2bfdbe8b
      Ming Lei authored
      There isn't any reason not to allow zero poll queues from the user's
      point of view.
      
      Also, sometimes we need to compare io poll between poll mode and irq
      mode, so not allowing zero poll queues gets in the way.
      
      Fixes: 15dfc662 ("null_blk: Fix handling of submit_queues and poll_queues attributes")
      Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20211203023935.3424042-1-ming.lei@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 29 Nov, 2021 9 commits