  1. 30 Nov, 2022 6 commits
  2. 29 Nov, 2022 7 commits
  3. 25 Nov, 2022 1 commit
    • blk-mq: fix possible memleak when register 'hctx' failed · 4b7a21c5
      Ye Bin authored
      The following issue occurs during fault injection testing:
      unreferenced object 0xffff888132a9f400 (size 512):
        comm "insmod", pid 308021, jiffies 4324277909 (age 509.733s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 08 f4 a9 32 81 88 ff ff  ...........2....
          08 f4 a9 32 81 88 ff ff 00 00 00 00 00 00 00 00  ...2............
        backtrace:
          [<00000000e8952bb4>] kmalloc_node_trace+0x22/0xa0
          [<00000000f9980e0f>] blk_mq_alloc_and_init_hctx+0x3f1/0x7e0
          [<000000002e719efa>] blk_mq_realloc_hw_ctxs+0x1e6/0x230
          [<000000004f1fda40>] blk_mq_init_allocated_queue+0x27e/0x910
          [<00000000287123ec>] __blk_mq_alloc_disk+0x67/0xf0
          [<00000000a2a34657>] 0xffffffffa2ad310f
          [<00000000b173f718>] 0xffffffffa2af824a
          [<0000000095a1dabb>] do_one_initcall+0x87/0x2a0
          [<00000000f32fdf93>] do_init_module+0xdf/0x320
          [<00000000cbe8541e>] load_module+0x3006/0x3390
          [<0000000069ed1bdb>] __do_sys_finit_module+0x113/0x1b0
          [<00000000a1a29ae8>] do_syscall_64+0x35/0x80
          [<000000009cd878b0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fault injection context as follows:
       kobject_add
       blk_mq_register_hctx
       blk_mq_sysfs_register
       blk_register_queue
       device_add_disk
       null_add_dev.part.0 [null_blk]
      
      'blk_mq_register_hctx' may already have added some objects when it fails
      halfway, but it does no rollback, and the caller does not know which
      objects failed to be added. To solve this, roll back the objects already
      added when 'blk_mq_register_hctx' fails halfway.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20221117022940.873959-1-yebin@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
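
The rollback described in this commit can be illustrated with a minimal userspace sketch. It is not the kernel patch: `fake_kobj`, `fake_add()`, `fake_del()`, `register_all()` and the `fail_at` fault-injection knob are hypothetical stand-ins for kobjects and `kobject_add()`/`kobject_del()`.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a kobject: only tracks whether it was added. */
struct fake_kobj {
    bool added;
};

/* Index at which fake_add() fails, or -1 for no injected failure. */
static int fail_at = -1;

static int fake_add(struct fake_kobj *k, int idx)
{
    if (idx == fail_at)
        return -1;              /* injected failure */
    k->added = true;
    return 0;
}

static void fake_del(struct fake_kobj *k)
{
    k->added = false;
}

/* Add n objects; if one fails, remove the ones that were already added. */
static int register_all(struct fake_kobj *objs, int n)
{
    int i, ret = 0;

    for (i = 0; i < n; i++) {
        ret = fake_add(&objs[i], i);
        if (ret)
            goto out;
    }
    return 0;

out:
    /* Object i was never added, so unwind objects i-1 down to 0. */
    while (--i >= 0)
        fake_del(&objs[i]);
    return ret;
}

/* Count the objects that are still registered. */
static int count_added(const struct fake_kobj *objs, int n)
{
    int i, cnt = 0;

    for (i = 0; i < n; i++)
        cnt += objs[i].added ? 1 : 0;
    return cnt;
}
```

With the unwind loop in place, the caller sees either all objects registered or none, so it does not need to know how far registration got before the failure.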
  4. 24 Nov, 2022 5 commits
    • block: fix crash in 'blk_mq_elv_switch_none' · 90b0296e
      Ye Bin authored
      Syzbot found the following issue:
      general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
      CPU: 0 PID: 5234 Comm: syz-executor931 Not tainted 6.1.0-rc3-next-20221102-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__elevator_get block/elevator.h:94 [inline]
      RIP: 0010:blk_mq_elv_switch_none block/blk-mq.c:4593 [inline]
      RIP: 0010:__blk_mq_update_nr_hw_queues block/blk-mq.c:4658 [inline]
      RIP: 0010:blk_mq_update_nr_hw_queues+0x304/0xe40 block/blk-mq.c:4709
      RSP: 0018:ffffc90003cdfc08 EFLAGS: 00010206
      RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
      RDX: 000000000000001d RSI: 0000000000000002 RDI: 00000000000000e8
      RBP: ffff88801dbd0000 R08: ffff888027c89398 R09: ffffffff8de2e517
      R10: fffffbfff1bc5ca2 R11: 0000000000000000 R12: ffffc90003cdfc70
      R13: ffff88801dbd0008 R14: ffff88801dbd03f8 R15: ffff888027c89380
      FS:  0000555557259300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000005d84c8 CR3: 000000007a7cb000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       nbd_start_device+0x153/0xc30 drivers/block/nbd.c:1355
       nbd_start_device_ioctl drivers/block/nbd.c:1405 [inline]
       __nbd_ioctl drivers/block/nbd.c:1481 [inline]
       nbd_ioctl+0x5a1/0xbd0 drivers/block/nbd.c:1521
       blkdev_ioctl+0x36e/0x800 block/ioctl.c:614
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:870 [inline]
       __se_sys_ioctl fs/ioctl.c:856 [inline]
       __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Commit dd6f7f17 moved '__elevator_get(qe->type)' before the assignment
      of 'qe->type', which leads to a wild pointer access.
      To solve this, take the reference on 'qe->type' after 'qe->type' is set.
      
      Reported-by: syzbot+746a4eece09f86bc39d7@syzkaller.appspotmail.com
      Fixes: dd6f7f17 ("block: add proper helpers for elevator_type module refcount management")
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221107033956.3276891-1-yebin@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
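
The ordering issue can be shown with a small userspace sketch. The types `etype` and `qe_entry` and the function `save_type()` are hypothetical; only the assign-then-reference order mirrors the fix described above.

```c
/* Hypothetical stand-in for an elevator type with a reference count. */
struct etype {
    int refs;
};

/* Hypothetical stand-in for the per-queue entry that remembers the type. */
struct qe_entry {
    struct etype *type;     /* indeterminate until assigned */
};

/*
 * Remember the current type and take a reference on it.
 * The pointer is assigned first; taking the reference through
 * qe->type before this assignment would use an uninitialized pointer.
 */
static void save_type(struct qe_entry *qe, struct etype *cur)
{
    qe->type = cur;
    qe->type->refs++;
}
```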
    • drbd: destroy workqueue when drbd device was freed · 8692814b
      Wang ShaoBo authored
      A submitter workqueue is dynamically allocated by init_submitter(),
      which is called by drbd_create_device(). Destroy it when the device
      is no longer needed or is destroyed.
      
      Fixes: 113fef9e ("drbd: prepare to queue write requests on a submit worker")
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Link: https://lore.kernel.org/r/20221124015817.2729789-3-bobo.shaobowang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • drbd: remove call to memset before free device/resource/connection · 6e7b854e
      Wang ShaoBo authored
      This reverts c2258ffc ("drbd: poison free'd device, resource and
      connection structs"). Adding a memset here for debugging is odd; there
      are methods that show accurately what happened, such as kdump.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Link: https://lore.kernel.org/r/20221124015817.2729789-2-bobo.shaobowang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: mq-deadline: Do not break sequential write streams to zoned HDDs · 015d02f4
      Damien Le Moal authored
      mq-deadline ensures an in order dispatching of write requests to zoned
      block devices using a per zone lock (a bit). This implies that for any
      purely sequential write workload, the drive is exercised most of the
      time at a maximum queue depth of one.
      
      However, when such a sequential write workload crosses a zone boundary
      (when sequentially writing multiple contiguous zones), zone write
      locking may prevent the last write to one zone from being issued (as the
      previous write is still being executed) but allow the first write to the
      following zone to be issued (as that zone is not yet being written and
      not locked). This results in an out of order delivery of the sequential
      write commands to the device every time a zone boundary is crossed.
      
      While such behavior does not break the sequential write constraint of
      zoned block devices (and does not generate any write error), some zoned
      hard-disks react badly to seeing these out of order writes, resulting in
      lower write throughput.
      
      This problem can be addressed by always dispatching the first request
      of a stream of sequential write requests, regardless of the zones
      targeted by these sequential writes. To do so, the function
      deadline_skip_seq_writes() is introduced and used in
      deadline_next_request() to select the next write command to issue if the
      target device is an HDD (blk_queue_nonrot() being false).
      deadline_fifo_request() is modified using the new
      deadline_earlier_request() and deadline_is_seq_write() helpers to ignore
      requests in the fifo list that have a preceding request in lba order
      that is sequential.
      
      With this fix, a sequential write workload executed with the following
      fio command:
      
      fio  --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \
           --size=68719476736  --ioengine=libaio --iodepth=32 --rw=write \
           --bs=65536
      
      results in an increase from 225 MB/s to 250 MB/s of the write throughput
      of an SMR HDD (11% increase).
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
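
The idea of treating a run of LBA-contiguous writes as one stream, regardless of zone boundaries, can be sketched in userspace as follows. This is not the mq-deadline code: `wreq`, `is_seq_write()`, `stream_head()` and `skip_seq_writes()` are hypothetical names, and the requests are assumed to sit in an array sorted by start sector rather than in the scheduler's rb-tree.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical write request: start sector and length in sectors. */
struct wreq {
    uint64_t pos;
    uint32_t nr_sectors;
};

/* True if rq starts exactly where prev ends. */
static bool is_seq_write(const struct wreq *prev, const struct wreq *rq)
{
    return prev->pos + prev->nr_sectors == rq->pos;
}

/*
 * Index of the first request of the sequential stream containing
 * reqs[i]. reqs must be sorted by pos.
 */
static size_t stream_head(const struct wreq *reqs, size_t i)
{
    while (i > 0 && is_seq_write(&reqs[i - 1], &reqs[i]))
        i--;
    return i;
}

/*
 * Index of the first request after the sequential stream that
 * contains reqs[i], or n if the stream runs to the end.
 * Requires i < n and reqs sorted by pos.
 */
static size_t skip_seq_writes(const struct wreq *reqs, size_t n, size_t i)
{
    while (i + 1 < n && is_seq_write(&reqs[i], &reqs[i + 1]))
        i++;
    return i + 1;
}
```

Dispatching from `stream_head()` means the earliest write of a stream goes out first, and `skip_seq_writes()` gives the point from which another stream could be considered.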
    • block: mq-deadline: Fix dd_finish_request() for zoned devices · 2820e5d0
      Damien Le Moal authored
      dd_finish_request() tests if the per prio fifo_list is not empty to
      determine if request dispatching must be restarted for handling blocked
      write requests to zoned devices with a call to
      blk_mq_sched_mark_restart_hctx(). While simple, this implementation has
      2 problems:
      
      1) Only the priority level of the completed request is considered.
         However, writes to a zone may be blocked due to other writes to the
         same zone using a different priority level. While this is unlikely to
         happen in practice, as writing a zone with different IO priorities
         does not make sense, nothing in the code prevents this from
         happening.
      2) The use of list_empty() is dangerous as dd_finish_request() does not
         take dd->lock and may run concurrently with the insert and dispatch
         code.
      
      Fix these 2 problems by testing the write fifo list of all priority
      levels using the new helper dd_has_write_work(), and by testing each
      fifo list using list_empty_careful().
      
      Fixes: c807ab52 ("block/mq-deadline: Add I/O priority support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
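
A rough userspace sketch of checking the write FIFO of every priority level is given below. The list type, `sched_data`, `list_empty_both()` and `has_write_work()` are hypothetical. The emptiness test checks both links in the spirit of `list_empty_careful()`, but uses plain loads and gives none of the kernel helper's ordering guarantees.

```c
#include <stdbool.h>

enum { PRIO_RT, PRIO_BE, PRIO_IDLE, PRIO_COUNT };
enum { DIR_READ, DIR_WRITE, DIR_COUNT };

/* Minimal circular doubly linked list head/node. */
struct list_node {
    struct list_node *next, *prev;
};

static void list_init(struct list_node *h)
{
    h->next = h;
    h->prev = h;
}

static void list_add_tail(struct list_node *n, struct list_node *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Empty only if both links point back at the head (plain loads). */
static bool list_empty_both(const struct list_node *h)
{
    return h->next == h && h->prev == h;
}

/* Hypothetical scheduler data: one FIFO per priority and direction. */
struct sched_data {
    struct list_node fifo[PRIO_COUNT][DIR_COUNT];
};

static void sched_init(struct sched_data *sd)
{
    for (int p = 0; p < PRIO_COUNT; p++)
        for (int d = 0; d < DIR_COUNT; d++)
            list_init(&sd->fifo[p][d]);
}

/* True if any priority level has a queued write. */
static bool has_write_work(const struct sched_data *sd)
{
    for (int p = 0; p < PRIO_COUNT; p++)
        if (!list_empty_both(&sd->fifo[p][DIR_WRITE]))
            return true;
    return false;
}
```

Looping over all priority levels covers problem 1 above: a blocked write queued at a priority other than that of the completed request is still noticed.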
  5. 23 Nov, 2022 9 commits
  6. 22 Nov, 2022 1 commit
  7. 21 Nov, 2022 5 commits
  8. 18 Nov, 2022 1 commit
  9. 16 Nov, 2022 5 commits
    • blk-cgroup: Flush stats at blkgs destruction path · dae590a6
      Waiman Long authored
      As noted by Michal, the blkg_iostat_set's in the lockless list
      hold reference to blkg's to protect against their removal. Those
      blkg's hold reference to blkcg. When a cgroup is being destroyed,
      cgroup_rstat_flush() is only called at css_release_work_fn() which is
      called when the blkcg reference count reaches 0. This circular dependency
      will prevent blkcg from being freed until some other events cause
      cgroup_rstat_flush() to be called to flush out the pending blkcg stats.
      
      To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
      function to flush stats for a given css and cpu and call it at the blkgs
      destruction path, blkcg_destroy_blkgs(), whenever there are still some
      pending stats to be flushed. This will ensure that blkcg reference
      count can reach 0 ASAP.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: Optimize blkcg_rstat_flush() · 3b8cc629
      Waiman Long authored
      For a system with many CPUs and block devices, the time to do
      blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
      can be especially problematic as interrupt is disabled during the flush.
      It was reported that it might take seconds to complete in some extreme
      cases leading to hard lockup messages.
      
      As it is likely that not all the percpu blkg_iostat_set's have been
      updated since the last flush, those stale blkg_iostat_set's don't need
      to be flushed in this case. This patch optimizes blkcg_rstat_flush()
      by keeping a lockless list of recently updated blkg_iostat_set's in a
      newly added percpu blkcg->lhead pointer.
      
      The blkg_iostat_set is added to a lockless list on the update side
      in blk_cgroup_bio_start(). It is removed from the lockless list when
      flushed in blkcg_rstat_flush(). Due to racing, it is possible that
      blkg_iostat_set's in the lockless list may have no new IO stats to be
      flushed, but that is OK.
      
      To protect against destruction of blkg, a percpu reference is taken
      when an entry is put on the lockless list and dropped when it is removed.
      
      When booting up an instrumented test kernel with this patch on a
      2-socket 96-thread system with cgroup v2, out of the 2051 calls to
      cgroup_rstat_flush() after bootup, 1788 of the calls were exited
      immediately because of empty lockless list. After an all-cpu kernel
      build, the ratio became 6295424/6340513. That was more than 99%.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20221105005902.407297-3-longman@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
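
The "only flush what was updated" scheme can be sketched in userspace with C11 atomics. This is an illustration only: `stat_node`, `stat_list`, `stat_update()` and `stat_flush()` are hypothetical, the kernel uses its own llist helpers and percpu data, and the blkg reference counting mentioned above is not modeled. The `pending` counter is assumed to have a single writer.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-entry stats with a link for the lockless list. */
struct stat_node {
    struct stat_node *next;
    atomic_bool queued;         /* true while the node is on the list */
    unsigned long pending;      /* delta not yet flushed */
    unsigned long total;        /* flushed total */
};

/* Lockless singly linked list: only the head is shared. */
struct stat_list {
    _Atomic(struct stat_node *) head;
};

/* Update side: record the delta and queue the node if not yet queued. */
static void stat_update(struct stat_list *l, struct stat_node *n,
                        unsigned long delta)
{
    n->pending += delta;
    if (atomic_exchange(&n->queued, true))
        return;                 /* already on the list */

    struct stat_node *old = atomic_load(&l->head);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&l->head, &old, n));
}

/*
 * Flush side: detach the whole list at once and fold pending deltas
 * into the totals. Returns the number of nodes visited; 0 means the
 * flush had nothing to do and returned immediately.
 */
static size_t stat_flush(struct stat_list *l)
{
    struct stat_node *n = atomic_exchange(&l->head, NULL);
    size_t flushed = 0;

    while (n) {
        struct stat_node *next = n->next;

        atomic_store(&n->queued, false);
        n->total += n->pending;
        n->pending = 0;
        flushed++;
        n = next;
    }
    return flushed;
}
```

A flush on an empty list costs a single atomic exchange, which corresponds to the calls that the commit message reports as exiting immediately.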
    • blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path · b5a9adcb
      Waiman Long authored
      For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
      Simplify error handling code by returning this error directly instead
      of setting an intermediate "ret" variable.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20221105005902.407297-2-longman@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: don't allow a disk link holder to itself · 077a4033
      Yu Kuai authored
      After creating a dm device, a user can reload that device with itself
      as the target, and an infinite loop is triggered because dm keeps
      looking up itself.
      
      Test procedures:
      
      1) dmsetup create test --table "xxx sda", assume dm-0 is created
      2) dmsetup suspend test
      3) dmsetup reload test --table "xxx dm-0"
      4) dmsetup resume test
      
      Test result:
      
      BUG: TASK stack guard page was hit at 00000000736a261f (stack is 000000008d12c88d..00000000c8dd82d5)
      stack guard page: 0000 [#1] PREEMPT SMP
      CPU: 29 PID: 946 Comm: systemd-udevd Not tainted 6.1.0-rc3-next-20221101-00006-g17640ca3b0ee #1295
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
      RIP: 0010:dm_prepare_ioctl+0xf/0x1e0
      Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9
      RSP: 0018:ffffc90002090000 EFLAGS: 00010206
      RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000
      RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000
      RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331
      R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       dm_blk_ioctl+0x50/0x1c0
       ? dm_prepare_ioctl+0xe0/0x1e0
       dm_blk_ioctl+0x88/0x1c0
       dm_blk_ioctl+0x88/0x1c0
       ......(a lot of same lines)
       dm_blk_ioctl+0x88/0x1c0
       dm_blk_ioctl+0x88/0x1c0
       blkdev_ioctl+0x184/0x3e0
       __x64_sys_ioctl+0xa3/0x110
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      RIP: 0033:0x7fddf7306577
      Code: b3 66 90 48 8b 05 11 89 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d e1 88 8
      RSP: 002b:00007ffd0b2ec318 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      RAX: ffffffffffffffda RBX: 00005634ef478320 RCX: 00007fddf7306577
      RDX: 0000000000000000 RSI: 0000000000005331 RDI: 0000000000000007
      RBP: 0000000000000007 R08: 00005634ef4843e0 R09: 0000000000000080
      R10: 00007fddf75cfb38 R11: 0000000000000246 R12: 00000000030d4000
      R13: 0000000000000000 R14: 0000000000000000 R15: 00005634ef48b800
       </TASK>
      Modules linked in:
      ---[ end trace 0000000000000000 ]---
      RIP: 0010:dm_prepare_ioctl+0xf/0x1e0
      Code: da 48 83 05 4a 7c 99 0b 01 41 89 c4 eb cd e8 b8 1f 40 00 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 57 48 83 05 a1 5a 99 0b 01 <41> 56 49 89 d6 41 55 4c 8d af 90 02 00 00 9
      RSP: 0018:ffffc90002090000 EFLAGS: 00010206
      RAX: ffff8881049d6800 RBX: ffff88817e589000 RCX: 0000000000000000
      RDX: ffffc90002090010 RSI: ffffc9000209001c RDI: ffff88817e589000
      RBP: 00000000484a101d R08: 0000000000000000 R09: 0000000000000007
      R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000005331
      R13: 0000000000005331 R14: 0000000000000000 R15: 0000000000000000
      FS:  00007fddf9609200(0000) GS:ffff889fbfd40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: ffffc9000208fff8 CR3: 0000000179043000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Kernel panic - not syncing: Fatal exception in interrupt
      Kernel Offset: disabled
      ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
      
      Fix the problem by forbidding a disk from creating a link to itself.
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221115141054.1051801-11-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
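
The kind of check this fix describes can be sketched in userspace as follows. The types `fake_disk` and `fake_bdev` and the function `link_disk_holder()` are hypothetical, and returning `-EINVAL` for the self-link case is an assumption of this sketch.

```c
#include <errno.h>

/* Hypothetical stand-ins for a disk and a block device on that disk. */
struct fake_disk {
    const char *name;
};

struct fake_bdev {
    struct fake_disk *disk;     /* disk this block device belongs to */
};

/*
 * Link 'holder' as a holder of 'bdev'. A disk that tries to hold a
 * block device of itself is rejected, since following the link would
 * recurse forever.
 */
static int link_disk_holder(const struct fake_bdev *bdev,
                            const struct fake_disk *holder)
{
    if (bdev->disk == holder)
        return -EINVAL;
    /* The real link bookkeeping would go here. */
    return 0;
}
```

In terms of the test procedure above, step 3 corresponds to calling `link_disk_holder()` with dm-0 both as the disk behind `bdev` and as `holder`, which this sketch rejects.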
    • block: store the holder kobject in bd_holder_disk · 3b3449c1
      Yu Kuai authored
      We hold a reference to the holder kobject for each bd_holder_disk,
      so to make the code a bit more robust, use a reference to it instead
      of the block_device.  As long as no one clears ->bd_holder_dir
      before freeing the disk, this isn't strictly required, but it does
      make the code more clear and more robust.
      
      Originally-From: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221115141054.1051801-10-yukuai1@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>