1. 01 Dec, 2022 1 commit
  2. 30 Nov, 2022 7 commits
  3. 29 Nov, 2022 7 commits
  4. 25 Nov, 2022 1 commit
      blk-mq: fix possible memleak when register 'hctx' failed · 4b7a21c5
      Ye Bin authored
      The following issue shows up during fault injection testing:
      unreferenced object 0xffff888132a9f400 (size 512):
        comm "insmod", pid 308021, jiffies 4324277909 (age 509.733s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 08 f4 a9 32 81 88 ff ff  ...........2....
          08 f4 a9 32 81 88 ff ff 00 00 00 00 00 00 00 00  ...2............
        backtrace:
          [<00000000e8952bb4>] kmalloc_node_trace+0x22/0xa0
          [<00000000f9980e0f>] blk_mq_alloc_and_init_hctx+0x3f1/0x7e0
          [<000000002e719efa>] blk_mq_realloc_hw_ctxs+0x1e6/0x230
          [<000000004f1fda40>] blk_mq_init_allocated_queue+0x27e/0x910
          [<00000000287123ec>] __blk_mq_alloc_disk+0x67/0xf0
          [<00000000a2a34657>] 0xffffffffa2ad310f
          [<00000000b173f718>] 0xffffffffa2af824a
          [<0000000095a1dabb>] do_one_initcall+0x87/0x2a0
          [<00000000f32fdf93>] do_init_module+0xdf/0x320
          [<00000000cbe8541e>] load_module+0x3006/0x3390
          [<0000000069ed1bdb>] __do_sys_finit_module+0x113/0x1b0
          [<00000000a1a29ae8>] do_syscall_64+0x35/0x80
          [<000000009cd878b0>] entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fault injection context as follows:
       kobject_add
       blk_mq_register_hctx
       blk_mq_sysfs_register
       blk_register_queue
       device_add_disk
       null_add_dev.part.0 [null_blk]
      
      'blk_mq_register_hctx' may have already added some objects when it fails
      halfway through, but it does not roll them back, and the caller cannot
      tell which additions succeeded. To fix this, unwind the already-added
      objects in 'blk_mq_register_hctx' when adding an object fails partway.
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20221117022940.873959-1-yebin@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4b7a21c5
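The unwind-on-partial-failure pattern the fix describes can be sketched in a minimal userspace C example. All names here (register_obj, register_all, NOBJ) are illustrative stand-ins, not the kernel's hctx/kobject API: on the i-th failure, the loop rolls back the i objects that were already added before returning, so the caller is never left with half-registered state.

```c
#define NOBJ 4

static int registered[NOBJ];

/* Simulates kobject_add(): fails for one chosen index. */
static int register_obj(int i, int fail_at)
{
    if (i == fail_at)
        return -1;
    registered[i] = 1;
    return 0;
}

static void unregister_obj(int i)
{
    registered[i] = 0;
}

/* Register all objects; on failure, unwind the ones already added. */
static int register_all(int fail_at)
{
    int i;

    for (i = 0; i < NOBJ; i++) {
        if (register_obj(i, fail_at) < 0)
            goto out_unwind;
    }
    return 0;

out_unwind:
    while (--i >= 0)
        unregister_obj(i);
    return -1;
}
```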
  5. 24 Nov, 2022 5 commits
    •
      block: fix crash in 'blk_mq_elv_switch_none' · 90b0296e
      Ye Bin authored
      Syzbot found the following issue:
      general protection fault, probably for non-canonical address 0xdffffc000000001d: 0000 [#1] PREEMPT SMP KASAN
      KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
      CPU: 0 PID: 5234 Comm: syz-executor931 Not tainted 6.1.0-rc3-next-20221102-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022
      RIP: 0010:__elevator_get block/elevator.h:94 [inline]
      RIP: 0010:blk_mq_elv_switch_none block/blk-mq.c:4593 [inline]
      RIP: 0010:__blk_mq_update_nr_hw_queues block/blk-mq.c:4658 [inline]
      RIP: 0010:blk_mq_update_nr_hw_queues+0x304/0xe40 block/blk-mq.c:4709
      RSP: 0018:ffffc90003cdfc08 EFLAGS: 00010206
      RAX: 0000000000000000 RBX: dffffc0000000000 RCX: 0000000000000000
      RDX: 000000000000001d RSI: 0000000000000002 RDI: 00000000000000e8
      RBP: ffff88801dbd0000 R08: ffff888027c89398 R09: ffffffff8de2e517
      R10: fffffbfff1bc5ca2 R11: 0000000000000000 R12: ffffc90003cdfc70
      R13: ffff88801dbd0008 R14: ffff88801dbd03f8 R15: ffff888027c89380
      FS:  0000555557259300(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000005d84c8 CR3: 000000007a7cb000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       nbd_start_device+0x153/0xc30 drivers/block/nbd.c:1355
       nbd_start_device_ioctl drivers/block/nbd.c:1405 [inline]
       __nbd_ioctl drivers/block/nbd.c:1481 [inline]
       nbd_ioctl+0x5a1/0xbd0 drivers/block/nbd.c:1521
       blkdev_ioctl+0x36e/0x800 block/ioctl.c:614
       vfs_ioctl fs/ioctl.c:51 [inline]
       __do_sys_ioctl fs/ioctl.c:870 [inline]
       __se_sys_ioctl fs/ioctl.c:856 [inline]
       __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:856
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Commit dd6f7f17 moved '__elevator_get(qe->type)' before 'qe->type' is
      set, so a wild pointer is dereferenced. Fix this by taking the reference
      on 'qe->type' only after 'qe->type' has been set.
      
      Reported-by: syzbot+746a4eece09f86bc39d7@syzkaller.appspotmail.com
      Fixes: dd6f7f17 ("block: add proper helpers for elevator_type module refcount management")
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/20221107033956.3276891-1-yebin@huaweicloud.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      90b0296e
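The ordering bug behind this crash can be illustrated with a minimal sketch (the struct and function names below are invented for illustration, not the kernel's elevator API): the broken code bumped a refcount through qe->type before qe->type was assigned, dereferencing an uninitialized pointer; the fix is simply to store the pointer first and take the reference afterwards.

```c
struct elv_type { int refcnt; };
struct qe_entry { struct elv_type *type; };

static void elevator_get(struct elv_type *t)
{
    t->refcnt++;
}

/* Fixed ordering: assign qe->type first, then it is safe to deref it. */
static void qe_pair(struct qe_entry *qe, struct elv_type *t)
{
    qe->type = t;
    elevator_get(qe->type);
}
```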
    •
      drbd: destroy workqueue when drbd device was freed · 8692814b
      Wang ShaoBo authored
      A submitter workqueue is dynamically allocated by init_submitter()
      called by drbd_create_device(), we should destroy it when this
      device is not needed or destroyed.
      
      Fixes: 113fef9e ("drbd: prepare to queue write requests on a submit worker")
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Link: https://lore.kernel.org/r/20221124015817.2729789-3-bobo.shaobowang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8692814b
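The leak pattern is the classic unpaired init/teardown: a resource allocated by an init path must be released by the matching destroy path. A minimal userspace sketch, with a heap allocation standing in for the drbd submitter workqueue (the struct and function names are illustrative, not drbd's actual API):

```c
#include <stdlib.h>

struct device { void *submit_wq; };

/* Stands in for init_submitter() calling alloc_workqueue(). */
static int init_submitter(struct device *d)
{
    d->submit_wq = malloc(64);
    return d->submit_wq ? 0 : -1;
}

/* The fix: the destroy path releases what the init path allocated. */
static void destroy_device(struct device *d)
{
    free(d->submit_wq);
    d->submit_wq = NULL;
}
```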
    •
      drbd: remove call to memset before free device/resource/connection · 6e7b854e
      Wang ShaoBo authored
      This reverts c2258ffc ("drbd: poison free'd device, resource and
      connection structs"). Poisoning with memset before freeing is an odd
      debugging aid here; there are better tools, such as kdump, to
      accurately show what happened.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Link: https://lore.kernel.org/r/20221124015817.2729789-2-bobo.shaobowang@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      6e7b854e
    •
      block: mq-deadline: Do not break sequential write streams to zoned HDDs · 015d02f4
      Damien Le Moal authored
      mq-deadline ensures an in order dispatching of write requests to zoned
      block devices using a per zone lock (a bit). This implies that for any
      purely sequential write workload, the drive is exercised most of the
      time at a maximum queue depth of one.
      
      However, when such sequential write workload crosses a zone boundary
      (when sequentially writing multiple contiguous zones), zone write
      locking may prevent the last write to one zone to be issued (as the
      previous write is still being executed) but allow the first write to the
      following zone to be issued (as that zone is not yet being written and
      not locked). This results in an out of order delivery of the sequential
      write commands to the device every time a zone boundary is crossed.
      
      While such behavior does not break the sequential write constraint of
      zoned block devices (and does not generate any write error), some zoned
      hard-disks react badly to seeing these out of order writes, resulting in
      lower write throughput.
      
      This problem can be addressed by always dispatching the first request
      of a stream of sequential write requests, regardless of the zones
      targeted by these sequential writes. To do so, the function
      deadline_skip_seq_writes() is introduced and used in
      deadline_next_request() to select the next write command to issue if the
      target device is an HDD (blk_queue_nonrot() being false).
      deadline_fifo_request() is modified using the new
      deadline_earlier_request() and deadline_is_seq_write() helpers to ignore
      requests in the fifo list that have a preceding request in lba order
      that is sequential.
      
      With this fix, a sequential write workload executed with the following
      fio command:
      
      fio  --name=seq-write --filename=/dev/sda --zonemode=zbd --direct=1 \
           --size=68719476736  --ioengine=libaio --iodepth=32 --rw=write \
           --bs=65536
      
      results in an increase from 225 MB/s to 250 MB/s of the write throughput
      of an SMR HDD (11% increase).
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-3-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      015d02f4
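The core test the commit describes, whether a write sequentially follows an earlier one in LBA order, reduces to checking that a request's start sector equals the previous request's end sector. The sketch below is a simplified userspace illustration of that idea (sector math only; the names mimic, but are not, the kernel helpers, and zone-lock handling is omitted): walk a queue and find the first request that does not continue the sequential stream in front of it.

```c
#include <stdbool.h>

struct req { unsigned long long sector; unsigned int nr_sectors; };

/* True if r starts exactly where prev ends (sequential in LBA order). */
static bool is_seq_write(const struct req *prev, const struct req *r)
{
    return prev->sector + prev->nr_sectors == r->sector;
}

/* Return the index of the first request that does NOT sequentially
 * follow its predecessor, i.e. the head of the next write stream. */
static int skip_seq_writes(const struct req *q, int n)
{
    int i;

    for (i = 1; i < n; i++)
        if (!is_seq_write(&q[i - 1], &q[i]))
            break;
    return i;
}
```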
    •
      block: mq-deadline: Fix dd_finish_request() for zoned devices · 2820e5d0
      Damien Le Moal authored
      dd_finish_request() tests if the per prio fifo_list is not empty to
      determine if request dispatching must be restarted for handling blocked
      write requests to zoned devices with a call to
      blk_mq_sched_mark_restart_hctx(). While simple, this implementation has
      2 problems:
      
      1) Only the priority level of the completed request is considered.
         However, writes to a zone may be blocked due to other writes to the
         same zone using a different priority level. While this is unlikely to
   happen in practice, as writing a zone with different IO priorities
         does not make sense, nothing in the code prevents this from
         happening.
      2) The use of list_empty() is dangerous as dd_finish_request() does not
         take dd->lock and may run concurrently with the insert and dispatch
         code.
      
      Fix these 2 problems by testing the write fifo list of all priority
      levels using the new helper dd_has_write_work(), and by testing each
      fifo list using list_empty_careful().
      
      Fixes: c807ab52 ("block/mq-deadline: Add I/O priority support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Link: https://lore.kernel.org/r/20221124021208.242541-2-damien.lemoal@opensource.wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      2820e5d0
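The shape of the first fix, testing every priority level rather than only the completed request's, can be sketched in a few lines. This is an illustrative userspace analog, not the kernel code: an atomic counter per priority class stands in for a tearing-tolerant list_empty_careful() check on each per-prio write FIFO, since the real dd_finish_request() runs without dd->lock. The kernel's three classes (RT, BE, IDLE) motivate DD_PRIO_COUNT = 3.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DD_PRIO_COUNT 3

static atomic_int write_fifo_len[DD_PRIO_COUNT];

/* The fix's shape: check ALL priority levels, not just the one the
 * completed request belonged to, using lock-free reads. */
static bool dd_has_write_work(void)
{
    for (int prio = 0; prio < DD_PRIO_COUNT; prio++)
        if (atomic_load(&write_fifo_len[prio]) > 0)
            return true;
    return false;
}
```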
  6. 23 Nov, 2022 9 commits
  7. 22 Nov, 2022 1 commit
  8. 21 Nov, 2022 5 commits
  9. 18 Nov, 2022 1 commit
  10. 16 Nov, 2022 3 commits
    •
      blk-cgroup: Flush stats at blkgs destruction path · dae590a6
      Waiman Long authored
      As noted by Michal, the blkg_iostat_set's in the lockless list
      hold reference to blkg's to protect against their removal. Those
      blkg's hold reference to blkcg. When a cgroup is being destroyed,
      cgroup_rstat_flush() is only called at css_release_work_fn() which is
      called when the blkcg reference count reaches 0. This circular dependency
      will prevent blkcg from being freed until some other events cause
      cgroup_rstat_flush() to be called to flush out the pending blkcg stats.
      
      To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
      function to flush stats for a given css and cpu and call it at the blkgs
      destruction path, blkcg_destroy_blkgs(), whenever there are still some
      pending stats to be flushed. This will ensure that blkcg reference
      count can reach 0 ASAP.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      dae590a6
    •
      blk-cgroup: Optimize blkcg_rstat_flush() · 3b8cc629
      Waiman Long authored
      For a system with many CPUs and block devices, the time to do
      blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
      can be especially problematic as interrupts are disabled during the flush.
      It was reported that it might take seconds to complete in some extreme
      cases leading to hard lockup messages.
      
      As it is likely that not all the percpu blkg_iostat_set's have been
      updated since the last flush, those stale blkg_iostat_set's don't need
      to be flushed in this case. This patch optimizes blkcg_rstat_flush()
      by keeping a lockless list of recently updated blkg_iostat_set's in a
      newly added percpu blkcg->lhead pointer.
      
      The blkg_iostat_set is added to a lockless list on the update side
      in blk_cgroup_bio_start(). It is removed from the lockless list when
      flushed in blkcg_rstat_flush(). Due to racing, it is possible that
      blkg_iostat_set's in the lockless list may have no new IO stats to be
      flushed, but that is OK.
      
      To protect against destruction of blkg, a percpu reference is gotten
      when putting into the lockless list and put back when removed.
      
      When booting up an instrumented test kernel with this patch on a
      2-socket 96-thread system with cgroup v2, out of the 2051 calls to
      cgroup_rstat_flush() after bootup, 1788 of the calls were exited
      immediately because of empty lockless list. After an all-cpu kernel
      build, the ratio became 6295424/6340513. That was more than 99%.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20221105005902.407297-3-longman@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3b8cc629
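The lockless-list idea behind this optimization can be sketched as a lock-free stack in userspace C11 (this is an assumed simplified shape, not the kernel's llist API, and it omits the percpu and refcounting details): updaters push their stat node onto the list, and the flusher detaches the entire list in a single atomic exchange, so it only walks nodes that were actually updated since the last flush.

```c
#include <stdatomic.h>
#include <stddef.h>

struct lnode { struct lnode *next; };

static _Atomic(struct lnode *) lhead;

/* Update side: push a recently-updated node onto the lock-free list. */
static void lpush(struct lnode *n)
{
    struct lnode *old = atomic_load(&lhead);

    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak(&lhead, &old, n));
}

/* Flush side: detach the whole list at once, then walk only those nodes. */
static struct lnode *ldel_all(void)
{
    return atomic_exchange(&lhead, NULL);
}
```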
    •
      blk-cgroup: Return -ENOMEM directly in blkcg_css_alloc() error path · b5a9adcb
      Waiman Long authored
      For blkcg_css_alloc(), the only error that will be returned is -ENOMEM.
      Simplify error handling code by returning this error directly instead
      of setting an intermediate "ret" variable.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20221105005902.407297-2-longman@redhat.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b5a9adcb
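The simplification reads as a small error-handling cleanup: when allocation failure is the only error, return -ENOMEM at the failure site instead of routing it through an intermediate variable. A hypothetical sketch of that shape (function names are invented for illustration, not blkcg's actual code):

```c
#include <errno.h>
#include <stdlib.h>

/* Stands in for the real allocation; fails on demand for the test. */
static void *alloc_css(int fail)
{
    return fail ? NULL : malloc(32);
}

static int css_alloc(int fail, void **out)
{
    void *css = alloc_css(fail);

    if (!css)
        return -ENOMEM;   /* direct return; no intermediate "ret" */
    *out = css;
    return 0;
}
```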