1. 27 Oct, 2021 2 commits
    • block: avoid extra iter advance with async iocb · 1bb6b810
      Pavel Begunkov authored
      Nobody cares about the iov iterator's state if we return -EIOCBQUEUED,
      so now that we have __blkdev_direct_IO_async(), which gets pages only
      once, we can skip the expensive iov_iter_advance(). It accounts for
      around 1-2% of all CPU time spent.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/a6158edfbfa2ae3bc24aed29a72f035df18fad2f.1635337135.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Add independent access ranges support · a2247f19
      Damien Le Moal authored
      The Concurrent Positioning Ranges VPD page (for SCSI) and data log page
      (for ATA) contain parameters describing the set of contiguous LBAs that
      can be served independently by a single LUN multi-actuator hard-disk.
      Similarly, a logically defined block device composed of multiple disks
      can in some cases execute requests directed at different sector ranges
      in parallel. A dm-linear device aggregating two block devices is an
      example.
      
      This patch implements support for exposing a block device's independent
      access ranges to the user through sysfs, allowing device accesses to be
      optimized for higher performance.
      
      To describe the set of independent sector ranges of a device (actuators
      of a multi-actuator HDD or table entries of a dm-linear device), the
      structure struct blk_independent_access_ranges is introduced. This
      structure describes the sector ranges using an array of
      struct blk_independent_access_range structures. This range structure
      defines the start sector and number of sectors of the access range.
      The ranges in the array cannot overlap and must contain all sectors
      within the device capacity.
      
      The function disk_set_independent_access_ranges() allows a device
      driver to signal to the block layer that a device has multiple
      independent access ranges; calling it attaches a struct
      blk_independent_access_ranges to the device request queue. The function
      disk_alloc_independent_access_ranges() is provided for drivers to
      allocate this structure.
      
      struct blk_independent_access_ranges contains kobjects (struct kobject)
      to expose to the user through sysfs the set of independent access ranges
      supported by a device. When the device is initialized, sysfs
      registration of the ranges information is done from blk_register_queue()
      using the block layer internal function
      disk_register_independent_access_ranges(). If a driver calls
      disk_set_independent_access_ranges() for a registered queue, e.g. when a
      device is revalidated, disk_set_independent_access_ranges() will execute
      disk_register_independent_access_ranges() to update the sysfs attribute
      files.  The sysfs file structure created starts from the
      independent_access_ranges sub-directory and contains the start sector
      and number of sectors of each range, with the information for each range
      grouped in numbered sub-directories.
      
      E.g. for a dual actuator HDD, the user sees:
      
      $ tree /sys/block/sdk/queue/independent_access_ranges/
      /sys/block/sdk/queue/independent_access_ranges/
      |-- 0
      |   |-- nr_sectors
      |   `-- sector
      `-- 1
          |-- nr_sectors
          `-- sector
      
      For a regular device with a single access range, the
      independent_access_ranges sysfs directory does not exist.
      
      Device revalidation may lead to changes to this structure and to the
      attribute values. When manipulated, the queue sysfs_lock and
      sysfs_dir_lock mutexes are held for atomicity, similarly to how the
      blk-mq and elevator sysfs queue sub-directories are protected.
      
      The code related to the management of independent access ranges is
      added in the new file block/blk-ia-ranges.c.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Link: https://lore.kernel.org/r/20211027022223.183838-2-damien.lemoal@wdc.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 26 Oct, 2021 1 commit
  3. 25 Oct, 2021 4 commits
    • sbitmap: silence data race warning · 9f8b93a7
      Jens Axboe authored
      KCSAN complains about the sbitmap hint update:
      
      ==================================================================
      BUG: KCSAN: data-race in sbitmap_queue_clear / sbitmap_queue_clear
      
      write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 1:
       sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
       blk_mq_put_tag+0x82/0x90
       __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
       blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
       __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
       blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
       lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
       blk_complete_reqs block/blk-mq.c:584 [inline]
       blk_done_softirq+0x69/0x90 block/blk-mq.c:589
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
       smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
       kthread+0x262/0x280 kernel/kthread.c:319
       ret_from_fork+0x1f/0x30
      
      write to 0xffffe8ffffd145b8 of 4 bytes by interrupt on cpu 0:
       sbitmap_queue_clear+0xca/0xf0 lib/sbitmap.c:606
       blk_mq_put_tag+0x82/0x90
       __blk_mq_free_request+0x114/0x180 block/blk-mq.c:507
       blk_mq_free_request+0x2c8/0x340 block/blk-mq.c:541
       __blk_mq_end_request+0x214/0x230 block/blk-mq.c:565
       blk_mq_end_request+0x37/0x50 block/blk-mq.c:574
       lo_complete_rq+0xca/0x170 drivers/block/loop.c:541
       blk_complete_reqs block/blk-mq.c:584 [inline]
       blk_done_softirq+0x69/0x90 block/blk-mq.c:589
       __do_softirq+0x12c/0x26e kernel/softirq.c:558
       run_ksoftirqd+0x13/0x20 kernel/softirq.c:920
       smpboot_thread_fn+0x22f/0x330 kernel/smpboot.c:164
       kthread+0x262/0x280 kernel/kthread.c:319
       ret_from_fork+0x1f/0x30
      
      value changed: 0x00000035 -> 0x00000044
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 10 Comm: ksoftirqd/0 Not tainted 5.15.0-rc6-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      ==================================================================
      
      which is a data race, but not an important one. This is just updating the
      percpu alloc hint, and the reader of that hint doesn't ever require it to
      be valid.
      
      Just annotate it with data_race() to silence this one.
      
      Reported-by: syzbot+4f8bfd804b4a1f95b8f6@syzkaller.appspotmail.com
      Acked-by: Marco Elver <elver@google.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-cgroup: synchronize blkg creation against policy deactivation · 0c9d338c
      Yu Kuai authored
      Our test reports a null pointer dereference:
      
      [  168.534653] ==================================================================
      [  168.535614] Disabling lock debugging due to kernel taint
      [  168.536346] BUG: kernel NULL pointer dereference, address: 0000000000000008
      [  168.537274] #PF: supervisor read access in kernel mode
      [  168.537964] #PF: error_code(0x0000) - not-present page
      [  168.538667] PGD 0 P4D 0
      [  168.539025] Oops: 0000 [#1] PREEMPT SMP KASAN
      [  168.539656] CPU: 13 PID: 759 Comm: bash Tainted: G    B             5.15.0-rc2-next-202100
      [  168.540954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_0738364
      [  168.542736] RIP: 0010:bfq_pd_init+0x88/0x1e0
      [  168.543318] Code: 98 00 00 00 e8 c9 e4 5b ff 4c 8b 65 00 49 8d 7c 24 08 e8 bb e4 5b ff 4d0
      [  168.545803] RSP: 0018:ffff88817095f9c0 EFLAGS: 00010002
      [  168.546497] RAX: 0000000000000001 RBX: ffff888101a1c000 RCX: 0000000000000000
      [  168.547438] RDX: 0000000000000003 RSI: 0000000000000002 RDI: ffff888106553428
      [  168.548402] RBP: ffff888106553400 R08: ffffffff961bcaf4 R09: 0000000000000001
      [  168.549365] R10: ffffffffa2e16c27 R11: fffffbfff45c2d84 R12: 0000000000000000
      [  168.550291] R13: ffff888101a1c098 R14: ffff88810c7a08c8 R15: ffffffffa55541a0
      [  168.551221] FS:  00007fac75227700(0000) GS:ffff88839ba80000(0000) knlGS:0000000000000000
      [  168.552278] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  168.553040] CR2: 0000000000000008 CR3: 0000000165ce7000 CR4: 00000000000006e0
      [  168.554000] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  168.554929] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  168.555888] Call Trace:
      [  168.556221]  <TASK>
      [  168.556510]  blkg_create+0x1c0/0x8c0
      [  168.556989]  blkg_conf_prep+0x574/0x650
      [  168.557502]  ? stack_trace_save+0x99/0xd0
      [  168.558033]  ? blkcg_conf_open_bdev+0x1b0/0x1b0
      [  168.558629]  tg_set_conf.constprop.0+0xb9/0x280
      [  168.559231]  ? kasan_set_track+0x29/0x40
      [  168.559758]  ? kasan_set_free_info+0x30/0x60
      [  168.560344]  ? tg_set_limit+0xae0/0xae0
      [  168.560853]  ? do_sys_openat2+0x33b/0x640
      [  168.561383]  ? do_sys_open+0xa2/0x100
      [  168.561877]  ? __x64_sys_open+0x4e/0x60
      [  168.562383]  ? __kasan_check_write+0x20/0x30
      [  168.562951]  ? copyin+0x48/0x70
      [  168.563390]  ? _copy_from_iter+0x234/0x9e0
      [  168.563948]  tg_set_conf_u64+0x17/0x20
      [  168.564467]  cgroup_file_write+0x1ad/0x380
      [  168.565014]  ? cgroup_file_poll+0x80/0x80
      [  168.565568]  ? __mutex_lock_slowpath+0x30/0x30
      [  168.566165]  ? pgd_free+0x100/0x160
      [  168.566649]  kernfs_fop_write_iter+0x21d/0x340
      [  168.567246]  ? cgroup_file_poll+0x80/0x80
      [  168.567796]  new_sync_write+0x29f/0x3c0
      [  168.568314]  ? new_sync_read+0x410/0x410
      [  168.568840]  ? __handle_mm_fault+0x1c97/0x2d80
      [  168.569425]  ? copy_page_range+0x2b10/0x2b10
      [  168.570007]  ? _raw_read_lock_bh+0xa0/0xa0
      [  168.570622]  vfs_write+0x46e/0x630
      [  168.571091]  ksys_write+0xcd/0x1e0
      [  168.571563]  ? __x64_sys_read+0x60/0x60
      [  168.572081]  ? __kasan_check_write+0x20/0x30
      [  168.572659]  ? do_user_addr_fault+0x446/0xff0
      [  168.573264]  __x64_sys_write+0x46/0x60
      [  168.573774]  do_syscall_64+0x35/0x80
      [  168.574264]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  168.574960] RIP: 0033:0x7fac74915130
      [  168.575456] Code: 73 01 c3 48 8b 0d 58 ed 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 444
      [  168.577969] RSP: 002b:00007ffc3080e288 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  168.578986] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fac74915130
      [  168.579937] RDX: 0000000000000009 RSI: 000056007669f080 RDI: 0000000000000001
      [  168.580884] RBP: 000056007669f080 R08: 000000000000000a R09: 00007fac75227700
      [  168.581841] R10: 000056007655c8f0 R11: 0000000000000246 R12: 0000000000000009
      [  168.582796] R13: 0000000000000001 R14: 00007fac74be55e0 R15: 00007fac74be08c0
      [  168.583757]  </TASK>
      [  168.584063] Modules linked in:
      [  168.584494] CR2: 0000000000000008
      [  168.584964] ---[ end trace 2475611ad0f77a1a ]---
      
      This is because blkg_alloc() is called from blkg_conf_prep() without
      holding 'q->queue_lock', and the elevator is exited before blkg_create()
      runs:
      
      thread 1                            thread 2
      blkg_conf_prep
       spin_lock_irq(&q->queue_lock);
       blkg_lookup_check -> return NULL
       spin_unlock_irq(&q->queue_lock);
      
       blkg_alloc
        blkcg_policy_enabled -> true
        pd = ->pd_alloc_fn
        blkg->pd[i] = pd
                                         blk_mq_exit_sched
                                          bfq_exit_queue
                                           blkcg_deactivate_policy
                                            spin_lock_irq(&q->queue_lock);
                                            __clear_bit(pol->plid, q->blkcg_pols);
                                            spin_unlock_irq(&q->queue_lock);
                                          q->elevator = NULL;
        spin_lock_irq(&q->queue_lock);
         blkg_create
          if (blkg->pd[i])
           ->pd_init_fn -> q->elevator is NULL
        spin_unlock_irq(&q->queue_lock);
      
      Because blkcg_deactivate_policy() requires the queue to be frozen, we
      can grab q_usage_counter to synchronize blkg_conf_prep() against
      blkcg_deactivate_policy().
      
      Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
      Signed-off-by: Yu Kuai <yukuai3@huawei.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/r/20211020014036.2141723-1-yukuai3@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: refactor bio_iov_bvec_set() · fa5fa8ec
      Pavel Begunkov authored
      Combine bio_iov_bvec_set() and bio_iov_bvec_set_append() and let the
      caller do the iov_iter_advance(). Also get rid of __bio_iov_bvec_set(),
      which was duplicated in the final binary, and replace a weird
      iov_iter_truncate() of a temporary iter copy with a min() that better
      reflects the intention.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/r/bcf1ac36fce769a514e19475f3623cd86a1d8b72.1635006010.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: add single bio async direct IO helper · 54a88eb8
      Pavel Begunkov authored
      As with __blkdev_direct_IO_simple(), we can implement direct IO more
      efficiently if there is only one bio. Add __blkdev_direct_IO_async() and
      blkdev_bio_end_io_async(). This patch brings me from 4.45-4.5 MIOPS with
      nullblk to 4.7+.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/f0ae4109b7a6934adede490f84d188d53b97051b.1635006010.git.asml.silence@gmail.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 23 Oct, 2021 1 commit
    • sched: make task_struct->plug always defined · 599593a8
      Jens Axboe authored
      If CONFIG_BLOCK isn't set, then it's an empty struct anyway. Just make
      it generally available, so we don't break the compile:
      
      kernel/sched/core.c: In function ‘sched_submit_work’:
      kernel/sched/core.c:6346:35: error: ‘struct task_struct’ has no member named ‘plug’
       6346 |                 blk_flush_plug(tsk->plug, true);
            |                                   ^~
      kernel/sched/core.c: In function ‘io_schedule_prepare’:
      kernel/sched/core.c:8357:20: error: ‘struct task_struct’ has no member named ‘plug’
       8357 |         if (current->plug)
            |                    ^~
      kernel/sched/core.c:8358:39: error: ‘struct task_struct’ has no member named ‘plug’
       8358 |                 blk_flush_plug(current->plug, true);
            |                                       ^~
      Reported-by: Nathan Chancellor <nathan@kernel.org>
      Fixes: 008f75a2 ("block: cleanup the flush plug helpers")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 22 Oct, 2021 2 commits
  6. 21 Oct, 2021 15 commits
  7. 20 Oct, 2021 15 commits