1. 25 Mar, 2020 4 commits
    • block: factor out requeue handling from dispatch code · c92a4103
      Johannes Thumshirn authored
      Factor out the requeue handling from the dispatch code; this will make
      subsequent addition of different requeueing schemes easier.
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/diskstats: replace time_in_queue with sum of request times · 8cd5b8fc
      Konstantin Khlebnikov authored
      Column "time_in_queue" in diskstats is supposed to show total waiting time
      of all requests. I.e. value should be equal to the sum of times from other
      columns. But this is not true, because column "time_in_queue" is counted
      separately in jiffies rather than in nanoseconds as other times.
      
      This patch removes the redundant counter for "time_in_queue" and
      instead shows the total time of read, write, discard and flush requests.
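
      As a rough userspace sketch of the idea (illustrative only; the struct
      and field names below are not the kernel's), the reported total is
      derived from the per-operation nanosecond counters that diskstats
      already keeps, rather than from a separate jiffies-based counter:

      #include <stdio.h>
      #include <stdint.h>

      enum { OP_READ, OP_WRITE, OP_DISCARD, OP_FLUSH, NR_OPS };

      struct disk_stats {                 /* illustrative, not the kernel struct */
              uint64_t nsecs[NR_OPS];     /* total request time per op type, ns */
      };

      /* "time_in_queue" reported in milliseconds, as diskstats does */
      static uint64_t time_in_queue_ms(const struct disk_stats *s)
      {
              uint64_t total_ns = 0;
              int op;

              for (op = 0; op < NR_OPS; op++)
                      total_ns += s->nsecs[op];
              return total_ns / 1000000;
      }

      int main(void)
      {
              struct disk_stats s = { .nsecs = { 4000000, 2500000, 0, 1000000 } };

              printf("time_in_queue: %llu ms\n",
                     (unsigned long long)time_in_queue_ms(&s));
              return 0;
      }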
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/diskstats: accumulate all per-cpu counters in one pass · ea18e0f0
      Konstantin Khlebnikov authored
      Reading /proc/diskstats iterates over all CPUs once for every field
      when summing. It is faster to sum all fields in a single pass.
      
      Hammering /proc/diskstats with fio shows 2x performance improvement:
      
      fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
          --size=1k --bs=1k --fallocate=none --create_on_open=1 \
          --time_based=1 --runtime=10 --invalidate=0 --group_reporting
      
      	  JOBS=1	JOBS=10
      Before:	  7k iops	64k iops
      After:	 18k iops      120k iops
      
      Also, this way the code is more compact:
      
      add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
      Function                                     old     new   delta
      part_stat_read_all                             -     194    +194
      diskstats_show                              1344     631    -713
      part_stat_show                              1219     392    -827
      Total: Before=14966947, After=14965601, chg -0.01%
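
      The shape of the change, as a hedged userspace sketch (the types and
      names below are illustrative stand-ins, not the kernel's
      part_stat_read_all()): walk the CPUs once and accumulate every field,
      instead of walking them once per field:

      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>

      #define NR_CPUS   4
      #define NR_FIELDS 8                 /* e.g. ios/sectors/ticks per op type */

      struct stats { uint64_t field[NR_FIELDS]; };   /* illustrative */

      static struct stats percpu[NR_CPUS];

      /* before: one walk over all CPUs for every single field */
      static uint64_t read_one_field(int f)
      {
              uint64_t sum = 0;
              int cpu;

              for (cpu = 0; cpu < NR_CPUS; cpu++)
                      sum += percpu[cpu].field[f];
              return sum;
      }

      /* after: a single walk over the CPUs accumulating all fields at once */
      static void read_all_fields(struct stats *out)
      {
              int cpu, f;

              memset(out, 0, sizeof(*out));
              for (cpu = 0; cpu < NR_CPUS; cpu++)
                      for (f = 0; f < NR_FIELDS; f++)
                              out->field[f] += percpu[cpu].field[f];
      }

      int main(void)
      {
              struct stats total;
              int cpu, f;

              for (cpu = 0; cpu < NR_CPUS; cpu++)
                      for (f = 0; f < NR_FIELDS; f++)
                              percpu[cpu].field[f] = cpu + f;

              read_all_fields(&total);
              for (f = 0; f < NR_FIELDS; f++)
                      printf("field %d: %llu (per-field read: %llu)\n", f,
                             (unsigned long long)total.field[f],
                             (unsigned long long)read_one_field(f));
              return 0;
      }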
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block/diskstats: more accurate approximation of io_ticks for slow disks · 2b8bd423
      Konstantin Khlebnikov authored
      Currently io_ticks is approximated by adding one at each start and end
      of a request if the jiffies counter has changed. This works perfectly
      for requests shorter than a jiffy, or if one of the requests
      starts/ends at each jiffy.
      
      If the disk executes just one request at a time and the requests are
      longer than two jiffies, then only the first and last jiffies will be
      accounted.
      
      The fix is simple: at the end of a request, add to io_ticks the
      jiffies that have passed since the last update, rather than just one
      jiffy.
      
      Example: a common HDD executes random 4k read requests in around 12ms each.
      
      fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
      iostat -x 10 sdb
      
      Note the change in iostat's "%util" from 8,43% to 99,99% before/after the patch:
      
      Before:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,60    0,00   330,40     0,00     8,00     0,96   12,09   12,09    0,00   1,02   8,43
      
      After:
      
      Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
      sdb               0,00     0,00   82,50    0,00   330,00     0,00     8,00     1,00   12,10   12,10    0,00  12,12  99,99
      
      Now io_ticks does not lose time between the start and end of
      requests, but for queue depth > 1 some I/O time between adjacent
      starts might still be lost.
      
      For load estimation "%util" is not as useful as the average queue
      length, but it clearly shows how often the disk queue is completely
      empty.
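
      A minimal userspace simulation of the accounting change (illustrative
      only, not the kernel code): on completion, credit all jiffies elapsed
      since the last stamp instead of a single jiffy:

      #include <stdio.h>
      #include <stdint.h>
      #include <stdbool.h>

      static uint64_t stamp;              /* jiffy of the last io_ticks update */
      static uint64_t io_ticks;           /* accumulated "disk busy" jiffies */

      static void update_io_ticks(uint64_t now, bool end)
      {
              if (now == stamp)
                      return;
              if (end)
                      io_ticks += now - stamp;    /* after: credit the whole gap */
              else
                      io_ticks += 1;              /* start of request: one jiffy */
              stamp = now;
      }

      int main(void)
      {
              uint64_t now = 0;
              int i;

              /* one request at a time, each lasting ~12 jiffies (HDD example) */
              for (i = 0; i < 10; i++) {
                      update_io_ticks(now, false);    /* request start */
                      now += 12;
                      update_io_ticks(now, true);     /* request end */
              }
              /* before the patch, the end path also added just one jiffy,
               * which would yield 10 of 120 jiffies here instead of 120 */
              printf("io_ticks = %llu of %llu jiffies busy\n",
                     (unsigned long long)io_ticks, (unsigned long long)now);
              return 0;
      }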
      
      Fixes: 5b18b5a7 ("block: delete part_round_stats and switch to less precise counting")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 24 Mar, 2020 21 commits
  3. 21 Mar, 2020 5 commits
    • block, bfq: invoke flush_idle_tree after reparent_active_queues in pd_offline · 4d38a87f
      Paolo Valente authored
      In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
      flush the rb tree that contains all idle entities belonging to the pd
      (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
      invoked before bfq_reparent_active_queues(). Yet the latter may add
      some entities to the idle tree: this happens if, in some of the calls
      to bfq_bfqq_move() performed by bfq_reparent_active_queues(), the
      queue to move is empty and gets expired.
      
      This commit simply reverses the invocation order between
      bfq_flush_idle_tree() and bfq_reparent_active_queues().
      
      Tested-by: cki-project@redhat.com
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: make reparent_leaf_entity actually work only on leaf entities · 576682fa
      Paolo Valente authored
      bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
      entity represents just a bfq_queue in an entity tree). Yet, the input
      entity is guaranteed to always be a leaf entity only in two-level
      entity trees. In this respect, because of the error fixed by
      commit 14afc593 ("block, bfq: fix overwrite of bfq_group pointer
      in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
      to actually have only two levels. After the latter commit, this does not
      hold any longer.
      
      This commit fixes the problem by modifying
      bfq_reparent_leaf_entity() so that it searches for an active leaf
      entity down the path that stems from the input entity. Such a leaf
      entity is guaranteed to exist when bfq_reparent_leaf_entity() is
      invoked.
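
      A generic, hedged sketch of the idea (the structure below is an
      illustrative stand-in, not BFQ's bfq_entity): when the input entity
      may be an inner node of a deeper tree, descend the active path until a
      leaf is reached before acting on it:

      #include <stdio.h>
      #include <stddef.h>

      struct entity {                         /* stand-in, not bfq_entity */
              const char    *name;
              struct entity *active_child;    /* NULL for a leaf (a queue) */
      };

      /* descend the active path from e until a leaf entity is found */
      static struct entity *find_active_leaf(struct entity *e)
      {
              while (e && e->active_child)
                      e = e->active_child;
              return e;
      }

      int main(void)
      {
              struct entity queue = { "leaf queue",  NULL };
              struct entity group = { "inner group", &queue };
              struct entity root  = { "root group",  &group };

              /* with more than two levels, the input entity is not a leaf */
              printf("reparent: %s\n", find_active_leaf(&root)->name);
              return 0;
      }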
      
      Tested-by: cki-project@redhat.com
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: turn put_queue into release_process_ref in __bfq_bic_change_cgroup · c8997736
      Paolo Valente authored
      A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
      goal of this put is to release a process reference to a bfq_queue.
      But process-reference releases may also trigger some extra operations
      and, to this end, are handled through bfq_release_process_ref(). So
      turn the invocation of bfq_put_queue() into an invocation of
      bfq_release_process_ref().
      
      Tested-by: cki-project@redhat.com
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: move forward the getting of an extra ref in bfq_bfqq_move · fd1bb3ae
      Paolo Valente authored
      Commit ecedd3d7 ("block, bfq: get extra ref to prevent a queue
      from being freed during a group move") gets an extra reference to a
      bfq_queue before possibly deactivating it (temporarily) in
      bfq_bfqq_move(). This prevents the bfq_queue from disappearing before
      being reactivated in its new group.
      
      Yet the bfq_queue may also be expired (i.e., its service may be
      stopped) before it is deactivated, and an expiration too may lead to
      a premature freeing. This commit fixes the issue by simply moving
      forward the getting of the extra reference already introduced by
      commit ecedd3d7 ("block, bfq: get extra ref to prevent a queue
      from being freed during a group move").
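
      A hedged userspace analogue of the ordering (plain reference counting;
      all names below are illustrative, not BFQ's code): take the extra
      reference before any step that might drop the last one, not in between
      such steps:

      #include <stdio.h>
      #include <stdlib.h>

      struct obj { int ref; };            /* illustrative stand-in for bfq_queue */

      static void obj_get(struct obj *o) { o->ref++; }

      static void obj_put(struct obj *o)
      {
              if (--o->ref == 0) {
                      printf("freed\n");
                      free(o);
              }
      }

      /* both steps may drop a reference to the object */
      static void expire(struct obj *o)     { obj_put(o); }
      static void deactivate(struct obj *o) { obj_put(o); }

      static void move_obj(struct obj *o)
      {
              obj_get(o);         /* extra ref taken up front, before expire() */
              expire(o);
              deactivate(o);
              /* ... reactivate in the new group ... */
              obj_put(o);         /* drop the extra ref once the move is done */
      }

      int main(void)
      {
              struct obj *o = malloc(sizeof(*o));

              o->ref = 3;         /* e.g. refs held by process and scheduler */
              move_obj(o);
              obj_put(o);         /* drop the last reference */
              return 0;
      }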
      
      Reported-by: cki-project@redhat.com
      Tested-by: cki-project@redhat.com
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: fix use-after-free in bfq_idle_slice_timer_body · 2f95fa5c
      Zhiqiang Liu authored
      In the bfq_idle_slice_timer() function, bfqq = bfqd->in_service_queue
      is read outside the bfqd->lock critical section. The bfqq, which is
      not NULL in bfq_idle_slice_timer(), may be freed after being passed
      to bfq_idle_slice_timer_body(), so we would end up accessing freed
      memory.
      
      In addition, since bfqq may be raced against, bfq_idle_slice_timer_body()
      should first check whether bfqq is still in service before doing
      anything with it. If a raced bfqq is no longer in service, it has
      already been expired through __bfq_bfqq_expire(), and its wait_request
      flag has already been cleared in __bfq_bfqd_reset_in_service(), so
      there is no need to re-clear wait_request for a bfqq that is not in
      service.
      
      KASAN log is given as follows:
      [13058.354613] ==================================================================
      [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
      [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
      [13058.354646]
      [13058.354655] CPU: 96 PID: 19767 Comm: fork13
      [13058.354661] Call trace:
      [13058.354667]  dump_backtrace+0x0/0x310
      [13058.354672]  show_stack+0x28/0x38
      [13058.354681]  dump_stack+0xd8/0x108
      [13058.354687]  print_address_description+0x68/0x2d0
      [13058.354690]  kasan_report+0x124/0x2e0
      [13058.354697]  __asan_load8+0x88/0xb0
      [13058.354702]  bfq_idle_slice_timer+0xac/0x290
      [13058.354707]  __hrtimer_run_queues+0x298/0x8b8
      [13058.354710]  hrtimer_interrupt+0x1b8/0x678
      [13058.354716]  arch_timer_handler_phys+0x4c/0x78
      [13058.354722]  handle_percpu_devid_irq+0xf0/0x558
      [13058.354731]  generic_handle_irq+0x50/0x70
      [13058.354735]  __handle_domain_irq+0x94/0x110
      [13058.354739]  gic_handle_irq+0x8c/0x1b0
      [13058.354742]  el1_irq+0xb8/0x140
      [13058.354748]  do_wp_page+0x260/0xe28
      [13058.354752]  __handle_mm_fault+0x8ec/0x9b0
      [13058.354756]  handle_mm_fault+0x280/0x460
      [13058.354762]  do_page_fault+0x3ec/0x890
      [13058.354765]  do_mem_abort+0xc0/0x1b0
      [13058.354768]  el0_da+0x24/0x28
      [13058.354770]
      [13058.354773] Allocated by task 19731:
      [13058.354780]  kasan_kmalloc+0xe0/0x190
      [13058.354784]  kasan_slab_alloc+0x14/0x20
      [13058.354788]  kmem_cache_alloc_node+0x130/0x440
      [13058.354793]  bfq_get_queue+0x138/0x858
      [13058.354797]  bfq_get_bfqq_handle_split+0xd4/0x328
      [13058.354801]  bfq_init_rq+0x1f4/0x1180
      [13058.354806]  bfq_insert_requests+0x264/0x1c98
      [13058.354811]  blk_mq_sched_insert_requests+0x1c4/0x488
      [13058.354818]  blk_mq_flush_plug_list+0x2d4/0x6e0
      [13058.354826]  blk_flush_plug_list+0x230/0x548
      [13058.354830]  blk_finish_plug+0x60/0x80
      [13058.354838]  read_pages+0xec/0x2c0
      [13058.354842]  __do_page_cache_readahead+0x374/0x438
      [13058.354846]  ondemand_readahead+0x24c/0x6b0
      [13058.354851]  page_cache_sync_readahead+0x17c/0x2f8
      [13058.354858]  generic_file_buffered_read+0x588/0xc58
      [13058.354862]  generic_file_read_iter+0x1b4/0x278
      [13058.354965]  ext4_file_read_iter+0xa8/0x1d8 [ext4]
      [13058.354972]  __vfs_read+0x238/0x320
      [13058.354976]  vfs_read+0xbc/0x1c0
      [13058.354980]  ksys_read+0xdc/0x1b8
      [13058.354984]  __arm64_sys_read+0x50/0x60
      [13058.354990]  el0_svc_common+0xb4/0x1d8
      [13058.354994]  el0_svc_handler+0x50/0xa8
      [13058.354998]  el0_svc+0x8/0xc
      [13058.354999]
      [13058.355001] Freed by task 19731:
      [13058.355007]  __kasan_slab_free+0x120/0x228
      [13058.355010]  kasan_slab_free+0x10/0x18
      [13058.355014]  kmem_cache_free+0x288/0x3f0
      [13058.355018]  bfq_put_queue+0x134/0x208
      [13058.355022]  bfq_exit_icq_bfqq+0x164/0x348
      [13058.355026]  bfq_exit_icq+0x28/0x40
      [13058.355030]  ioc_exit_icq+0xa0/0x150
      [13058.355035]  put_io_context_active+0x250/0x438
      [13058.355038]  exit_io_context+0xd0/0x138
      [13058.355045]  do_exit+0x734/0xc58
      [13058.355050]  do_group_exit+0x78/0x220
      [13058.355054]  __wake_up_parent+0x0/0x50
      [13058.355058]  el0_svc_common+0xb4/0x1d8
      [13058.355062]  el0_svc_handler+0x50/0xa8
      [13058.355066]  el0_svc+0x8/0xc
      [13058.355067]
      [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70 which belongs to the cache bfq_queue of size 464
      [13058.355075] The buggy address is located 264 bytes inside of 464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
      [13058.355077] The buggy address belongs to the page:
      [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
      [13058.366175] flags: 0x2ffffe0000008100(slab|head)
      [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
      [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
      [13058.370789] page dumped because: kasan: bad access detected
      [13058.370791]
      [13058.370792] Memory state around the buggy address:
      [13058.370797]  ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
      [13058.370801]  ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [13058.370808]                                                                 ^
      [13058.370811]  ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [13058.370815]  ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
      [13058.370817] ==================================================================
      [13058.370820] Disabling lock debugging due to kernel taint
      
      Here, we directly pass bfqd to the bfq_idle_slice_timer_body() function.
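
      A hedged userspace analogue of the resulting pattern (a pthread mutex
      stands in for bfqd->lock; all names below are illustrative, not the
      kernel code): pass the container, take its lock, re-read the
      in-service queue, and only then touch it:

      #include <pthread.h>
      #include <stdio.h>

      struct queue_like { int wait_request; };    /* illustrative stand-ins */

      struct data_like {
              pthread_mutex_t    lock;
              struct queue_like *in_service_queue;
      };

      static void idle_slice_timer_body(struct data_like *bfqd)
      {
              struct queue_like *bfqq;

              pthread_mutex_lock(&bfqd->lock);
              /* read the in-service queue only while holding the lock; a
               * bfqq captured before locking could already have been freed */
              bfqq = bfqd->in_service_queue;
              if (bfqq)
                      bfqq->wait_request = 0;     /* still in service: safe */
              pthread_mutex_unlock(&bfqd->lock);
      }

      int main(void)
      {
              struct queue_like q = { .wait_request = 1 };
              struct data_like  d = {
                      .lock             = PTHREAD_MUTEX_INITIALIZER,
                      .in_service_queue = &q,
              };

              idle_slice_timer_body(&d);
              printf("wait_request = %d\n", q.wait_request);
              return 0;
      }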
      --
      V2->V3: rewrite the comment as suggested by Paolo Valente
      V1->V2: add one comment, and add Fixes and Reported-by tag.
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Reported-by: Wang Wang <wangwang2@huawei.com>
      Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
      Signed-off-by: Feilong Lin <linfeilong@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 18 Mar, 2020 7 commits
  5. 12 Mar, 2020 3 commits