1. 30 Mar, 2017 3 commits
  2. 29 Mar, 2017 14 commits
    • Jens Axboe's avatar
      blk-mq: include errors in did_work calculation · 3e8a7069
      Jens Axboe authored
      Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
      successfully, but we really want to return whether or not the we made
      progress. Progress includes if we got an error return.  If we don't,
      this can lead to a hang in blk_mq_sched_dispatch_requests() when a
      driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
      manually ending the IO in error and return BLK_MQ_QUEUE_OK.
      Tested-by: default avatarJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      3e8a7069
    • Josef Bacik's avatar
      block-mq: don't re-queue if we get a queue error · b58e1769
      Josef Bacik authored
      When try to issue a request directly and we fail we will requeue the
      request, but call blk_mq_end_request() as well.  This leads to the
      completed request being on a queuelist and getting ended twice, which
      causes list corruption in schedulers and other shenanigans.
      Signed-off-by: default avatarJosef Bacik <jbacik@fb.com>
      Reviewed-by: default avatarMing Lei <tom.leiming@gmail.com>
      Reviewed-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b58e1769
    • Tahsin Erdogan's avatar
      blkcg: allocate struct blkcg_gq outside request queue spinlock · 457e490f
      Tahsin Erdogan authored
      blkg_conf_prep() currently calls blkg_lookup_create() while holding
      request queue spinlock. This means allocating memory for struct
      blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
      failures in call paths like below:
      
        pcpu_alloc+0x68f/0x710
        __alloc_percpu_gfp+0xd/0x10
        __percpu_counter_init+0x55/0xc0
        cfq_pd_alloc+0x3b2/0x4e0
        blkg_alloc+0x187/0x230
        blkg_create+0x489/0x670
        blkg_lookup_create+0x9a/0x230
        blkg_conf_prep+0x1fb/0x240
        __cfqg_set_weight_device.isra.105+0x5c/0x180
        cfq_set_weight_on_dfl+0x69/0xc0
        cgroup_file_write+0x39/0x1c0
        kernfs_fop_write+0x13f/0x1d0
        __vfs_write+0x23/0x120
        vfs_write+0xc2/0x1f0
        SyS_write+0x44/0xb0
        entry_SYSCALL_64_fastpath+0x18/0xad
      
      In the code path above, percpu allocator cannot call vmalloc() due to
      queue spinlock.
      
      A failure in this call path gives grief to tools which are trying to
      configure io weights. We see occasional failures happen shortly after
      reboots even when system is not under any memory pressure. Machines
      with a lot of cpus are more vulnerable to this condition.
      
      Do struct blkcg_gq allocations outside the queue spinlock to allow
      blocking during memory allocations.
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarTahsin Erdogan <tahsin@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      457e490f
    • Jens Axboe's avatar
      Revert "blkcg: allocate struct blkcg_gq outside request queue spinlock" · d708f0d5
      Jens Axboe authored
      I inadvertently applied the v5 version of this patch, whereas
      the agreed upon version was v5. Revert this one so we can apply
      the right one.
      
      This reverts commit 7fc6b87a.
      d708f0d5
    • Jens Axboe's avatar
      blk-mq: fix a typo and a spelling mistake · 48b99c9d
      Jens Axboe authored
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      48b99c9d
    • Sagi Grimberg's avatar
      blk-mq-pci: Fix two spelling mistakes · 018c259b
      Sagi Grimberg authored
      Signed-off-by: default avatarSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      018c259b
    • Omar Sandoval's avatar
      block: fix leak of q->rq_wb · 02ba8893
      Omar Sandoval authored
      CONFIG_DEBUG_TEST_DRIVER_REMOVE found a possible leak of q->rq_wb when a
      request queue is reregistered. This has been a problem since wbt was
      introduced, but the WARN_ON(!list_empty(&stats->callbacks)) in the
      blk-stat rework exposed it. Fix it by cleaning up wbt when we unregister
      the queue.
      
      Fixes: 87760e5e ("block: hook up writeback throttling")
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      02ba8893
    • Omar Sandoval's avatar
      blk-mq: fix leak of q->stats · 0c9539a4
      Omar Sandoval authored
      blk_alloc_queue_node() already allocates q->stats, so
      blk_mq_init_allocated_queue() is overwriting it with a new allocation.
      
      Fixes: a83b576c ("block: fix stacked driver stats init and free")
      Reviewed-by: default avatarMing Lei <tom.leiming@gmail.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      0c9539a4
    • Omar Sandoval's avatar
      block: warn if sharing request queue across gendisks · 334335d2
      Omar Sandoval authored
      Now that the remaining drivers have been converted to one request queue
      per gendisk, let's warn if a request queue gets registered more than
      once. This will catch future drivers which might do it inadvertently or
      any old drivers that I may have missed.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      334335d2
    • Ming Lei's avatar
      block: block new I/O just after queue is set as dying · d3cfb2a0
      Ming Lei authored
      Before commit 780db207(blk-mq: decouble blk-mq freezing
      from generic bypassing), the dying flag is checked before
      entering queue, and Tejun converts the checking into .mq_freeze_depth,
      and assumes the counter is increased just after dying flag
      is set. Unfortunately we doesn't do that in blk_set_queue_dying().
      
      This patch calls blk_freeze_queue_start() in blk_set_queue_dying(),
      so that we can block new I/O coming once the queue is set as dying.
      
      Given blk_set_queue_dying() is always called in remove path
      of block device, and queue will be cleaned up later, we don't
      need to worry about undoing the counter.
      
      Cc: Tejun Heo <tj@kernel.org>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d3cfb2a0
    • Ming Lei's avatar
      block: rename blk_mq_freeze_queue_start() · 1671d522
      Ming Lei authored
      As the .q_usage_counter is used by both legacy and
      mq path, we need to block new I/O if queue becomes
      dead in blk_queue_enter().
      
      So rename it and we can use this function in both
      paths.
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      1671d522
    • Ming Lei's avatar
      block: add a read barrier in blk_queue_enter() · 5ed61d3f
      Ming Lei authored
      Without the barrier, reading DEAD flag of .q_usage_counter
      and reading .mq_freeze_depth may be reordered, then the
      following wait_event_interruptible() may never return.
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      5ed61d3f
    • Ming Lei's avatar
      blk-mq: comment on races related with timeout handler · d9d149a3
      Ming Lei authored
      This patch adds comment on two races related with
      timeout handler:
      
      - requeue from queue busy vs. timeout
      - rq free & reallocation vs. timeout
      
      Both the races themselves and current solution aren't
      explicit enough, so add comments on them.
      
      Cc: Bart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d9d149a3
    • Ming Lei's avatar
      blk-mq: don't complete un-started request in timeout handler · a4ef8e56
      Ming Lei authored
      When iterating busy requests in timeout handler,
      if the STARTED flag of one request isn't set, that means
      the request is being processed in block layer or driver, and
      isn't submitted to hardware yet.
      
      In current implementation of blk_mq_check_expired(),
      if the request queue becomes dying, un-started requests are
      handled as being completed/freed immediately. This way is
      wrong, and can cause rq corruption or double allocation[1][2],
      when doing I/O and removing&resetting NVMe device at the sametime.
      
      This patch fixes several issues reported by Yi Zhang.
      
      [1]. oops log 1
      [  581.789754] ------------[ cut here ]------------
      [  581.789758] kernel BUG at block/blk-mq.c:374!
      [  581.789760] invalid opcode: 0000 [#1] SMP
      [  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
      irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
      intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
      intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
      pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
      sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
      crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
      dm_region_hash dm_log dm_mod
      [  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
      [  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  581.789804] Workqueue: kblockd blk_mq_timeout_work
      [  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
      [  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
      [  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
      [  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
      [  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
      [  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
      [  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
      [  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
      [  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
      [  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
      [  581.789820] Call Trace:
      [  581.789825]  blk_mq_check_expired+0x76/0x80
      [  581.789828]  bt_iter+0x45/0x50
      [  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
      [  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789840]  ? __switch_to+0x140/0x450
      [  581.789841]  blk_mq_timeout_work+0x88/0x170
      [  581.789845]  process_one_work+0x165/0x410
      [  581.789847]  worker_thread+0x137/0x4c0
      [  581.789851]  kthread+0x101/0x140
      [  581.789853]  ? rescuer_thread+0x3b0/0x3b0
      [  581.789855]  ? kthread_park+0x90/0x90
      [  581.789860]  ret_from_fork+0x2c/0x40
      [  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
      8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
      0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
      [  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
      [  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---
      
      [2]. oops log2
      [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857373] PGD 0
      [ 6984.857374]
      [ 6984.857376] Oops: 0000 [#1] SMP
      [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
      irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
      iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
      mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
      acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
      sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
      libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
      dm_region_hash dm_log dm_mod
      [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
      4.10.0-2.el7.bz1420297.x86_64 #1
      [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
      [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
      [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
      [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
      [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
      [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
      [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
      [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
      [ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
      [ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
      [ 6984.857443] Call Trace:
      [ 6984.857451]  ? mempool_free+0x2b/0x80
      [ 6984.857455]  ? bio_free+0x4e/0x60
      [ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
      [ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
      [ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
      [ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
      [ 6984.857473]  process_one_work+0x165/0x410
      [ 6984.857475]  worker_thread+0x137/0x4c0
      [ 6984.857478]  kthread+0x101/0x140
      [ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
      [ 6984.857481]  ? kthread_park+0x90/0x90
      [ 6984.857489]  ret_from_fork+0x2c/0x40
      [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
      89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
      8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
      [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
      [ 6984.857512] CR2: 0000000000000010
      [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---
      
      Cc: stable@vger.kernel.org
      Reported-by: default avatarYi Zhang <yizhan@redhat.com>
      Tested-by: default avatarYi Zhang <yizhan@redhat.com>
      Reviewed-by: default avatarBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarMing Lei <tom.leiming@gmail.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a4ef8e56
  3. 28 Mar, 2017 23 commits
    • Tahsin Erdogan's avatar
      blkcg: allocate struct blkcg_gq outside request queue spinlock · 7fc6b87a
      Tahsin Erdogan authored
      blkg_conf_prep() currently calls blkg_lookup_create() while holding
      request queue spinlock. This means allocating memory for struct
      blkcg_gq has to be made non-blocking. This causes occasional -ENOMEM
      failures in call paths like below:
      
        pcpu_alloc+0x68f/0x710
        __alloc_percpu_gfp+0xd/0x10
        __percpu_counter_init+0x55/0xc0
        cfq_pd_alloc+0x3b2/0x4e0
        blkg_alloc+0x187/0x230
        blkg_create+0x489/0x670
        blkg_lookup_create+0x9a/0x230
        blkg_conf_prep+0x1fb/0x240
        __cfqg_set_weight_device.isra.105+0x5c/0x180
        cfq_set_weight_on_dfl+0x69/0xc0
        cgroup_file_write+0x39/0x1c0
        kernfs_fop_write+0x13f/0x1d0
        __vfs_write+0x23/0x120
        vfs_write+0xc2/0x1f0
        SyS_write+0x44/0xb0
        entry_SYSCALL_64_fastpath+0x18/0xad
      
      In the code path above, percpu allocator cannot call vmalloc() due to
      queue spinlock.
      
      A failure in this call path gives grief to tools which are trying to
      configure io weights. We see occasional failures happen shortly after
      reboots even when system is not under any memory pressure. Machines
      with a lot of cpus are more vulnerable to this condition.
      
      Update blkg_create() function to temporarily drop the rcu and queue
      locks when it is allowed by gfp mask.
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarTahsin Erdogan <tahsin@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      7fc6b87a
    • Omar Sandoval's avatar
      jsflash: stop sharing request queue across multiple gendisks · 8b0c441e
      Omar Sandoval authored
      Compile-tested only (by hacking it to compile on x86).
      
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      8b0c441e
    • Omar Sandoval's avatar
      swim: stop sharing request queue across multiple gendisks · 103db8b2
      Omar Sandoval authored
      Compile-tested only (by hacking it to compile on x86).
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      103db8b2
    • Omar Sandoval's avatar
      parport/pf: stop sharing request queue across multiple gendisks · 3a644142
      Omar Sandoval authored
      Compile-tested only.
      
      Cc: Tim Waugh <tim@cyberelk.net>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      3a644142
    • Omar Sandoval's avatar
      parport/pcd: stop sharing request queue across multiple gendisks · 547b50a1
      Omar Sandoval authored
      Compile-tested only.
      
      Cc: Tim Waugh <tim@cyberelk.net>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      547b50a1
    • Omar Sandoval's avatar
      parport/pd: stop sharing request queue across multiple gendisks · eaf487ca
      Omar Sandoval authored
      Compile-tested only.
      
      Cc: Tim Waugh <tim@cyberelk.net>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      eaf487ca
    • Omar Sandoval's avatar
      hd: stop sharing request queue across multiple gendisks · a893cd76
      Omar Sandoval authored
      Compile-tested only.
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      a893cd76
    • Shaohua Li's avatar
      blk-throttle: add latency target support · 53696b8d
      Shaohua Li authored
      One hard problem adding .low limit is to detect idle cgroup. If one
      cgroup doesn't dispatch enough IO against its low limit, we must have a
      mechanism to determine if other cgroups dispatch more IO. We added the
      think time detection mechanism before, but it doesn't work for all
      workloads. Here we add a latency based approach.
      
      We already have mechanism to calculate latency threshold for each IO
      size. For every IO dispatched from a cgorup, we compare its latency
      against its threshold and record the info. If most IO latency is below
      threshold (in the code I use 75%), the cgroup could be treated idle and
      other cgroups can dispatch more IO.
      
      Currently this latency target check is only for SSD as we can't
      calcualte the latency target for hard disk. And this is only for cgroup
      leaf node so far.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      53696b8d
    • Shaohua Li's avatar
      blk-throttle: add a mechanism to estimate IO latency · b9147dd1
      Shaohua Li authored
      User configures latency target, but the latency threshold for each
      request size isn't fixed. For a SSD, the IO latency highly depends on
      request size. To calculate latency threshold, we sample some data, eg,
      average latency for request size 4k, 8k, 16k, 32k .. 1M. The latency
      threshold of each request size will be the sample latency (I'll call it
      base latency) plus latency target. For example, the base latency for
      request size 4k is 80us and user configures latency target 60us. The 4k
      latency threshold will be 80 + 60 = 140us.
      
      To sample data, we calculate the order base 2 of rounded up IO sectors.
      If the IO size is bigger than 1M, it will be accounted as 1M. Since the
      calculation does round up, the base latency will be slightly smaller
      than actual value. Also if there isn't any IO dispatched for a specific
      IO size, we will use the base latency of smaller IO size for this IO
      size.
      
      But we shouldn't sample data at any time. The base latency is supposed
      to be latency where disk isn't congested, because we use latency
      threshold to schedule IOs between cgroups. If disk is congested, the
      latency is higher, using it for scheduling is meaningless. Hence we only
      do the sampling when block throttling is in the LOW limit, with
      assumption disk isn't congested in such state. If the assumption isn't
      true, eg, low limit is too high, calculated latency threshold will be
      higher.
      
      Hard disk is completely different. Latency depends on spindle seek
      instead of request size. Currently this feature is SSD only, we probably
      can use a fixed threshold like 4ms for hard disk though.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b9147dd1
    • Shaohua Li's avatar
      block: track request size in blk_issue_stat · 88eeca49
      Shaohua Li authored
      Currently there is no way to know the request size when the request is
      finished. Next patch will need this info. We could add extra field to
      record the size, but blk_issue_stat has enough space to record it, so
      this patch just overloads blk_issue_stat. With this, we will have 49bits
      to track time, which still is very long time.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      88eeca49
    • Shaohua Li's avatar
      blk-throttle: add interface for per-cgroup target latency · ec80991d
      Shaohua Li authored
      Here we introduce per-cgroup latency target. The target determines how a
      cgroup can afford latency increasement. We will use the target latency
      to calculate a threshold and use it to schedule IO for cgroups. If a
      cgroup's bandwidth is below its low limit but its average latency is
      below the threshold, other cgroups can safely dispatch more IO even
      their bandwidth is higher than their low limits. On the other hand, if
      the first cgroup's latency is higher than the threshold, other cgroups
      are throttled to their low limits. So the target latency determines how
      we efficiently utilize free disk resource without sacifice of worload's
      IO latency.
      
      For example, assume 4k IO average latency is 50us when disk isn't
      congested. A cgroup sets the target latency to 30us. Then the cgroup can
      accept 50+30=80us IO latency. If the cgroupt's average IO latency is
      90us and its bandwidth is below low limit, other cgroups are throttled
      to their low limit. If the cgroup's average IO latency is 60us, other
      cgroups are allowed to dispatch more IO. When other cgroups dispatch
      more IO, the first cgroup's IO latency will increase. If it increases to
      81us, we then throttle other cgroups.
      
      User will configure the interface in this way:
      echo "8:16 rbps=2097152 wbps=max latency=100 idle=200" > io.low
      
      latency is in microsecond unit
      
      By default, latency target is 0, which means to guarantee IO latency.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ec80991d
    • Shaohua Li's avatar
      blk-throttle: ignore idle cgroup limit · fa6fb5aa
      Shaohua Li authored
      Last patch introduces a way to detect idle cgroup. We use it to make
      upgrade/downgrade decision. And the new algorithm can detect completely
      idle cgroup too, so we can delete the corresponding code.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      fa6fb5aa
    • Shaohua Li's avatar
      blk-throttle: add interface to configure idle time threshold · ada75b6e
      Shaohua Li authored
      Add interface to configure the threshold. The io.low interface will
      like:
      echo "8:16 rbps=2097152 wbps=max idle=2000" > io.low
      
      idle is in microsecond unit.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      ada75b6e
    • Shaohua Li's avatar
      blk-throttle: add a simple idle detection · 9e234eea
      Shaohua Li authored
      A cgroup gets assigned a low limit, but the cgroup could never dispatch
      enough IO to cross the low limit. In such case, the queue state machine
      will remain in LIMIT_LOW state and all other cgroups will be throttled
      according to low limit. This is unfair for other cgroups. We should
      treat the cgroup idle and upgrade the state machine to lower state.
      
      We also have a downgrade logic. If the state machine upgrades because of
      cgroup idle (real idle), the state machine will downgrade soon as the
      cgroup is below its low limit. This isn't what we want. A more
      complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
      when queue gets upgraded to lower state, other cgroups could dispatch
      more IO and this cgroup can't dispatch enough IO, so the cgroup is below
      its low limit and looks like idle (fake idle). In this case, the queue
      should downgrade soon. The key to determine if we should do downgrade is
      to detect if cgroup is truely idle.
      
      Unfortunately it's very hard to determine if a cgroup is real idle. This
      patch uses the 'think time check' idea from CFQ for the purpose. Please
      note, the idea doesn't work for all workloads. For example, a workload
      with io depth 8 has disk utilization 100%, hence think time is 0, eg,
      not idle. But the workload can run higher bandwidth with io depth 16.
      Compared to io depth 16, the io depth 8 workload is idle. We use the
      idea to roughly determine if a cgroup is idle.
      
      We treat a cgroup idle if its think time is above a threshold (by
      default 1ms for SSD and 100ms for HD). The idea is think time above the
      threshold will start to harm performance. HD is much slower so a longer
      think time is ok.
      
      The patch (and the latter patches) uses 'unsigned long' to track time.
      We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
      precision, should not a big deal.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      9e234eea
    • Shaohua Li's avatar
      blk-throttle: make bandwidth change smooth · 7394e31f
      Shaohua Li authored
      When cgroups all reach low limit, cgroups can dispatch more IO. This
      could make some cgroups dispatch more IO but others not, and even some
      cgroups could dispatch less IO than their low limit. For example, cg1
      low limit 10MB/s, cg2 limit 80MB/s, assume disk maximum bandwidth is
      120M/s for the workload. Their bps could something like this:
      
      cg1/cg2 bps: T1: 10/80 -> T2: 60/60 -> T3: 10/80
      
      At T1, all cgroups reach low limit, so they can dispatch more IO later.
      Then cg1 dispatch more IO and cg2 has no room to dispatch enough IO. At
      T2, cg2 only dispatches 60M/s. Since We detect cg2 dispatches less IO
      than its low limit 80M/s, we downgrade the queue from LIMIT_MAX to
      LIMIT_LOW, then all cgroups are throttled to their low limit (T3). cg2
      will have bandwidth below its low limit at most time.
      
      The big problem here is we don't know the maximum bandwidth of the
      workload, so we can't make smart decision to avoid the situation. This
      patch makes cgroup bandwidth change smooth. After disk upgrades from
      LIMIT_LOW to LIMIT_MAX, we don't allow cgroups use all bandwidth upto
      their max limit immediately. Their bandwidth limit will be increased
      gradually to avoid above situation. So above example will became
      something like:
      
      cg1/cg2 bps: 10/80 -> 15/105 -> 20/100 -> 25/95 -> 30/90 -> 35/85 -> 40/80
      -> 45/75 -> 22/98
      
      In this way cgroups bandwidth will be above their limit in majority
      time, this still doesn't fully utilize disk bandwidth, but that's
      something we pay for sharing.
      
      Scale up is linear. The limit scales up 1/2 .low limit every
      throtl_slice after upgrade. The scale up will stop if the adjusted limit
      hits .max limit. Scale down is exponential. We cut the scale value half
      if a cgroup doesn't hit its .low limit. If the scale becomes 0, we then
      fully downgrade the queue to LIMIT_LOW state.
      
      Note this doesn't completely avoid cgroup running under its low limit.
      The best way to guarantee cgroup doesn't run under its limit is to set
      max limit. For example, if we set cg1 max limit to 40, cg2 will never
      run under its low limit.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      7394e31f
    • Shaohua Li's avatar
      blk-throttle: detect completed idle cgroup · aec24246
      Shaohua Li authored
      cgroup could be assigned a limit, but doesn't dispatch enough IO, eg the
      cgroup is idle. When this happens, the cgroup doesn't hit its limit, so
      we can't move the state machine to higher level and all cgroups will be
      throttled to their lower limit, so we waste bandwidth. Detecting idle
      cgroup is hard. This patch handles a simple case, a cgroup doesn't
      dispatch any IO. We ignore such cgroup's limit, so other cgroups can use
      the bandwidth.
      
      Please note this will be replaced with a more sophisticated algorithm
      later, but this demonstrates the idea how we handle idle cgroups, so I
      leave it here.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      aec24246
    • Shaohua Li's avatar
      blk-throttle: choose a small throtl_slice for SSD · d61fcfa4
      Shaohua Li authored
      The throtl_slice is 100ms by default. This is a long time for SSD, a lot
      of IO can run. To make cgroups have smoother throughput, we choose a
      small value (20ms) for SSD.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      d61fcfa4
    • Shaohua Li's avatar
      blk-throttle: make throtl_slice tunable · 297e3d85
      Shaohua Li authored
      throtl_slice is important for blk-throttling. It's called slice
      internally but it really is a time window blk-throttling samples data.
      blk-throttling will make decision based on the samplings. An example is
      bandwidth measurement. A cgroup's bandwidth is measured in the time
      interval of throtl_slice.
      
      A small throtl_slice meanse cgroups have smoother throughput but burn
      more CPUs. It has 100ms default value, which is not appropriate for all
      disks. A fast SSD can dispatch a lot of IOs in 100ms. This patch makes
      it tunable.
      
      Since throtl_slice isn't a time slice, the sysfs name
      'throttle_sample_time' reflects its character better.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      297e3d85
    • Shaohua Li's avatar
      blk-throttle: make sure expire time isn't too big · 06cceedc
      Shaohua Li authored
      cgroup could be throttled to a limit but when all cgroups cross high
      limit, queue enters a higher state and so the group should be throttled
      to a higher limit. It's possible the cgroup is sleeping because of
      throttle and other cgroups don't dispatch IO any more. In this case,
      nobody can trigger current downgrade/upgrade logic. To fix this issue,
      we could either set up a timer to wakeup the cgroup if other cgroups are
      idle or make sure this cgroup doesn't sleep too long. Setting up a timer
      means we must change the timer very frequently. This patch chooses the
      latter. Making cgroup sleep time not too big wouldn't change cgroup
      bps/iops, but could make it wakeup more frequently, which isn't a big
      issue because throtl_slice * 8 is already quite big.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      06cceedc
    • Shaohua Li's avatar
      blk-throttle: add downgrade logic · 3f0abd80
      Shaohua Li authored
      When queue state machine is in LIMIT_MAX state, but a cgroup is below
      its low limit for some time, the queue should be downgraded to lower
      state as one cgroup's low limit isn't met.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      3f0abd80
    • Shaohua Li's avatar
      blk-throttle: add upgrade logic for LIMIT_LOW state · c79892c5
      Shaohua Li authored
      When queue is in LIMIT_LOW state and all cgroups with low limit cross
      the bps/iops limitation, we will upgrade queue's state to
      LIMIT_MAX. To determine if a cgroup exceeds its limitation, we check if
      the cgroup has pending request. Since cgroup is throttled according to
      the limit, pending request means the cgroup reaches the limit.
      
      If a cgroup has limit set for both read and write, we consider the
      combination of them for upgrade. The reason is read IO and write IO can
      interfere with each other. If we do the upgrade based in one direction
      IO, the other direction IO could be severly harmed.
      
      For a cgroup hierarchy, there are two cases. Children has lower low
      limit than parent. Parent's low limit is meaningless. If children's
      bps/iops cross low limit, we can upgrade queue state. The other case is
      children has higher low limit than parent. Children's low limit is
      meaningless. As long as parent's bps/iops (which is a sum of childrens
      bps/iops) cross low limit, we can upgrade queue state.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      c79892c5
    • Shaohua Li's avatar
      blk-throttle: configure bps/iops limit for cgroup in low limit · b22c417c
      Shaohua Li authored
      each queue will have a state machine. Initially queue is in LIMIT_LOW
      state, which means all cgroups will be throttled according to their low
      limit. After all cgroups with low limit cross the limit, the queue state
      gets upgraded to LIMIT_MAX state.
      For max limit, cgroup will use the limit configured by user.
      For low limit, cgroup will use the minimal value between low limit and
      max limit configured by user. If the minimal value is 0, which means the
      cgroup doesn't configure low limit, we will use max limit to throttle
      the cgroup and the cgroup is ready to upgrade to LIMIT_MAX
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      b22c417c
    • Shaohua Li's avatar
      blk-throttle: add .low interface · cd5ab1b0
      Shaohua Li authored
      Add low limit for cgroup and corresponding cgroup interface. To be
      consistent with memcg, we allow users configure .low limit higher than
      .max limit. But the internal logic always assumes .low limit is lower
      than .max limit. So we add extra bps/iops_conf fields in throtl_grp for
      userspace configuration. Old bps/iops fields in throtl_grp will be the
      actual limit we use for throttling.
      Signed-off-by: default avatarShaohua Li <shli@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@fb.com>
      cd5ab1b0