1. 16 Aug, 2018 5 commits
    • Maciej S. Szmigiero's avatar
      block, bfq: return nbytes and not zero from struct cftype .write() method · fc8ebd01
      Maciej S. Szmigiero authored
      The value that struct cftype .write() method returns is then directly
      returned to userspace as the value returned by write() syscall, so it
      should be the number of bytes actually written (or consumed) and not zero.
      
      Returning zero from write() syscall makes programs like /bin/echo or bash
      spin.
      Signed-off-by: default avatarMaciej S. Szmigiero <mail@maciej.szmigiero.name>
      Fixes: e21b7a0b ("block, bfq: add full hierarchical scheduling and cgroups support")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      fc8ebd01
    • Paolo Valente's avatar
      block, bfq: improve code of bfq_bfqq_charge_time · f8121648
      Paolo Valente authored
      bfq_bfqq_charge_time contains some lengthy and redundant code. This
      commit trims and condenses that code.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f8121648
    • Paolo Valente's avatar
      block, bfq: reduce write overcharge · d5801088
      Paolo Valente authored
      When a sync request is dispatched, the queue that contains that
      request, and all the ancestor entities of that queue, are charged with
      the number of sectors of the request. In constrast, if the request is
      async, then the queue and its ancestor entities are charged with the
      number of sectors of the request, multiplied by an overcharge
      factor. This throttles the bandwidth for async I/O, w.r.t. to sync
      I/O, and it is done to counter the tendency of async writes to steal
      I/O throughput to reads.
      
      On the opposite end, the lower this parameter, the stabler I/O
      control, in the following respect.  The lower this parameter is, the
      less the bandwidth enjoyed by a group decreases
      - when the group does writes, w.r.t. to when it does reads;
      - when other groups do reads, w.r.t. to when they do writes.
      
      The fixes "block, bfq: always update the budget of an entity when
      needed" and "block, bfq: readd missing reset of parent-entity service"
      improved I/O control in bfq to such an extent that it has been
      possible to revise this overcharge factor downwards.  This commit
      introduces the resulting, new value.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d5801088
    • Paolo Valente's avatar
      block, bfq: always update the budget of an entity when needed · e02a0aa2
      Paolo Valente authored
      When the next child entity to serve changes for a given parent entity,
      the budget of that parent entity must be updated accordingly.
      Unfortunately, this update is not performed, by mistake, for the
      entities that happen to switch from having no child entity to serve,
      to having one child entity to serve.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e02a0aa2
    • Paolo Valente's avatar
      block, bfq: readd missing reset of parent-entity service · 8a511ba5
      Paolo Valente authored
      The received-service counter needs to be equal to 0 when an entity is
      set in service. Unfortunately, commit "block, bfq: fix service being
      wrongly set to zero in case of preemption" mistakenly removed the
      resetting of this counter for the parent entities of the bfq_queue
      being set in service. This commit fixes this issue by resetting
      service for parent entities, directly on the expiration of the
      in-service bfq_queue.
      
      Fixes: 9fae8dd5 ("block, bfq: fix service being wrongly set to zero in case of preemption")
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8a511ba5
  2. 14 Aug, 2018 2 commits
  3. 11 Aug, 2018 18 commits
  4. 10 Aug, 2018 1 commit
    • Coly Li's avatar
      bcache: fix error setting writeback_rate through sysfs interface · 46451874
      Coly Li authored
      Commit ea8c5356 ("bcache: set max writeback rate when I/O request
      is idle") changes struct bch_ratelimit member rate from uint32_t to
      atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
      to set new writeback rate, after the input is converted from memory
      buf to long int by sysfs_strtoul_clamp().
      
      The above change has a problem because there is an implicit return
      inside sysfs_strtoul_clamp() so the following atomic_long_set()
      won't be called. This error is detected by 0day system with following
      snipped smatch warnings:
      
      drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
      symbol 'v'.
      270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      @271 atomic_long_set(&dc->writeback_rate.rate, v);
      
      This patch fixes the above error by using strtoul_safe_clamp() to
      convert the input buffer into a long int type result.
      
      Fixes: ea8c5356 ("bcache: set max writeback rate when I/O request is idle")
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Stefan Priebe <s.priebe@profihost.ag>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      46451874
  5. 09 Aug, 2018 14 commits
    • Jens Axboe's avatar
      null_blk: add lock drop/acquire annotation · 61884de0
      Jens Axboe authored
      sparse complains:
      
      drivers/block/null_blk_main.c:816:24: sparse: context imbalance in 'null_insert_page' - unexpected unlock
      
      Fix it by adding the necessary annotations to the function.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      61884de0
    • Liu Bo's avatar
      Blk-throttle: reduce tail io latency when iops limit is enforced · 991f61fe
      Liu Bo authored
      When an application's iops has exceeded its cgroup's iops limit, surely it
      is throttled and kernel will set a timer for dispatching, thus IO latency
      includes the delay.
      
      However, the dispatch delay which is calculated by the limit and the
      elapsed jiffies is suboptimal.  As the dispatch delay is only calculated
      once the application's iops is (iops limit + 1), it doesn't need to wait
      any longer than the remaining time of the current slice.
      
      The difference can be proved by the following fio job and cgroup iops
      setting,
      -----
      $ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
      $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
      $ cat r2.job
      [global]
      name=fio-rand-read
      filename=/dev/nullb1
      rw=randread
      bs=4k
      direct=1
      numjobs=1
      time_based=1
      runtime=60
      group_reporting=1
      
      [file1]
      size=4G
      ioengine=libaio
      iodepth=1
      rate_iops=50000
      norandommap=1
      thinktime=4ms
      -----
      
      wo patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
          slat (usec): min=10, max=336, avg=27.71, stdev=17.82
          clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
           lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
          clat percentiles (usec):
           |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
           | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
           | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
           | 99.99th=[28967]
      
      w/ patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
          slat (usec): min=10, max=155, avg=23.24, stdev=16.79
          clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
           lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
          clat percentiles (usec):
           |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
           | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
           | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
           | 99.99th=[12387]
      Signed-off-by: default avatarLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      991f61fe
    • Gustavo A. R. Silva's avatar
      block: paride: pd: mark expected switch fall-throughs · 0a1c749d
      Gustavo A. R. Silva authored
      In preparation to enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
      
      Addresses-Coverity-ID: 1056543 ("Missing break in switch")
      Addresses-Coverity-ID: 1056544 ("Missing break in switch")
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0a1c749d
    • Bart Van Assche's avatar
      block: Ensure that a request queue is dissociated from the cgroup controller · 24ecc358
      Bart Van Assche authored
      Several block drivers call alloc_disk() followed by put_disk() if
      something fails before device_add_disk() is called without calling
      blk_cleanup_queue(). Make sure that also for this scenario a request
      queue is dissociated from the cgroup controller. This patch avoids
      that loading the parport_pc, paride and pf drivers triggers the
      following kernel crash:
      
      BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
      Read of size 4 at addr 0000000000000008 by task modprobe/744
      Call Trace:
      dump_stack+0x9a/0xeb
      kasan_report+0x139/0x350
      pi_init+0x42e/0x580 [paride]
      pf_init+0x2bb/0x1000 [pf]
      do_one_initcall+0x8e/0x405
      do_init_module+0xd9/0x2f2
      load_module+0x3ab4/0x4700
      SYSC_finit_module+0x176/0x1a0
      do_syscall_64+0xee/0x2b0
      entry_SYSCALL_64_after_hwframe+0x42/0xb7
      Reported-by: default avatarAlexandru Moise <00moses.alexander00@gmail.com>
      Fixes: a063057d ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Tested-by: default avatarAlexandru Moise <00moses.alexander00@gmail.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      24ecc358
    • Bart Van Assche's avatar
      block: Introduce blk_exit_queue() · 4cf6324b
      Bart Van Assche authored
      This patch does not change any functionality.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: default avatarJohannes Thumshirn <jthumshirn@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4cf6324b
    • Bart Van Assche's avatar
      blkcg: Introduce blkg_root_lookup() · 6bad9b21
      Bart Van Assche authored
      This new function will be used in a later patch to verify whether a
      queue has been dissociated from the cgroup controller before being
      released.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6bad9b21
    • Bart Van Assche's avatar
      block: Remove two superfluous #include directives · b1f4267c
      Bart Van Assche authored
      Commit 12f5b931 ("blk-mq: Remove generation seqeunce") removed the
      only seqcount_t and u64_stats_sync instances from <linux/blkdev.h> but
      did not remove the corresponding #include directives. Since these
      include directives are no longer needed, remove them.
      Signed-off-by: default avatarBart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>,
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b1f4267c
    • Jianchao Wang's avatar
      blk-mq: count the hctx as active before allocating tag · d263ed99
      Jianchao Wang authored
      Currently, we count the hctx as active after allocate driver tag
      successfully. If a previously inactive hctx try to get tag first
      time, it may fails and need to wait. However, due to the stale tag
      ->active_queues, the other shared-tags users are still able to
      occupy all driver tags while there is someone waiting for tag.
      Consequently, even if the previously inactive hctx is waked up, it
      still may not be able to get a tag and could be starved.
      
      To fix it, we count the hctx as active before try to allocate driver
      tag, then when it is waiting the tag, the other shared-tag users
      will reserve budget for it.
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d263ed99
    • Greg Edwards's avatar
      block: bvec_nr_vecs() returns value for wrong slab · d6c02a9b
      Greg Edwards authored
      In commit ed996a52 ("block: simplify and cleanup bvec pool
      handling"), the value of the slab index is incremented by one in
      bvec_alloc() after the allocation is done to indicate an index value of
      0 does not need to be later freed.
      
      bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
      value.  Decrement idx before performing the lookup.
      
      Fixes: ed996a52 ("block: simplify and cleanup bvec pool handling")
      Signed-off-by: default avatarGreg Edwards <gedwards@ddn.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d6c02a9b
    • Jens Axboe's avatar
      Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block · 4884f8bf
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "This should be the last round of NVMe updates before the 4.19 merge
       window opens.  It conatins support for write protected (aka read-only)
       namespaces from Chaitanya, two ANA fixes from Hannes and a fabrics
       fix from Tal Shorer."
      
      * 'nvme-4.19' of git://git.infradead.org/nvme:
        nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect forever
        nvmet: add ns write protect support
        nvme: set gendisk read only based on nsattr
        nvme.h: add support for ns write protect definitions
        nvme.h: fixup ANA group descriptor format
        nvme: fixup crash on failed discovery
      4884f8bf
    • Shenghui Wang's avatar
      bcache: trivial - remove tailing backslash in macro BTREE_FLAG · cbb751c0
      Shenghui Wang authored
      Remove the tailing backslash in macro BTREE_FLAG in btree.h
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      cbb751c0
    • Shenghui Wang's avatar
      bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section · e921efeb
      Shenghui Wang authored
      The pr_err statement in the code for sysfs_attatch section would run
      for various error codes, which maybe confusing.
      
      E.g,
      
      Run the command twice:
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
         [the backing dev got attached on the first run]
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
      
      In dmesg, after the command run twice, we can get:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be891
                     : cache set not found
      The first statement in the message was right, but the second was
      confusing.
      
      bch_cached_dev_attach has various pr_ statements for various error
      codes, except ENOENT.
      
      After the change, rerun above command twice:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      
      In dmesg we only got:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      No confusing "cache set not found" message anymore.
      
      And for some not exist SET-UUID:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
      			/sys/block/bcache0/bcache/attach
      In dmesg we can get:
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be898
      	               : cache set not found
      Signed-off-by: default avatarShenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e921efeb
    • Coly Li's avatar
      bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li authored
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device. It works well if there is only one bcache device attached
      to the cache set. If there are many bcache devices attached to a cache
      set, it may introduce performance regression because multiple faster
      writeback threads of the idle bcache devices will compete the btree level
      locks with the bcache device who have I/O requests coming.
      
      This patch fixes the above issue by only permitting fast writebac when
      all bcache devices attached on the cache set are idle. And if one of the
      bcache devices has new I/O request coming, minimized all writeback
      throughput immediately and let PI controller __update_writeback_rate()
      to decide the upcoming writeback rate for each bcache device.
      
      Also when all bcache devices are idle, limited wrieback rate to a small
      number is wast of thoughput, especially when backing devices are slower
      non-rotation devices (e.g. SATA SSD). This patch sets a max writeback
      rate for each backing device if the whole cache set is idle. A faster
      writeback rate in idle time means new I/Os may have more available space
      for dirty data, and people may observe a better write performance then.
      
      Please note bcache may change its cache mode in run time, and this patch
      still works if the cache mode is switched from writeback mode and there
      is still dirty data on cache.
      
      Fixes: Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Tested-by: default avatarKai Krakow <kai@kaishome.de>
      Tested-by: default avatarStefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      ea8c5356
    • Coly Li's avatar
      bcache: add code comments for bset.c · b467a6ac
      Coly Li authored
      This patch tries to add code comments in bset.c, to make some
      tricky code and designment to be more comprehensible. Most information
      of this patch comes from the discussion between Kent and I, he
      offers very informative details. If there is any mistake
      of the idea behind the code, no doubt that's from me misrepresentation.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b467a6ac