1. 11 Aug, 2018 8 commits
  2. 10 Aug, 2018 1 commit
    • bcache: fix error setting writeback_rate through sysfs interface · 46451874
      Coly Li authored
      Commit ea8c5356 ("bcache: set max writeback rate when I/O request
      is idle") changes struct bch_ratelimit member rate from uint32_t to
      atomic_long_t and uses atomic_long_set() in drivers/md/bcache/sysfs.c
      to set new writeback rate, after the input is converted from memory
      buf to long int by sysfs_strtoul_clamp().
      
      The above change has a problem: there is an implicit return inside
      sysfs_strtoul_clamp(), so the atomic_long_set() that follows it is
      never called. The 0day system detected this error with the following
      snipped smatch warning:
      
      drivers/md/bcache/sysfs.c:271 __cached_dev_store() error: uninitialized
      symbol 'v'.
      270  sysfs_strtoul_clamp(writeback_rate, v, 1, INT_MAX);
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      @271 atomic_long_set(&dc->writeback_rate.rate, v);
      
      This patch fixes the above error by using strtoul_safe_clamp() to
      convert the input buffer into a long int type result.
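
The pitfall described above can be reproduced outside the kernel. The following is a minimal userspace sketch, not the bcache code: `parse_clamped()`, `PARSE_AND_RETURN()`, `store_buggy()`, `store_fixed()` and the global `rate` are hypothetical names that only mimic the shape of a macro with a hidden return and of the corrected call sequence.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Stand-in for dc->writeback_rate.rate in this sketch. */
static long rate;

/* Hypothetical helper: parse a decimal string and clamp it to [min, max]. */
static int parse_clamped(const char *buf, long *out, long min, long max)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (end == buf || errno == ERANGE)
		return -EINVAL;
	if (v < min)
		v = min;
	if (v > max)
		v = max;
	*out = v;
	return 0;
}

/*
 * Hypothetical macro with a hidden return: control leaves the *calling*
 * function here, so nothing written after the macro ever runs.
 */
#define PARSE_AND_RETURN(buf, var, min, max)				\
	do {								\
		int _ret = parse_clamped((buf), &(var), (min), (max));	\
		return _ret ? _ret : 1;					\
	} while (0)

/* Buggy shape: the assignment to rate is unreachable. */
static int store_buggy(const char *buf)
{
	long v;

	PARSE_AND_RETURN(buf, v, 1, INT_MAX);
	rate = v;
	return 1;
}

/* Fixed shape: call the helper directly, then store the result. */
static int store_fixed(const char *buf)
{
	long v;
	int ret = parse_clamped(buf, &v, 1, INT_MAX);

	if (ret)
		return ret;
	rate = v;
	return 1;
}
```

In this sketch `store_buggy()` reports success but leaves `rate` untouched, while `store_fixed()` updates it; the positive return value on success is an arbitrary choice for the example.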
      
      Fixes: ea8c5356 ("bcache: set max writeback rate when I/O request is idle")
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Stefan Priebe <s.priebe@profihost.ag>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 09 Aug, 2018 20 commits
    • null_blk: add lock drop/acquire annotation · 61884de0
      Jens Axboe authored
      sparse complains:
      
      drivers/block/null_blk_main.c:816:24: sparse: context imbalance in 'null_insert_page' - unexpected unlock
      
      Fix it by adding the necessary annotations to the function.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Blk-throttle: reduce tail io latency when iops limit is enforced · 991f61fe
      Liu Bo authored
      When an application's iops exceeds its cgroup's iops limit, it is
      throttled and the kernel sets a timer for dispatching, so IO latency
      includes that delay.
      
      However, the dispatch delay, which is calculated from the limit and the
      elapsed jiffies, is suboptimal.  Because the dispatch delay is only
      calculated once the application's iops reaches (iops limit + 1), it doesn't
      need to wait any longer than the remaining time of the current slice.
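
To make "the remaining time of the current slice" concrete, here is a small arithmetic sketch. It is not the blk-throttle code: `wait_until_slice_end()` and its parameters are hypothetical, and all values are assumed to be in jiffies.

```c
/*
 * Hypothetical helper, not the kernel's code: given the start of the current
 * throttle slice, the slice length and the current time (all in jiffies),
 * return how long to wait until the slice ends.
 */
static unsigned long wait_until_slice_end(unsigned long slice_start,
					  unsigned long slice_len,
					  unsigned long now)
{
	unsigned long elapsed = now - slice_start;
	/* Round (elapsed + 1) up to the next multiple of slice_len. */
	unsigned long slice_end = ((elapsed + slice_len) / slice_len) * slice_len;

	return slice_end - elapsed;
}
```

With a slice length of 100 and 30 jiffies elapsed, the wait is 70, and it never exceeds one slice length, which matches the tighter maximum latency in the fio results that follow.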
      
      The difference can be demonstrated with the following fio job and cgroup
      iops setting:
      -----
      $ echo 4 > /mnt/config/nullb/disk1/mbps    # limit nullb's bandwidth to 4MB/s for testing.
      $ echo "253:1 riops=100 rbps=max" > /sys/fs/cgroup/unified/cg1/io.max
      $ cat r2.job
      [global]
      name=fio-rand-read
      filename=/dev/nullb1
      rw=randread
      bs=4k
      direct=1
      numjobs=1
      time_based=1
      runtime=60
      group_reporting=1
      
      [file1]
      size=4G
      ioengine=libaio
      iodepth=1
      rate_iops=50000
      norandommap=1
      thinktime=4ms
      -----
      
      w/o patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=99, BW=400KiB/s (410kB/s)(23.4MiB/60001msec)
          slat (usec): min=10, max=336, avg=27.71, stdev=17.82
          clat (usec): min=2, max=28887, avg=5929.81, stdev=7374.29
           lat (usec): min=24, max=28901, avg=5958.73, stdev=7366.22
          clat percentiles (usec):
           |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    4], 50.00th=[    6], 60.00th=[11731],
           | 70.00th=[11863], 80.00th=[11994], 90.00th=[12911], 95.00th=[22676],
           | 99.00th=[23725], 99.50th=[23987], 99.90th=[23987], 99.95th=[25035],
           | 99.99th=[28967]
      
      w/ patch:
      file1: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
      fio-3.7-66-gedfc
      Starting 1 process
      
         read: IOPS=100, BW=400KiB/s (410kB/s)(23.4MiB/60005msec)
          slat (usec): min=10, max=155, avg=23.24, stdev=16.79
          clat (usec): min=2, max=12393, avg=5961.58, stdev=5959.25
           lat (usec): min=23, max=12412, avg=5985.91, stdev=5951.92
          clat percentiles (usec):
           |  1.00th=[    3],  5.00th=[    3], 10.00th=[    4], 20.00th=[    4],
           | 30.00th=[    4], 40.00th=[    5], 50.00th=[   47], 60.00th=[11863],
           | 70.00th=[11994], 80.00th=[11994], 90.00th=[11994], 95.00th=[11994],
           | 99.00th=[11994], 99.50th=[11994], 99.90th=[12125], 99.95th=[12125],
           | 99.99th=[12387]
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: paride: pd: mark expected switch fall-throughs · 0a1c749d
      Gustavo A. R. Silva authored
      In preparation for enabling -Wimplicit-fallthrough, mark switch cases
      where we are expecting to fall through.
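
As an illustration of what such a marking looks like, here is a minimal sketch that is unrelated to the pd driver itself. The `enum step` values and `steps_remaining()` are hypothetical; the `/* fall through */` comment is the form that GCC's -Wimplicit-fallthrough accepts at its default level.

```c
/* Hypothetical state machine: each earlier step also runs the later ones. */
enum step { STEP_RESET, STEP_INIT, STEP_RUN };

static int steps_remaining(enum step s)
{
	int n = 0;

	switch (s) {
	case STEP_RESET:
		n++;
		/* fall through */
	case STEP_INIT:
		n++;
		/* fall through */
	case STEP_RUN:
		n++;
		break;
	}
	return n;
}
```

Without the comments, each case that runs into the next one would be reported once the warning is enabled; with them, the intent is documented and the compiler stays quiet.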
      
      Addresses-Coverity-ID: 1056543 ("Missing break in switch")
      Addresses-Coverity-ID: 1056544 ("Missing break in switch")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Ensure that a request queue is dissociated from the cgroup controller · 24ecc358
      Bart Van Assche authored
      Several block drivers call alloc_disk() followed by put_disk(), without
      calling blk_cleanup_queue(), if something fails before device_add_disk()
      is called. Make sure that a request queue is dissociated from the cgroup
      controller in this scenario too. This patch prevents the following
      kernel crash, which is triggered by loading the parport_pc, paride and
      pf drivers:
      
      BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
      Read of size 4 at addr 0000000000000008 by task modprobe/744
      Call Trace:
      dump_stack+0x9a/0xeb
      kasan_report+0x139/0x350
      pi_init+0x42e/0x580 [paride]
      pf_init+0x2bb/0x1000 [pf]
      do_one_initcall+0x8e/0x405
      do_init_module+0xd9/0x2f2
      load_module+0x3ab4/0x4700
      SYSC_finit_module+0x176/0x1a0
      do_syscall_64+0xee/0x2b0
      entry_SYSCALL_64_after_hwframe+0x42/0xb7
      Reported-by: Alexandru Moise <00moses.alexander00@gmail.com>
      Fixes: a063057d ("block: Fix a race between request queue removal and the block cgroup controller") # v4.17
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Tested-by: Alexandru Moise <00moses.alexander00@gmail.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Introduce blk_exit_queue() · 4cf6324b
      Bart Van Assche authored
      This patch does not change any functionality.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: Introduce blkg_root_lookup() · 6bad9b21
      Bart Van Assche authored
      This new function will be used in a later patch to verify whether a
      queue has been dissociated from the cgroup controller before being
      released.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Alexandru Moise <00moses.alexander00@gmail.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Remove two superfluous #include directives · b1f4267c
      Bart Van Assche authored
      Commit 12f5b931 ("blk-mq: Remove generation seqeunce") removed the
      only seqcount_t and u64_stats_sync instances from <linux/blkdev.h> but
      did not remove the corresponding #include directives. Since these
      include directives are no longer needed, remove them.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: count the hctx as active before allocating tag · d263ed99
      Jianchao Wang authored
      Currently, we count the hctx as active only after a driver tag has been
      allocated successfully. If a previously inactive hctx tries to get a tag
      for the first time, it may fail and need to wait. However, due to the
      stale tag ->active_queues, the other shared-tag users are still able to
      occupy all driver tags while someone is waiting for a tag.
      Consequently, even if the previously inactive hctx is woken up, it
      still may not be able to get a tag and could be starved.
      
      To fix it, we count the hctx as active before trying to allocate a driver
      tag; then, while it is waiting for the tag, the other shared-tag users
      will reserve budget for it.
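
The effect of counting the waiter early can be sketched with a simple fair-share rule. This is an assumption made for illustration, not the blk-mq implementation: `fair_share()` and `may_take_tag()` are hypothetical helpers that split the tag depth evenly among the users currently counted as active.

```c
#include <stdbool.h>

/* Hypothetical rule: each active user may hold depth / users tags, rounded up. */
static unsigned int fair_share(unsigned int depth, unsigned int active_users)
{
	if (active_users == 0)
		return depth;
	return (depth + active_users - 1) / active_users;
}

/* A user may take another tag only while it is below its fair share. */
static bool may_take_tag(unsigned int in_flight, unsigned int depth,
			 unsigned int active_users)
{
	return in_flight < fair_share(depth, active_users);
}
```

With a depth of 32 and a stale count of one active user, that user may hold all 32 tags. Once the waiting hctx is counted as well, the busy user is limited to 16, which leaves tags for the newcomer.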
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: bvec_nr_vecs() returns value for wrong slab · d6c02a9b
      Greg Edwards authored
      In commit ed996a52 ("block: simplify and cleanup bvec pool
      handling"), the value of the slab index is incremented by one in
      bvec_alloc() after the allocation is done to indicate an index value of
      0 does not need to be later freed.
      
      bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
      value.  Decrement idx before performing the lookup.
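
The off-by-one can be shown with a small standalone sketch. It is not the kernel's bvec code: the slab sizes in `slab_nr_vecs[]` and the helper names are assumptions chosen for the example; only the "index stored as slab index plus one" convention comes from the description above.

```c
#include <stddef.h>

/* Illustrative slab sizes (entries per slab); assumed for this sketch. */
static const unsigned int slab_nr_vecs[] = { 1, 4, 16, 64, 128, 256 };
#define NR_SLABS (sizeof(slab_nr_vecs) / sizeof(slab_nr_vecs[0]))

/*
 * Pick the smallest slab that fits nr entries and return its index plus one,
 * so that a stored value of 0 can mean "nothing to free later".
 */
static unsigned int pick_slab_idx(unsigned int nr)
{
	size_t i;

	for (i = 0; i < NR_SLABS; i++)
		if (nr <= slab_nr_vecs[i])
			return (unsigned int)i + 1;
	return 0;
}

/* Lookup that forgets the +1 offset: reports the next larger slab. */
static unsigned int nr_vecs_without_decrement(unsigned int idx)
{
	return idx < NR_SLABS ? slab_nr_vecs[idx] : 0;
}

/* Lookup that undoes the +1 offset before indexing the table. */
static unsigned int nr_vecs_with_decrement(unsigned int idx)
{
	if (idx == 0 || idx > NR_SLABS)
		return 0;
	return slab_nr_vecs[idx - 1];
}
```

For a request of 3 entries the stored index is 2; the lookup without the decrement reports 16 entries, while the corrected lookup reports the 4-entry slab that was actually chosen.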
      
      Fixes: ed996a52 ("block: simplify and cleanup bvec pool handling")
      Signed-off-by: Greg Edwards <gedwards@ddn.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Merge branch 'nvme-4.19' of git://git.infradead.org/nvme into for-4.19/block · 4884f8bf
      Jens Axboe authored
      Pull NVMe updates from Christoph:
      
      "This should be the last round of NVMe updates before the 4.19 merge
       window opens.  It contains support for write protected (aka read-only)
       namespaces from Chaitanya, two ANA fixes from Hannes and a fabrics
       fix from Tal Shorer."
      
      * 'nvme-4.19' of git://git.infradead.org/nvme:
        nvme-fabrics: fix ctrl_loss_tmo < 0 to reconnect forever
        nvmet: add ns write protect support
        nvme: set gendisk read only based on nsattr
        nvme.h: add support for ns write protect definitions
        nvme.h: fixup ANA group descriptor format
        nvme: fixup crash on failed discovery
    • bcache: trivial - remove tailing backslash in macro BTREE_FLAG · cbb751c0
      Shenghui Wang authored
      Remove the trailing backslash in macro BTREE_FLAG in btree.h
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section · e921efeb
      Shenghui Wang authored
      The pr_err statement in the code for the sysfs_attach section would run
      for various error codes, which may be confusing.
      
      E.g.,
      
      Run the command twice:
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
         [the backing dev got attached on the first run]
         echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      				/sys/block/bcache0/bcache/attach
      
      In dmesg, after the command run twice, we can get:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be891
                     : cache set not found
      The first statement in the message was right, but the second was
      confusing.
      
      bch_cached_dev_attach has various pr_ statements for various error
      codes, except ENOENT.
      
      After the change, rerun above command twice:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be891 > \
      			/sys/block/bcache0/bcache/attach
      
      In dmesg we only got:
      	bcache: bch_cached_dev_attach() Can't attach sda6: already attached
      No confusing "cache set not found" message anymore.
      
      And for a nonexistent SET-UUID:
      	echo 796b5c05-b03c-4bc7-9cbd-a8df5e8be898 > \
      			/sys/block/bcache0/bcache/attach
      In dmesg we can get:
      	bcache: __cached_dev_store() Can't attach 796b5c05-b03c-4bc7-9cbd-\
      a8df5e8be898
      	               : cache set not found
      Signed-off-by: Shenghui Wang <shhuiw@foxmail.com>
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: set max writeback rate when I/O request is idle · ea8c5356
      Coly Li authored
      Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      allows the writeback rate to be faster if there is no I/O request on a
      bcache device. It works well if there is only one bcache device attached
      to the cache set. If there are many bcache devices attached to a cache
      set, it may introduce a performance regression because multiple faster
      writeback threads of the idle bcache devices will compete for the btree
      level locks with the bcache device that has I/O requests coming.
      
      This patch fixes the above issue by only permitting fast writeback when
      all bcache devices attached to the cache set are idle. If one of the
      bcache devices has a new I/O request coming, it minimizes all writeback
      throughput immediately and lets the PI controller
      __update_writeback_rate() decide the upcoming writeback rate for each
      bcache device.
      
      Also, when all bcache devices are idle, limiting the writeback rate to a
      small number is a waste of throughput, especially when the backing devices
      are slower non-rotational devices (e.g. SATA SSD). This patch sets a max
      writeback rate for each backing device if the whole cache set is idle. A
      faster writeback rate in idle time means new I/Os may have more available
      space for dirty data, and people may observe better write performance then.
      
      Please note bcache may change its cache mode in run time, and this patch
      still works if the cache mode is switched from writeback mode and there
      is still dirty data on cache.
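
The policy described above (fast writeback only while the whole cache set is idle, back to the PI controller's rate as soon as any I/O arrives) can be sketched as a toy model. None of this is the bcache code: `struct toy_cache_set`, the helper names and the idle threshold of 3 rounds per device are all assumptions made for illustration.

```c
#include <stdbool.h>

/* Toy model of a cache set shared by several backing devices. */
struct toy_cache_set {
	unsigned int nr_devices;	/* devices attached to the set */
	unsigned int idle_rounds;	/* rate updates seen since the last I/O */
};

/* Any I/O on any attached device ends the idle period for the whole set. */
static void toy_note_io(struct toy_cache_set *c)
{
	c->idle_rounds = 0;
}

/* The set counts as idle once enough rate updates passed without I/O. */
static bool toy_set_is_idle(struct toy_cache_set *c, unsigned int rounds_per_dev)
{
	if (c->idle_rounds < c->nr_devices * rounds_per_dev) {
		c->idle_rounds++;
		return false;
	}
	return true;
}

/* Use the maximum rate only while the whole set is idle. */
static long toy_pick_rate(struct toy_cache_set *c, long pi_rate, long max_rate)
{
	return toy_set_is_idle(c, 3) ? max_rate : pi_rate;
}
```

Scaling the threshold by the number of devices means that a single idle device cannot switch the set to the maximum rate on its own, which is the point of the change.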
      
      Fixes: Commit b1092c9a ("bcache: allow quick writeback when backing idle")
      Cc: stable@vger.kernel.org #4.16+
      Signed-off-by: Coly Li <colyli@suse.de>
      Tested-by: Kai Krakow <kai@kaishome.de>
      Tested-by: Stefan Priebe <s.priebe@profihost.ag>
      Cc: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add code comments for bset.c · b467a6ac
      Coly Li authored
      This patch adds code comments to bset.c to make some tricky code and
      design more comprehensible. Most of the information in this patch
      comes from the discussion between Kent and me; he offered very
      informative details. If there is any mistake in the idea behind the
      code, no doubt it comes from my misrepresentation.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix mistaken comments in request.c · 0cba2e71
      Coly Li authored
      This patch updates the code comment in bch_keylist_realloc() by fixing
      incorrect function names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix mistaken code comments in bcache.h · cb329dec
      Coly Li authored
      This patch updates the code comment in struct cache with correct array
      names, to make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: add a comment in super.c · e57fd746
      Coly Li authored
      This patch adds a line of code comment in super.c:register_bdev(), to
      make the code more comprehensible.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: avoid unncessary cache prefetch bch_btree_node_get() · c2e8dcf7
      Coly Li authored
      In bch_btree_node_get() the read-in btree node will be partially
      prefetched into L1 cache for the following bset iteration (if there is
      one). But if the btree node read fails, the prefetch operations will
      waste L1 cache space. This patch checks whether the read operation
      succeeded and only does the cache prefetch when the read I/O succeeded.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: display rate debug parameters to 0 when writeback is not running · b4cb6efc
      Coly Li authored
      When writeback is not running, the writeback rate should be 0; any other
      value is misleading. The following dynamic writeback rate debug parameters
      should be 0 too,
      	rate, proportional, integral, change
      otherwise they are misleading when writeback is not running.
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: do not check return value of debugfs_create_dir() · 78ac2107
      Coly Li authored
      Greg KH suggests that normal code should not care about debugfs. Therefore,
      whether debugfs_create_dir() succeeds or fails, it is unnecessary to
      check its return value.
      
      Two functions call debugfs_create_dir() and check the return value:
      bch_debug_init() and closure_debug_init(). This patch changes these two
      functions from int to void type and ignores the return values of
      debugfs_create_dir().
      
      This patch does not fix an exact bug; it just makes things work as they should.
      Signed-off-by: Coly Li <colyli@suse.de>
      Suggested-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Cc: Kai Krakow <kai@kaishome.de>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 08 Aug, 2018 8 commits
  5. 07 Aug, 2018 3 commits
    • cfq: Suppress compiler warnings about comparisons · f7ecb1b1
      Bart Van Assche authored
      This patch does not change any functionality but keeps gcc from
      reporting the following warnings when building with W=1:
      
      block/cfq-iosched.c: In function ‘cfq_back_seek_max_store’:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4756:1: note: in expansion of macro ‘STORE_FUNCTION’
       STORE_FUNCTION(cfq_back_seek_max_store, &cfqd->cfq_back_max, 0, UINT_MAX, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ‘cfq_slice_idle_store’:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4759:1: note: in expansion of macro ‘STORE_FUNCTION’
       STORE_FUNCTION(cfq_slice_idle_store, &cfqd->cfq_slice_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ‘cfq_group_idle_store’:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4760:1: note: in expansion of macro ‘STORE_FUNCTION’
       STORE_FUNCTION(cfq_group_idle_store, &cfqd->cfq_group_idle, 0, UINT_MAX, 1);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ‘cfq_low_latency_store’:
      block/cfq-iosched.c:4741:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4765:1: note: in expansion of macro ‘STORE_FUNCTION’
       STORE_FUNCTION(cfq_low_latency_store, &cfqd->cfq_latency, 0, 1, 0);
       ^~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ‘cfq_slice_idle_us_store’:
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4782:1: note: in expansion of macro ‘USEC_STORE_FUNCTION’
       USEC_STORE_FUNCTION(cfq_slice_idle_us_store, &cfqd->cfq_slice_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
      block/cfq-iosched.c: In function ‘cfq_group_idle_us_store’:
      block/cfq-iosched.c:4775:13: warning: comparison of unsigned expression < 0 is always false [-Wtype-limits]
        if (__data < (MIN))      \
                   ^
      block/cfq-iosched.c:4783:1: note: in expansion of macro ‘USEC_STORE_FUNCTION’
       USEC_STORE_FUNCTION(cfq_group_idle_us_store, &cfqd->cfq_group_idle, 0, UINT_MAX);
       ^~~~~~~~~~~~~~~~~~~
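
The warnings arise because the macro compares an unsigned value against a literal 0 lower bound. One common way to avoid them, shown here as a sketch and not necessarily what this patch does, is to copy the bounds into local variables first so the compiler no longer sees a comparison against a constant. `DEFINE_CLAMP()` and the two generated functions are hypothetical names.

```c
#include <limits.h>

/*
 * Hypothetical clamp generator: the bounds are copied into unsigned locals,
 * so "data < min" is no longer a comparison against the literal 0.
 */
#define DEFINE_CLAMP(name, MIN, MAX)					\
static unsigned int name(unsigned int data)				\
{									\
	unsigned int min = (MIN), max = (MAX);				\
									\
	if (data < min)							\
		data = min;						\
	else if (data > max)						\
		data = max;						\
	return data;							\
}

DEFINE_CLAMP(clamp_low_latency, 0, 1)
DEFINE_CLAMP(clamp_slice_idle, 0, UINT_MAX)
```

The behaviour is unchanged: values are still clamped to the given range, for example `clamp_low_latency(5)` yields 1.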
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • cfq: Annotate fall-through in a switch statement · 9b4f4346
      Bart Van Assche authored
      This patch keeps gcc from complaining about fall-through when building
      with W=1.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait · 2887e41b
      Anchal Agarwal authored
      I am currently running a large bare metal instance (i3.metal)
      on EC2 with 72 cores, 512GB of RAM and NVME drives, with a
      4.18 kernel. I have a workload that simulates a database
      workload and I am running into lockup issues when writeback
      throttling is enabled, with the hung task detector also
      kicking in.
      
      Crash dumps show that most CPUs (up to 50 of them) are
      all trying to get the wbt wait queue lock while trying to add
      themselves to it in __wbt_wait (see stack traces below).
      
      [    0.948118] CPU: 45 PID: 0 Comm: swapper/45 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.948119] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.948120] task: ffff883f7878c000 task.stack: ffffc9000c69c000
      [    0.948124] RIP: 0010:native_queued_spin_lock_slowpath+0xf8/0x1a0
      [    0.948125] RSP: 0018:ffff883f7fcc3dc8 EFLAGS: 00000046
      [    0.948126] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7fce2a00
      [    0.948128] RDX: 000000000000001c RSI: 0000000000740001 RDI: ffff887f7709ca68
      [    0.948129] RBP: 0000000000000002 R08: 0000000000b80000 R09: 0000000000000000
      [    0.948130] R10: ffff883f7fcc3d78 R11: 000000000de27121 R12: 0000000000000002
      [    0.948131] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000
      [    0.948132] FS:  0000000000000000(0000) GS:ffff883f7fcc0000(0000) knlGS:0000000000000000
      [    0.948134] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.948135] CR2: 000000c424c77000 CR3: 0000000002010005 CR4: 00000000003606e0
      [    0.948136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.948137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.948138] Call Trace:
      [    0.948139]  <IRQ>
      [    0.948142]  do_raw_spin_lock+0xad/0xc0
      [    0.948145]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.948149]  ? __wake_up_common_lock+0x53/0x90
      [    0.948150]  __wake_up_common_lock+0x53/0x90
      [    0.948155]  wbt_done+0x7b/0xa0
      [    0.948158]  blk_mq_free_request+0xb7/0x110
      [    0.948161]  __blk_mq_complete_request+0xcb/0x140
      [    0.948166]  nvme_process_cq+0xce/0x1a0 [nvme]
      [    0.948169]  nvme_irq+0x23/0x50 [nvme]
      [    0.948173]  __handle_irq_event_percpu+0x46/0x300
      [    0.948176]  handle_irq_event_percpu+0x20/0x50
      [    0.948179]  handle_irq_event+0x34/0x60
      [    0.948181]  handle_edge_irq+0x77/0x190
      [    0.948185]  handle_irq+0xaf/0x120
      [    0.948188]  do_IRQ+0x53/0x110
      [    0.948191]  common_interrupt+0x87/0x87
      [    0.948192]  </IRQ>
      ....
      [    0.311136] CPU: 4 PID: 9737 Comm: run_linux_amd64 Not tainted 4.14.51-62.38.amzn1.x86_64 #1
      [    0.311137] Hardware name: Amazon EC2 i3.metal/Not Specified, BIOS 1.0 10/16/2017
      [    0.311138] task: ffff883f6e6a8000 task.stack: ffffc9000f1ec000
      [    0.311141] RIP: 0010:native_queued_spin_lock_slowpath+0xf5/0x1a0
      [    0.311142] RSP: 0018:ffffc9000f1efa28 EFLAGS: 00000046
      [    0.311144] RAX: 0000000000000000 RBX: ffff887f7709ca68 RCX: ffff883f7f722a00
      [    0.311145] RDX: 0000000000000035 RSI: 0000000000d80001 RDI: ffff887f7709ca68
      [    0.311146] RBP: 0000000000000202 R08: 0000000000140000 R09: 0000000000000000
      [    0.311147] R10: ffffc9000f1ef9d8 R11: 000000001a249fa0 R12: ffff887f7709ca68
      [    0.311148] R13: ffffc9000f1efad0 R14: 0000000000000000 R15: ffff887f7709ca00
      [    0.311149] FS:  000000c423f30090(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
      [    0.311150] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    0.311151] CR2: 00007feefcea4000 CR3: 0000007f7016e001 CR4: 00000000003606e0
      [    0.311152] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    0.311153] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [    0.311154] Call Trace:
      [    0.311157]  do_raw_spin_lock+0xad/0xc0
      [    0.311160]  _raw_spin_lock_irqsave+0x44/0x4b
      [    0.311162]  ? prepare_to_wait_exclusive+0x28/0xb0
      [    0.311164]  prepare_to_wait_exclusive+0x28/0xb0
      [    0.311167]  wbt_wait+0x127/0x330
      [    0.311169]  ? finish_wait+0x80/0x80
      [    0.311172]  ? generic_make_request+0xda/0x3b0
      [    0.311174]  blk_mq_make_request+0xd6/0x7b0
      [    0.311176]  ? blk_queue_enter+0x24/0x260
      [    0.311178]  ? generic_make_request+0xda/0x3b0
      [    0.311181]  generic_make_request+0x10c/0x3b0
      [    0.311183]  ? submit_bio+0x5c/0x110
      [    0.311185]  submit_bio+0x5c/0x110
      [    0.311197]  ? __ext4_journal_stop+0x36/0xa0 [ext4]
      [    0.311210]  ext4_io_submit+0x48/0x60 [ext4]
      [    0.311222]  ext4_writepages+0x810/0x11f0 [ext4]
      [    0.311229]  ? do_writepages+0x3c/0xd0
      [    0.311239]  ? ext4_mark_inode_dirty+0x260/0x260 [ext4]
      [    0.311240]  do_writepages+0x3c/0xd0
      [    0.311243]  ? _raw_spin_unlock+0x24/0x30
      [    0.311245]  ? wbc_attach_and_unlock_inode+0x165/0x280
      [    0.311248]  ? __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311250]  __filemap_fdatawrite_range+0xa3/0xe0
      [    0.311253]  file_write_and_wait_range+0x34/0x90
      [    0.311264]  ext4_sync_file+0x151/0x500 [ext4]
      [    0.311267]  do_fsync+0x38/0x60
      [    0.311270]  SyS_fsync+0xc/0x10
      [    0.311272]  do_syscall_64+0x6f/0x170
      [    0.311274]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      In the original patch, wbt_done is waking up all the exclusive
      processes in the wait queue, which can cause a thundering herd
      if there is a large number of writer threads in the queue. The
      original intention of the code seems to be to wake up one thread
      only; however, it uses wake_up_all() in __wbt_done(), and then
      uses the following check in __wbt_wait to have only one thread
      actually get out of the wait loop:
      
      if (waitqueue_active(&rqw->wait) &&
                  rqw->wait.head.next != &wait->entry)
                      return false;
      
      The problem with this is that the wait entry in wbt_wait is
      defined with DEFINE_WAIT, which uses the autoremove wakeup function.
      That means that the above check is invalid - the wait entry will
      have been removed from the queue already by the time we hit the
      check in the loop.
      
      Secondly, auto-removing the wait entries also means that the wait
      queue essentially gets reordered "randomly" (e.g. threads re-add
      themselves in the order they got to run after being woken up).
      Additionally, new requests entering wbt_wait might overtake requests
      that were queued earlier, because the wait queue will be
      (temporarily) empty after the wake_up_all, so the waitqueue_active
      check will not stop them. This can cause certain threads to starve
      under high load.
      
      The fix is to leave the woken up requests in the queue and remove
      them in finish_wait() once the current thread breaks out of the
      wait loop in __wbt_wait. This will ensure new requests always
      end up at the back of the queue, and they won't overtake requests
      that are already in the wait queue. With that change, the loop
      in wbt_wait is also in line with many other wait loops in the kernel.
      Waking up just one thread drastically reduces lock contention, as
      does moving the wait queue add/remove out of the loop.
      
      A significant drop in lockdep's lock contention numbers is seen when
      running the test application on the patched kernel.
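
The ordering argument can be illustrated with a single-threaded toy model of the wait queue. This is not the kernel wait-queue API: `struct toy_waitq` and the `toy_*()` helpers are hypothetical, and `toy_may_proceed()` only mirrors the shape of the check quoted above (a thread may go only if no one is queued ahead of it).

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Toy FIFO wait queue holding waiter ids; ids[0] is the head. */
struct toy_waitq {
	int ids[MAX_WAITERS];
	size_t len;
};

static void toy_add_tail(struct toy_waitq *q, int id)
{
	if (q->len < MAX_WAITERS)
		q->ids[q->len++] = id;
}

/* Remove one waiter, as a thread would do itself once it stops waiting. */
static void toy_remove(struct toy_waitq *q, int id)
{
	size_t i, j;

	for (i = 0; i < q->len; i++) {
		if (q->ids[i] != id)
			continue;
		for (j = i + 1; j < q->len; j++)
			q->ids[j - 1] = q->ids[j];
		q->len--;
		return;
	}
}

/* Wake-all with auto-removal: every entry leaves the queue on wakeup. */
static void toy_wake_all_autoremove(struct toy_waitq *q)
{
	q->len = 0;
}

/* Same shape as the quoted check: only the head may go if others are queued. */
static bool toy_may_proceed(const struct toy_waitq *q, int id)
{
	if (q->len > 0 && q->ids[0] != id)
		return false;
	return true;
}
```

In this model, after `toy_wake_all_autoremove()` the queue is empty, so a newcomer passes the check and overtakes the earlier waiters. When entries stay queued until each thread calls `toy_remove()` on itself, the newcomer is held back and the waiters proceed in FIFO order.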
      Signed-off-by: Anchal Agarwal <anchalag@amazon.com>
      Signed-off-by: Frank van der Linden <fllinden@amazon.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>