1. 06 Sep, 2017 13 commits
    • Omar Sandoval's avatar
      loop: set physical block size to logical block size · bf093753
      Omar Sandoval authored
      Commit 6c6b6f28 ("loop: set physical block size to PAGE_SIZE")
      caused mkfs.xfs to barf on ppc64 [1]. Always using PAGE_SIZE as the
      physical block size still makes the most sense semantically, but let's
      just lie and always set it to the same value as the logical block size
      (same goes for io_min). In the future we might want to at least bump up
      io_min to PAGE_SIZE but I'm sick of these stupid changes so let's play
      it safe.
      
      1: https://marc.info/?l=linux-xfs&m=150459024723753&w=2Tested-by: default avatarChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      bf093753
    • Michael Lyle's avatar
      bcache: fix bch_hprint crash and improve output · 9276717b
      Michael Lyle authored
      Most importantly, solve a crash where %llu was used to format signed
      numbers.  This would cause a buffer overflow when reading sysfs
      writeback_rate_debug, as only 20 bytes were allocated for this and
      %llu writes 20 characters plus a null.
      
      Always use the units mechanism rather than having different output
      paths for simplicity.
      
      Also, correct problems with display output where 1.10 was a larger
      number than 1.09, by multiplying by 10 and then dividing by 1024 instead
      of dividing by 100.  (Remainders of >= 1000 would print as .10).
      
      Minor changes: Always display the decimal point instead of trying to
      omit it based on number of digits shown.  Decide what units to use
      based on 1000 as a threshold, not 1024 (in other words, always print
      at most 3 digits before the decimal point).
      Signed-off-by: default avatarMichael Lyle <mlyle@lyle.org>
      Reported-by: default avatarDmitry Yu Okunev <dyokunev@ut.mephi.ru>
      Acked-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9276717b
    • Dan Carpenter's avatar
      bcache: Update continue_at() documentation · 7b6a8570
      Dan Carpenter authored
      continue_at() doesn't have a return statement anymore.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      7b6a8570
    • Dan Carpenter's avatar
      bcache: silence static checker warning · da22f0ee
      Dan Carpenter authored
      In olden times, closure_return() used to have a hidden return built in.
      We removed the hidden return but forgot to add a new return here.  If
      "c" were NULL we would oops on the next line, but fortunately "c" is
      never NULL.  Let's just remove the if statement.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      da22f0ee
    • Tang Junhui's avatar
      bcache: fix for gc and write-back race · 9baf3097
      Tang Junhui authored
      gc and write-back get raced (see the email "bcache get stucked" I sended
      before):
      gc thread                               write-back thread
      |                                       |bch_writeback_thread()
      |bch_gc_thread()                        |
      |                                       |==>read_dirty()
      |==>bch_btree_gc()                      |
      |==>btree_root() //get btree root       |
      |                //node write locker    |
      |==>bch_btree_gc_root()                 |
      |                                       |==>read_dirty_submit()
      |                                       |==>write_dirty()
      |                                       |==>continue_at(cl,
      |                                       |               write_dirty_finish,
      |                                       |               system_wq);
      |                                       |==>write_dirty_finish()//excute
      |                                       |               //in system_wq
      |                                       |==>bch_btree_insert()
      |                                       |==>bch_btree_map_leaf_nodes()
      |                                       |==>__bch_btree_map_nodes()
      |                                       |==>btree_root //try to get btree
      |                                       |              //root node read
      |                                       |              //lock
      |                                       |-----stuck here
      |==>bch_btree_set_root()
      |==>bch_journal_meta()
      |==>bch_journal()
      |==>journal_try_write()
      |==>journal_write_unlocked() //journal_full(&c->journal)
      |                            //condition satisfied
      |==>continue_at(cl, journal_write, system_wq); //try to excute
      |                               //journal_write in system_wq
      |                               //but work queue is excuting
      |                               //write_dirty_finish()
      |==>closure_sync(); //wait journal_write execute
      |                   //over and wake up gc,
      |-------------stuck here
      |==>release root node write locker
      
      This patch alloc a separate work-queue for write-back thread to avoid such
      race.
      
      (Commit log re-organized by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9baf3097
    • Tang Junhui's avatar
      bcache: increase the number of open buckets · 89b1fc54
      Tang Junhui authored
      In currently, we only alloc 6 open buckets for each cache set,
      but in usually, we always attach about 10 or so backend devices for
      each cache set, and the each bcache device are always accessed by
      about 10 or so threads in top application layer. So 6 open buckets
      are too few, It has led to that each of the same thread write data
      to different buckets, which would cause low efficiency write-back,
      and also cause buckets inefficient, and would be Very easy to run
      out of.
      
      I add debug message in bch_open_buckets_alloc() to print alloc bucket
      info, and test with ten bcache devices with a cache set, and each
      bcache device is accessed by ten threads.
      
      From the debug message, we can see that, after the modification, One
      bucket is more likely to assign to the same thread, and the data from
      the same thread are more likely to write the same bucket. Usually the
      same thread always write/read the same backend device, so it is good
      for write-back and also promote the usage efficiency of buckets.
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      89b1fc54
    • Tony Asleson's avatar
      bcache: Correct return value for sysfs attach errors · 77fa100f
      Tony Asleson authored
      If you encounter any errors in bch_cached_dev_attach it will return
      a negative error code.  The variable 'v' which stores the result is
      unsigned, thus user space sees a very large value returned for bytes
      written which can cause incorrect user space behavior.  Utilize 1
      signed variable to use throughout the function to preserve error return
      capability.
      Signed-off-by: default avatarTony Asleson <tasleson@redhat.com>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      77fa100f
    • Tang Junhui's avatar
      bcache: correct cache_dirty_target in __update_writeback_rate() · a8394090
      Tang Junhui authored
      __update_write_rate() uses a Proportion-Differentiation Controller
      algorithm to control writeback rate. A dirty target number is used in
      this PD controller to control writeback rate. A larger target number
      will make the writeback rate smaller, on the versus, a smaller target
      number will make the writeback rate larger.
      
      bcache uses the following steps to calculate the target number,
      1) cache_sectors = all-buckets-of-cache-set * buckets-size
      2) cache_dirty_target = cache_sectors * cached-device-writeback_percent
      3) target = cache_dirty_target *
      (sectors-of-cached-device/sectors-of-all-cached-devices-of-this-cache-set)
      
      The calculation at step 1) for cache_sectors is incorrect, which does
      not consider dirty blocks occupied by flash only volume.
      
      A flash only volume can be took as a bcache device without cached
      device. All data sectors allocated for it are persistent on cache device
      and marked dirty, they are not touched by bcache writeback and garbage
      collection code. So data blocks of flash only volume should be ignore
      when calculating cache_sectors of cache set.
      
      Current code does not subtract dirty sectors of flash only volume, which
      results a larger target number from the above 3 steps. And in sequence
      the cache device's writeback rate is smaller then a correct value,
      writeback speed is slower on all cached devices.
      
      This patch fixes the incorrect slower writeback rate by subtracting
      dirty sectors of flash only volumes in __update_writeback_rate().
      
      (Commit log composed by Coly Li to pass checkpatch.pl checking)
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a8394090
    • Tang Junhui's avatar
      bcache: gc does not work when triggering by manual command · 0b43f49d
      Tang Junhui authored
      I try to execute the following command to trigger gc thread:
      [root@localhost internal]# echo 1 > trigger_gc
      But it does not work, I debug the code in gc_should_run(), It works only
      if in invalidating or sectors_to_gc < 0. So set sectors_to_gc to -1 to
      meet the condition when we trigger gc by manual command.
      
      (Code comments aded by Coly Li)
      Signed-off-by: default avatarTang Junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      0b43f49d
    • Byungchul Park's avatar
      bcache: Don't reinvent the wheel but use existing llist API · 09b3efec
      Byungchul Park authored
      Although llist provides proper APIs, they are not used. Make them used.
      Signed-off-by: default avatarByungchul Park <byungchul.park@lge.com>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      09b3efec
    • Tang Junhui's avatar
      bcache: do not subtract sectors_to_gc for bypassed IO · 69daf03a
      Tang Junhui authored
      Since bypassed IOs use no bucket, so do not subtract sectors_to_gc to
      trigger gc thread.
      Signed-off-by: default avatartang.junhui <tang.junhui@zte.com.cn>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarEric Wheeler <bcache@linux.ewheeler.net>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      69daf03a
    • Tang Junhui's avatar
      bcache: fix sequential large write IO bypass · c81ffa32
      Tang Junhui authored
      Sequential write IOs were tested with bs=1M by FIO in writeback cache
      mode, these IOs were expected to be bypassed, but actually they did not.
      We debug the code, and find in check_should_bypass():
          if (!congested &&
              mode == CACHE_MODE_WRITEBACK &&
              op_is_write(bio_op(bio)) &&
              (bio->bi_opf & REQ_SYNC))
              goto rescale
      that means, If in writeback mode, a write IO with REQ_SYNC flag will not
      be bypassed though it is a sequential large IO, It's not a correct thing
      to do actually, so this patch remove these codes.
      Signed-off-by: default avatartang.junhui <tang.junhui@zte.com.cn>
      Reviewed-by: default avatarKent Overstreet <kent.overstreet@gmail.com>
      Reviewed-by: default avatarEric Wheeler <bcache@linux.ewheeler.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c81ffa32
    • Jan Kara's avatar
      bcache: Fix leak of bdev reference · 4b758df2
      Jan Kara authored
      If blkdev_get_by_path() in register_bcache() fails, we try to lookup the
      block device using lookup_bdev() to detect which situation we are in to
      properly report error. However we never drop the reference returned to
      us from lookup_bdev(). Fix that.
      Signed-off-by: default avatarJan Kara <jack@suse.cz>
      Acked-by: default avatarColy Li <colyli@suse.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      4b758df2
  2. 01 Sep, 2017 11 commits
  3. 31 Aug, 2017 9 commits
    • Paolo Valente's avatar
      doc, block, bfq: better describe how to properly configure bfq · 2670cd16
      Paolo Valente authored
      Many users have reported the lack of an HOWTO for properly configuring
      bfq as a function of the goal one wants to achieve (max
      responsiveness, max throughput, ...). In fact, all needed details are
      already provided in the documentation file bfq-iosched.txt. Yet the
      document lacks guidance on which parameter descriptions to look
      at. This commit adds some simple direction.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Reviewed-by: default avatarJeremy Hickman <jeremywh7@gmail.com>
      Reviewed-by: default avatarLaurentiu Nicola <lnicola@dend.ro>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      2670cd16
    • Paolo Valente's avatar
      doc, block, bfq: fix some typos and remove stale stuff · 233f0bf4
      Paolo Valente authored
      In addition to containing some typos and stale sentences, the file
      bfq-iosched.txt still mentioned a set of sysfs parameters that have
      been removed from this version of bfq. This commit fixes all these
      issues.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Reviewed-by: default avatarJeremy Hickman <jeremywh7@gmail.com>
      Reviewed-by: default avatarLaurentiu Nicola <lnicola@dend.ro>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      233f0bf4
    • Omar Sandoval's avatar
      loop: fold loop_switch() into callers · 43cade80
      Omar Sandoval authored
      The comments here are really outdated, and blk-mq made flushing much
      simpler, so just fold the two cases into the callers.
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      43cade80
    • Omar Sandoval's avatar
      loop: add ioctl for changing logical block size · 89e4fdec
      Omar Sandoval authored
      This is a different approach from the first attempt in f2c6df7d
      ("loop: support 4k physical blocksize"). Rather than extending
      LOOP_{GET,SET}_STATUS, add a separate ioctl just for setting the block
      size.
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      89e4fdec
    • Omar Sandoval's avatar
      loop: set physical block size to PAGE_SIZE · 6c6b6f28
      Omar Sandoval authored
      The physical block size is "the lowest possible sector size that the
      hardware can operate on without reverting to read-modify-write
      operations" (from the comment on blk_queue_physical_block_size()). Since
      loop does buffered I/O on the backing file by default, the RMW unit is a
      page. This isn't the case for direct I/O mode, but let's keep it simple.
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      6c6b6f28
    • Omar Sandoval's avatar
      loop: get rid of lo_blocksize · 8a0740c4
      Omar Sandoval authored
      This is only used for setting the soft block size on the struct
      block_device once and then never used again.
      Reviewed-by: default avatarMing Lei <ming.lei@redhat.com>
      Reviewed-by: default avatarHannes Reinecke <hare@suse.com>
      Signed-off-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8a0740c4
    • Paolo Valente's avatar
      block, bfq: guarantee update_next_in_service always returns an eligible entity · 24d90bb2
      Paolo Valente authored
      If the function bfq_update_next_in_service is invoked as a consequence
      of the activation or requeueing of an entity, say E, then it doesn't
      invoke bfq_lookup_next_entity to get the next-in-service entity. In
      contrast, it follows a shorter path: if E happens to be eligible (see
      commit "bfq-sq-mq: make lookup_next_entity push up vtime on
      expirations" for details on eligibility) and to have a lower virtual
      finish time than the current candidate as next-in-service entity, then
      E directly becomes the next-in-service entity. Unfortunately, there is
      a corner case for which this shorter path makes
      bfq_update_next_in_service choose a non eligible entity: it occurs if
      both E and the current next-in-service entity happen to be non
      eligible when bfq_update_next_in_service is invoked. In this case, E
      is not set as next-in-service, and, since bfq_lookup_next_entity is
      not invoked, the state of the parent entity is not updated so as to
      end up with an eligible entity as the proper next-in-service entity.
      
      In this respect, next-in-service is actually allowed to be non
      eligible while some queue is in service: since no system-virtual-time
      push-up can be performed in that case (see again commit "bfq-sq-mq:
      make lookup_next_entity push up vtime on expirations" for details),
      next-in-service is chosen, speculatively, as a function of the
      possible value that the system virtual time may get after a push
      up. But the correctness of the schedule breaks if next-in-service is
      still a non eligible entity when it is time to set in service the next
      entity. Unfortunately, this may happen in the above corner case.
      
      This commit fixes this problem by making bfq_update_next_in_service
      invoke bfq_lookup_next_entity not only if the above shorter path
      cannot be taken, but also if the shorter path is taken but fails to
      yield an eligible next-in-service entity.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      24d90bb2
    • Paolo Valente's avatar
      block, bfq: remove direct switch to an entity in higher class · a02195ce
      Paolo Valente authored
      If the function bfq_update_next_in_service is invoked as a consequence
      of the activation or requeueing of an entity, say E, and finds out
      that E belongs to a higher-priority class than that of the current
      next-in-service entity, then it sets next_in_service directly to
      E. But this may lead to anomalous schedules, because E may happen not
      be eligible for service, because its virtual start time is higher than
      the system virtual time for its service tree.
      
      This commit addresses this issue by simply removing this direct
      switch.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a02195ce
    • Paolo Valente's avatar
      block, bfq: make lookup_next_entity push up vtime on expirations · 80294c3b
      Paolo Valente authored
      To provide a very smooth service, bfq starts to serve a bfq_queue
      only if the queue is 'eligible', i.e., if the same queue would
      have started to be served in the ideal, perfectly fair system that
      bfq simulates internally. This is obtained by associating each
      queue with a virtual start time, and by computing a special system
      virtual time quantity: a queue is eligible only if the system
      virtual time has reached the virtual start time of the
      queue. Finally, bfq guarantees that, when a new queue must be set
      in service, there is always at least one eligible entity for each
      active parent entity in the scheduler. To provide this guarantee,
      the function __bfq_lookup_next_entity pushes up, for each parent
      entity on which it is invoked, the system virtual time to the
      minimum among the virtual start times of the entities in the
      active tree for the parent entity (more precisely, the push up
      occurs if the system virtual time happens to be lower than all
      such virtual start times).
      
      There is however a circumstance in which __bfq_lookup_next_entity
      cannot push up the system virtual time for a parent entity, even
      if the system virtual time is lower than the virtual start times
      of all the child entities in the active tree. It happens if one of
      the child entities is in service. In fact, in such a case, there
      is already an eligible entity, the in-service one, even if it may
      not be not present in the active tree (because in-service entities
      may be removed from the active tree).
      
      Unfortunately, in the last re-design of the
      hierarchical-scheduling engine, the reset of the pointer to the
      in-service entity for a given parent entity--reset to be done as a
      consequence of the expiration of the in-service entity--always
      happens after the function __bfq_lookup_next_entity has been
      invoked. This causes the function to think that there is still an
      entity in service for the parent entity, and then that the system
      virtual time cannot be pushed up, even if actually such a
      no-more-in-service entity has already been properly reinserted
      into the active tree (or in some other tree if no more
      active). Yet, the system virtual time *had* to be pushed up, to be
      ready to correctly choose the next queue to serve. Because of the
      lack of this push up, bfq may wrongly set in service a queue that
      had been speculatively pre-computed as the possible
      next-in-service queue, but that would no more be the one to serve
      after the expiration and the reinsertion into the active trees of
      the previously in-service entities.
      
      This commit addresses this issue by making
      __bfq_lookup_next_entity properly push up the system virtual time
      if an expiration is occurring.
      Signed-off-by: default avatarPaolo Valente <paolo.valente@linaro.org>
      Tested-by: default avatarLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: default avatarOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      80294c3b
  4. 30 Aug, 2017 6 commits
  5. 29 Aug, 2017 1 commit
    • Jens Axboe's avatar
      Merge branch 'nvme-4.14' of git://git.infradead.org/nvme into for-4.14/block-postmerge · 2b76da95
      Jens Axboe authored
      Pull NVMe changes from Christoph:
      
      "Below is the current set of NVMe updates for Linux 4.14, now against
       your postmerge branch, and with three more patches.
      
       The biggest bit comes from Sagi and refactors the RDMA driver to
       prepare for more code sharing in the setup and teardown path.  But we
       have various features and bug fixes from a lot of people as well."
      2b76da95