1. 01 Nov, 2017 9 commits
    • blk-mq: don't handle TAG_SHARED in restart · 358a3a6b
      Ming Lei authored
      Restart is currently used in the following cases, and TAG_SHARED is
      for SCSI only.
      
      1) .get_budget() returns BLK_STS_RESOURCE
      - if a resource at the target/host level isn't satisfied, this SCSI
      device will be added to shost->starved_list, and the whole queue will
      be rerun (via SCSI's built-in RESTART) in scsi_end_request() after any
      request initiated from this host/target is completed. Note that
      host-level resources can't be an issue for blk-mq at all.
      
      - the same is true if a resource at the queue level isn't satisfied.
      
      - if there is no outstanding request on this queue, then SCSI's RESTART
      can't work (nor can blk-mq's), the queue will be run after
      SCSI_QUEUE_DELAY, and finally all starved sdevs will be handled by
      SCSI's RESTART when this request is finished.
      
      2) scsi_dispatch_cmd() returns BLK_STS_RESOURCE
      - if there is no in-progress request on this queue, the queue
      will be run after SCSI_QUEUE_DELAY
      
      - otherwise, SCSI's RESTART covers the rerun.
      
      3) blk_mq_get_driver_tag() fails
      - BLK_MQ_S_TAG_WAITING covers the cross-queue RESTART for driver tag
      allocation.
      
      In short, SCSI's built-in RESTART is enough to cover the queue rerun,
      and we don't need to pay special attention to TAG_SHARED with respect
      to restart.
      
      In my test on scsi_debug (8 LUNs), this patch improves IOPS by 20%-30%
      when running I/O on these 8 LUNs concurrently.
      
      Also, Roman Pen reported that the current RESTART is very expensive,
      especially when there are lots of LUNs attached to one host; in his
      test, RESTART cut IOPS in half.
      
      Fixes: https://marc.info/?l=linux-kernel&m=150832216727524&w=2
      Fixes: 6d8c6c0f ("blk-mq: Restart a single queue if tag sets are shared")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • scsi: implement .get_budget and .put_budget for blk-mq · 0df21c86
      Ming Lei authored
      We need to tell blk-mq to reserve resources before queuing one
      request, so implement these two callbacks. Then blk-mq can avoid
      dequeuing a request too early, and IO merging can be improved a lot.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
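      To make the budget idea concrete, here is a minimal sketch of how such
      callbacks can gate dispatch against a per-device queue depth. The
      struct, field, and helper names are illustrative stand-ins, not the
      actual SCSI code:
      
        #include <stdatomic.h>
        #include <stdbool.h>
        
        /* Illustrative device: 'busy' counts requests owned by the LLD. */
        struct scsi_dev_sketch {
                atomic_int busy;
                int queue_depth;        /* e.g. a small .cmd_per_lun */
        };
        
        /* Reserve one unit of budget before a request is dequeued. */
        static bool sdev_get_budget(struct scsi_dev_sketch *sdev)
        {
                if (atomic_fetch_add(&sdev->busy, 1) >= sdev->queue_depth) {
                        atomic_fetch_sub(&sdev->busy, 1);  /* over depth */
                        return false;  /* rq stays queued, still mergeable */
                }
                return true;
        }
        
        /* Return the budget on completion or when dispatch didn't happen. */
        static void sdev_put_budget(struct scsi_dev_sketch *sdev)
        {
                atomic_fetch_sub(&sdev->busy, 1);
        }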
    • scsi: allow passing in null rq to scsi_prep_state_check() · aeec7762
      Ming Lei authored
      In the following patch, we will implement scsi_get_budget(), which
      needs to call scsi_prep_state_check() before the rq is dequeued.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq-sched: improve dispatching from sw queue · b347689f
      Ming Lei authored
      SCSI devices use a host-wide tagset, and the shared driver tag space
      is often quite big. However, there is also a per-LUN queue depth
      (.cmd_per_lun), which is often small; for example, on both lpfc and
      qla2xxx, .cmd_per_lun is just 3.
      
      So lots of requests may stay in the sw queue, and we always flush all
      of those belonging to the same hw queue and dispatch them all to the
      driver. Unfortunately this easily makes the queue busy because of the
      small .cmd_per_lun. Once these requests are flushed out, they have to
      stay in hctx->dispatch, where no bio merging can happen on them, and
      sequential IO performance is harmed.
      
      This patch introduces blk_mq_dequeue_from_ctx for dequeuing one
      request from a sw queue, so that we can dispatch in the scheduler's
      way. We can then avoid dequeuing too many requests from the sw queue,
      since we don't flush ->dispatch completely.
      
      This patch improves dispatching from sw queue by using the .get_budget
      and .put_budget callbacks.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
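      A rough sketch of the dispatch loop this enables: pull one request at
      a time from the sw queues in round-robin order, and stop as soon as
      budget runs out. All names below are simplified stand-ins for blk-mq's
      real structures, not the actual code:
      
        #include <stdbool.h>
        
        struct request { struct request *next; };
        
        struct sw_queue { struct request *head; };   /* per-CPU ctx */
        
        struct hw_queue {
                struct sw_queue *ctxs;  /* sw queues mapped to this hctx */
                int nr_ctx;
                int next_ctx;           /* round-robin cursor, kept across runs */
        };
        
        extern bool get_budget(struct hw_queue *h);
        extern void put_budget(struct hw_queue *h);
        extern bool dispatch_one(struct request *rq);  /* false: device busy */
        
        static void dispatch_from_sw_queues(struct hw_queue *h)
        {
                int empty_seen = 0;
        
                while (empty_seen < h->nr_ctx) {
                        struct sw_queue *ctx = &h->ctxs[h->next_ctx];
                        struct request *rq = ctx->head;
        
                        h->next_ctx = (h->next_ctx + 1) % h->nr_ctx;
                        if (!rq) {
                                empty_seen++;  /* stop after a full empty sweep */
                                continue;
                        }
                        empty_seen = 0;
        
                        if (!get_budget(h))
                                return;        /* rest stays queued, mergeable */
                        ctx->head = rq->next;  /* dequeue just this one */
                        if (!dispatch_one(rq)) {
                                put_budget(h); /* real code would requeue rq */
                                return;        /* device busy: stop pulling */
                        }
                }
        }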
    • blk-mq: introduce .get_budget and .put_budget in blk_mq_ops · de148297
      Ming Lei authored
      For SCSI devices, there is often a per-request-queue depth, which needs
      to be respected before queuing one request.
      
      Currently blk-mq always dequeues the request first, then calls
      .queue_rq() to dispatch the request to the LLD. One obvious issue with
      this approach is that I/O merging may not be successful, because when
      the per-request-queue depth can't be respected, .queue_rq() has to
      return BLK_STS_RESOURCE, and the request then has to stay on the
      hctx->dispatch list. This means it never gets a chance to be merged
      with other IO.
      
      This patch introduces the .get_budget and .put_budget callbacks in
      blk_mq_ops, so we can try to get the reserved budget first, before
      dequeuing a request. If the budget for queueing I/O can't be
      satisfied, we don't need to dequeue the request at all. Hence the
      request can be left in the IO scheduler queue, for more merging
      opportunities.
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
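      To illustrate the shape of these hooks, a hedged sketch of an ops
      table with budget callbacks and the dequeue-side ordering that uses
      them. This is simplified pseudo-kernel C under assumed names, not the
      actual blk-mq implementation:
      
        #include <stdbool.h>
        #include <stddef.h>
        
        struct hw_ctx;
        struct request;
        
        enum sts { STS_OK, STS_RESOURCE };
        
        struct mq_ops_sketch {
                bool (*get_budget)(struct hw_ctx *hctx);
                void (*put_budget)(struct hw_ctx *hctx);
                enum sts (*queue_rq)(struct hw_ctx *hctx, struct request *rq);
        };
        
        /* Key ordering: reserve budget *before* taking rq off the scheduler. */
        static void dispatch_sketch(const struct mq_ops_sketch *ops,
                                    struct hw_ctx *hctx,
                                    struct request *(*dequeue)(struct hw_ctx *),
                                    void (*requeue)(struct hw_ctx *, struct request *))
        {
                for (;;) {
                        struct request *rq;
        
                        if (ops->get_budget && !ops->get_budget(hctx))
                                return;  /* rq never leaves the scheduler,
                                          * so it can still be merged */
        
                        rq = dequeue(hctx);
                        if (!rq) {
                                if (ops->put_budget)
                                        ops->put_budget(hctx);  /* unused budget */
                                return;
                        }
                        if (ops->queue_rq(hctx, rq) == STS_RESOURCE) {
                                if (ops->put_budget)
                                        ops->put_budget(hctx);
                                requeue(hctx, rq);      /* try again later */
                                return;
                        }
                }
        }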
    • block: kyber: check if there are requests in ctx in kyber_has_work() · 63ba8e31
      Ming Lei authored
      There may be requests in the sw queue that have not been fetched to a
      domain queue yet, so check for them in kyber_has_work().
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • sbitmap: introduce __sbitmap_for_each_set() · 7930d0a0
      Ming Lei authored
      For blk-mq, we need to be able to iterate software queues starting
      from any queue, in a round-robin fashion, so introduce this helper.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
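      The essence of the helper is a wrap-around scan that may begin at any
      bit. A small self-contained sketch of that pattern over a plain word
      array; the real __sbitmap_for_each_set walks word-at-a-time and is
      more efficient than this bit-at-a-time version:
      
        #include <stdbool.h>
        #include <stddef.h>
        
        #define BITS_PER_WORD (8 * sizeof(unsigned long))
        
        /* Visit every set bit in bits[0..nbits), starting at 'start' and
         * wrapping around, until fn() returns false. Assumes nbits > 0. */
        static void for_each_set_wrap(const unsigned long *bits, size_t nbits,
                                      size_t start,
                                      bool (*fn)(size_t bit, void *data),
                                      void *data)
        {
                size_t i = start % nbits;
        
                for (size_t n = 0; n < nbits; n++, i = (i + 1) % nbits) {
                        size_t w = i / BITS_PER_WORD, b = i % BITS_PER_WORD;
        
                        if (((bits[w] >> b) & 1UL) && !fn(i, data))
                                return;
                }
        }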
    • blk-mq-sched: move actual dispatching into one helper · caf8eb0d
      Ming Lei authored
      So that it becomes easy to support dispatching from the sw queue in
      the following patch.
      
      No functional change.
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Suggested-by: Christoph Hellwig <hch@lst.de> # for simplifying dispatch logic
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq-sched: dispatch from scheduler IFF progress is made in ->dispatch · 5e3d02bb
      Ming Lei authored
      When the hw queue is busy, we shouldn't take requests from the
      scheduler queue any more; otherwise it is difficult to do IO merging.
      
      This patch fixes the awful IO performance on some SCSI devices (lpfc,
      qla2xxx, ...) when mq-deadline/kyber is used, by not taking requests
      if the hw queue is busy.
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
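      A sketch of the resulting run order: drain the leftover ->dispatch
      list first, and only pull fresh requests from the scheduler if that
      drain made progress. Names are illustrative, not the blk-mq-sched
      source:
      
        #include <stdbool.h>
        
        struct hctx_sketch {
                bool has_leftovers;  /* requests parked on ->dispatch earlier */
        };
        
        extern bool drain_dispatch_list(struct hctx_sketch *h); /* true: progress */
        extern void pull_from_scheduler(struct hctx_sketch *h);
        
        static void run_hw_queue(struct hctx_sketch *h)
        {
                if (h->has_leftovers && !drain_dispatch_list(h))
                        return;  /* hw queue still busy; leaving requests in
                                  * the scheduler keeps them mergeable */
                pull_from_scheduler(h);
        }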
  2. 31 Oct, 2017 1 commit
  3. 30 Oct, 2017 6 commits
    • bcache: explicitly destroy mutex while exiting · 330a4db8
      Liang Chen authored
      mutex_destroy does nothing most of the time, but it's better to call
      it to make the code future-proof, and it also has some value for
      things like mutex debugging.
      
      As Coly pointed out in a previous review, bcache_exit() may not be
      able to handle all the references properly if userspace registers
      cache and backing devices right before bch_debug_init runs and
      bch_debug_init fails later. So don't expose the userspace interface
      until everything is ready, to avoid that issue.
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Reviewed-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: fix wrong cache_misses statistics · c1573137
      tang.junhui authored
      Currently, cache-missed IOs are identified by s->cache_miss, but in
      fact there are many situations in which missed IOs are not assigned a
      value for s->cache_miss in cached_dev_cache_miss(): for example, a
      bypassed IO (s->iop.bypass = 1), or a failed cache_bio allocation. In
      these situations, control goes to out_put or out_submit with
      s->cache_miss NULL, which leads bch_mark_cache_accounting() to treat
      the IO as a cache hit.
      
      [ML: applied by 3-way merge]
      Signed-off-by: tang.junhui <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: update bucket_in_use in real time · d44c2f9e
      Tang Junhui authored
      bucket_in_use is updated in the gc thread, which is triggered by
      invalidating or writing sectors_to_gc dirty data; that is a long
      interval. Therefore, when we use it to compare against the threshold,
      it is often stale, which leads to inaccurate judgments and often
      results in bucket depletion.
      
      We sent a patch before that updated bucket_in_use periodically in the
      gc thread, but Coly thought that would lead to high latency. In this
      patch, we add avail_nbuckets to record the count of available buckets,
      and we recalculate bucket_in_use in real time when a bucket is
      allocated or freed.
      
      [edited by ML: eliminated some whitespace errors]
      Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
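      A sketch of the bookkeeping this implies: keep an available-bucket
      counter updated at alloc/free time and derive the in-use percentage
      from it. Field and function names are illustrative, not bcache's
      actual code:
      
        #include <stddef.h>
        
        struct cache_set_sketch {
                size_t nbuckets;        /* total buckets; assumed nonzero */
                size_t avail_nbuckets;  /* updated on every alloc/free */
                unsigned in_use_pct;    /* consumed by the threshold check */
        };
        
        static void update_bucket_in_use(struct cache_set_sketch *c)
        {
                c->in_use_pct = (unsigned)((c->nbuckets - c->avail_nbuckets)
                                           * 100 / c->nbuckets);
        }
        
        static void on_bucket_alloc(struct cache_set_sketch *c)
        {
                c->avail_nbuckets--;    /* caller holds the allocator lock */
                update_bucket_in_use(c);
        }
        
        static void on_bucket_free(struct cache_set_sketch *c)
        {
                c->avail_nbuckets++;
                update_bucket_in_use(c);
        }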
    • bcache: convert cached_dev.count from atomic_t to refcount_t · 3b304d24
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable cached_dev.count is used as a pure reference counter.
      Convert it to refcount_t and fix up the operations.
      Suggested-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
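      The conversion pattern, sketched against the kernel's refcount_t API;
      the struct and helpers below are a schematic of the properties listed
      above, not the exact bcache diff:
      
        #include <linux/refcount.h>
        
        struct obj {
                refcount_t count;
        };
        
        static void obj_init(struct obj *o)
        {
                refcount_set(&o->count, 1);     /* was atomic_set(..., 1) */
        }
        
        static void obj_get(struct obj *o)
        {
                refcount_inc(&o->count);        /* was atomic_inc(); saturates
                                                 * instead of overflowing */
        }
        
        static bool obj_tryget(struct obj *o)
        {
                return refcount_inc_not_zero(&o->count); /* was atomic_inc_not_zero() */
        }
        
        static void obj_put(struct obj *o, void (*release)(struct obj *))
        {
                if (refcount_dec_and_test(&o->count))  /* was atomic_dec_and_test() */
                        release(o);
        }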
    • bcache: only permit to recovery read error when cache device is clean · d59b2379
      Coly Li authored
      When bcache does read I/Os, for example in writeback or writethrough
      mode, if a read request on the cache device fails, bcache will try to
      recover the request by reading from the cached device. If the data on
      the cached device is not in sync with the cache device, the requester
      will get stale data.
      
      For a critical storage system like a database, serving stale data from
      recovery may result in application-level data corruption, which is
      unacceptable.
      
      With this patch, for a failed read request in writeback or writethrough
      mode, recovery of a recoverable read request only happens when the
      cache device is clean, that is to say, when all data on the cached
      device is up to date.
      
      For other cache modes in bcache, read requests never hit
      cached_dev_read_error(), so they don't need this patch.
      
      Please note, because the cache mode can be switched arbitrarily at run
      time, a writethrough mode might have been switched from a writeback
      mode. Therefore checking dc->has_data in writethrough mode still makes
      sense.
      
      Changelog:
      v4: Fix parens error pointed out by Michael Lyle.
      v3: Per response from Kent Overstreet, recovering stale data is a bug
          to fix, and an option to permit it is unnecessary. So in this
          version the sysfs file is removed.
      v2: rename sysfs entry from allow_stale_data_on_failure to
          allow_stale_data_on_failure, and fix the confusing commit log.
      v1: initial patch posted.
      
      [small change to patch comment spelling by mlyle]
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reported-by: Arne Wolf <awolf@lenovo.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Nix <nix@esperi.org.uk>
      Cc: Kai Krakow <hurikhan77@gmail.com>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: Fix a race between blk_cleanup_queue() and timeout handling · 4e9b6f20
      Bart Van Assche authored
      Make sure that if the timeout timer fires after a queue has been
      marked "dying", the affected requests are finished.
      Reported-by: chenxiang (M) <chenxiang66@hisilicon.com>
      Fixes: commit 287922eb ("block: defer timeouts to a workqueue")
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Tested-by: chenxiang (M) <chenxiang66@hisilicon.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 25 Oct, 2017 7 commits
  5. 24 Oct, 2017 1 commit
  6. 17 Oct, 2017 2 commits
    • kyber: fix hang on domain token wait queue · 8cf46660
      Omar Sandoval authored
      When we're getting a domain token, if we fail to get a token on our
      first attempt, we put the current hardware queue on a wait queue and
      then try again just in case a token was freed after our initial attempt
      but before we got on the wait queue. If this second attempt succeeds, we
      currently leave the hardware queue on the wait queue. Usually this is
      okay; we'll just run the hardware queue one extra time when another
      token is freed. However, if the hardware queue doesn't have any other
      requests waiting, then when it gets the extra wakeup, it won't have
      anything to free and therefore won't wake up any other hardware queues.
      If tokens are limited, then we won't make forward progress and the
      device will hang.
      Reported-by: Bin Zha <zhabin.zb@alibaba-inc.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
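      A sketch of the fix's shape: after a successful second attempt, take
      the queue back off the wait list so a future wakeup isn't spent on a
      queue with nothing to free. The helpers are simplified stand-ins, not
      kyber's actual wait-queue code:
      
        #include <stdbool.h>
        
        struct waiter { bool on_wait_queue; };
        
        extern int  try_get_token(void);               /* >= 0 on success */
        extern void add_to_wait_queue(struct waiter *w);
        extern void remove_from_wait_queue(struct waiter *w);
        
        static int get_domain_token_sketch(struct waiter *w)
        {
                int tok = try_get_token();
        
                if (tok < 0) {
                        add_to_wait_queue(w);
                        w->on_wait_queue = true;
        
                        /* Retry: a token may have been freed between the
                         * first attempt and joining the wait queue. */
                        tok = try_get_token();
                        if (tok >= 0) {
                                /* The fix: drop off the wait list on success.
                                 * Staying on it could consume a later wakeup
                                 * while holding no work, stalling others. */
                                remove_from_wait_queue(w);
                                w->on_wait_queue = false;
                        }
                }
                return tok;     /* negative: we'll be woken up later */
        }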
    • nullb: fix error return code in null_init() · 30c516d7
      Wei Yongjun authored
      Return the error code -ENOMEM from the null_alloc_dev() error handling
      case instead of 0, as is done elsewhere in this function.
      
      Fixes: 2984c868 ("nullb: factor disk parameters")
      Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
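      The bug follows a common goto-cleanup pattern: an error path is
      reached with ret still holding 0. A schematic before/after, with
      illustrative names rather than the nullb source:
      
        #include <errno.h>
        #include <stdlib.h>
        
        static void *alloc_dev_sketch(void) { return malloc(16); }
        
        static int init_sketch(void)
        {
                int ret = 0;
                void *dev = alloc_dev_sketch();
        
                if (!dev) {
                        ret = -ENOMEM;  /* the fix; before, ret stayed 0 and
                                         * the goto path reported success */
                        goto out;
                }
                /* ... rest of initialization ... */
                free(dev);
        out:
                return ret;
        }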
  7. 16 Oct, 2017 14 commits
    • block: fix Sphinx kernel-doc warning · 519c8e9f
      Randy Dunlap authored
      Sphinx treats symbols that end with '_' as a kind of special
      documentation indicator, so fix that by adding an ending '*'
      to it.
      
      ../block/bio.c:404: ERROR: Unknown target name: "gfp".
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: writeback rate clamping: make 32 bit safe · 9ce762e8
      Michael Lyle authored
      Sorry this got through to linux-block; it was detected by the kbuild
      test robot. NSEC_PER_SEC is a long constant; 2.5 * 10^9 doesn't fit in
      a signed long constant.
      
      Fixes: e41166c5 ("bcache: writeback rate shouldn't artifically clamp")
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
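      The underlying pitfall, sketched: on a 32-bit target long is 32 bits,
      so a constant near 2.5e9 overflows unless the arithmetic is widened
      first. A generic illustration, assuming nothing about the bcache code
      beyond the commit text:
      
        #include <stdint.h>
        
        #define NSEC_PER_SEC 1000000000L  /* 'long': 32 bits on many 32-bit ABIs */
        
        /* Overflows there, since the multiply happens in signed long:
         *     long bad = NSEC_PER_SEC * 5 / 2;
         * Safe: widen before the multiply can overflow. */
        static const int64_t max_sleep_ns = (int64_t)NSEC_PER_SEC * 5 / 2; /* 2.5 s */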
    • bcache: MAINTAINERS: set bcache to MAINTAINED · 52b69ff5
      Michael Lyle authored
      Also add URL for IRC channel.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • 77c77a98
      Kent Overstreet authored
    • bcache: safeguard a dangerous addressing in closure_queue · 6446c684
      Liang Chen authored
      The use of the union reduces the size of the closure struct by taking
      advantage of the current layout of its members. The offset of func in
      work_struct equals the size of the first three members, so that
      work.work_func will just reference the fourth member, fn.
      
      This is smart but dangerous: it can break if work_struct or the other
      structs get changed, and it can be a bit difficult to debug.
      Signed-off-by: Liang Chen <liangchen.linux@gmail.com>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
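      The hazard, and one way to guard it, sketched in C11: overlaying a
      union on a struct only works while the overlapping member offsets line
      up, which a static assertion can enforce at compile time. Illustrative
      types, not bcache's closure:
      
        #include <assert.h>
        #include <stddef.h>
        
        struct work_like {
                long data;
                void *entry_prev, *entry_next;
                void (*func)(struct work_like *);
        };
        
        struct closure_like {
                union {
                        struct {
                                void *a, *b, *c;  /* first three members */
                                void (*fn)(struct closure_like *);
                        };
                        struct work_like work;    /* work.func aliases fn */
                };
        };
        
        /* If anyone reorders work_like or the leading members, fail the
         * build instead of corrupting a function pointer at run time. */
        static_assert(offsetof(struct closure_like, fn) ==
                      offsetof(struct closure_like, work.func),
                      "fn must overlay work.func");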
    • bcache: rearrange writeback main thread ratelimit · a8500fc8
      Michael Lyle authored
      The time spent searching for things to write back "counts" for the
      actual rate achieved, so don't flush the accumulated rate with each
      chunk.
      
      This will maintain better fidelity to user-commanded rates, but it
      may slightly increase the burstiness of writeback.  The writeback
      lock needs improvement to help mitigate this.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: writeback rate shouldn't artifically clamp · e41166c5
      Michael Lyle authored
      The previous code artificially limited writeback rate to 1000000
      blocks/second (NSEC_PER_MSEC), which is a rate that can be met on fast
      hardware.  The rate limiting code works fine (though with decreased
      precision) up to 3 orders of magnitude faster, so use NSEC_PER_SEC.
      
      Additionally, ensure that uint32_t is used as a type for rate throughout
      the rate management so that type checking/clamp_t can work properly.
      
      bch_next_delay should be rewritten for increased precision and better
      handling of high rates and long sleep periods, but this is adequate for
      now.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reported-by: Coly Li <colyli@suse.de>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
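      For reference, the clamp pattern this relies on, with the range check
      done in 64 bits and the result held in uint32_t so a signed/width
      mismatch can't defeat it. The bounds come from the commit text; the
      rest is a generic illustration:
      
        #include <stdint.h>
        
        #define NSEC_PER_SEC 1000000000ULL
        
        static uint32_t clamp_rate(uint64_t computed)
        {
                if (computed < 1)
                        computed = 1;
                if (computed > NSEC_PER_SEC)  /* new ceiling: 10^9, not 10^6 */
                        computed = NSEC_PER_SEC;
                return (uint32_t)computed;
        }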
    • bcache: smooth writeback rate control · ae82ddbf
      Michael Lyle authored
      This works in conjunction with the new PI controller.  Currently, in
      real-world workloads, the rate controller attempts to write back 1
      sector per second.  In practice, these minimum-rate writebacks are
      between 4k and 60k in test scenarios, since bcache aggregates and
      attempts to do contiguous writes and because filesystems on top of
      bcache typically write 4k or more.
      
      Previously, bcache used to guarantee to write at least once per second.
      This means that the actual writeback rate would exceed the configured
      amount by a factor of 8-120 or more.
      
      This patch adjusts the writeback thread to be willing to sleep up to
      2.5 seconds, and to target writing 4k/second. On the smallest writes,
      it will sleep 1
      second like before, but many times it will sleep longer and load the
      backing device less.  This keeps the loading on the cache and backing
      device related to writeback more consistent when writing back at low
      rates.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: implement PI controller for writeback rate · 1d316e65
      Michael Lyle authored
      bcache uses a control system to attempt to keep the amount of dirty data
      in cache at a user-configured level, while not responding excessively to
      transients and variations in write rate.  Previously, the system was a
      PD controller; but the output from it was integrated, turning the
      Proportional term into an Integral term, and turning the Derivative term
      into a crude Proportional term.  Performance of the controller has been
      uneven in production, and it has tended to respond slowly, oscillate,
      and overshoot.
      
      This patch set replaces the current control system with an explicit PI
      controller and tuning that should be correct for most hardware.  By
      default, it attempts to write at a rate that would retire 1/40th of the
      current excess blocks per second.  An integral term in turn works to
      remove steady state errors.
      
      IMO, this yields benefits in simplicity (removing weighted average
      filtering, etc) and system performance.
      
      Another small change: a tunable parameter is introduced to allow the
      user to specify a minimum rate at which dirty blocks are retired.
      
      There is a slight difference from earlier versions of the patch in
      integral handling to prevent excessive negative integral windup.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
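      A compact sketch of a PI controller of the kind described: the rate is
      a proportional term on the dirty-data excess plus a slowly accumulated
      integral term, with a crude anti-windup clamp. Constants and names are
      illustrative, not bcache's tunables:
      
        #include <stdint.h>
        
        struct pi_ctrl {
                int64_t integral;   /* accumulated error */
                int64_t rate_min;   /* user-settable floor (blocks/sec) */
                int64_t rate_max;
        };
        
        /* error = dirty - target; positive means too much dirty data. */
        static int64_t pi_update(struct pi_ctrl *c, int64_t error)
        {
                int64_t p = error / 40;   /* retire 1/40th of the excess/sec */
                int64_t rate;
        
                c->integral += error / 1024;  /* removes steady-state error */
                if (c->integral < 0)
                        c->integral = 0;      /* avoid negative windup */
        
                rate = p + c->integral;
                if (rate < c->rate_min)
                        rate = c->rate_min;
                if (rate > c->rate_max)
                        rate = c->rate_max;
                return rate;
        }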
    • bcache: don't write back data if reading it failed · 5fa89fb9
      Michael Lyle authored
      If an IO operation fails, and we didn't successfully read data from the
      cache, don't writeback invalid/partial data to the backing disk.
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: remove unused parameter · 23850102
      Yijing Wang authored
      The bio parameter is no longer used; remove it.
      Signed-off-by: Yijing Wang <wangyijing@huawei.com>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: update bio->bi_opf bypass/writeback REQ_ flag hints · b41c9b02
      Eric Wheeler authored
      Flag for bypass if the IO is for read-ahead or background, unless the
      read-ahead request is for metadata (e.g., from gfs2).
      
              Bypass if:
                      bio->bi_opf & (REQ_RAHEAD|REQ_BACKGROUND) &&
                              !(bio->bi_opf & REQ_META)
      
              Writeback if:
                      op_is_sync(bio->bi_opf) ||
                              bio->bi_opf & (REQ_META|REQ_PRIO)
      Signed-off-by: Eric Wheeler <bcache@linux.ewheeler.net>
      Reviewed-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
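      As a self-contained illustration of those two predicates; the REQ_*
      values and op_is_sync() below are stand-ins mirroring the kernel flags
      named above, not the kernel headers:
      
        #include <stdbool.h>
        
        #define REQ_RAHEAD      (1u << 0)   /* illustrative bit values */
        #define REQ_BACKGROUND  (1u << 1)
        #define REQ_META        (1u << 2)
        #define REQ_PRIO        (1u << 3)
        #define REQ_SYNC        (1u << 4)
        
        static bool op_is_sync(unsigned int opf) { return opf & REQ_SYNC; }
        
        /* Bypass the cache for read-ahead/background IO, unless it is
         * metadata read-ahead (e.g. gfs2 sets REQ_RAHEAD|REQ_META). */
        static bool should_bypass(unsigned int opf)
        {
                return (opf & (REQ_RAHEAD | REQ_BACKGROUND)) &&
                       !(opf & REQ_META);
        }
        
        /* Writeback when the IO is synchronous or metadata/priority. */
        static bool should_writeback(unsigned int opf)
        {
                return op_is_sync(opf) || (opf & (REQ_META | REQ_PRIO));
        }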
    • bcache: Remove redundant set_capacity · e89d6759
      Yijing Wang authored
      set_capacity() has already been called in bcache_device_init(),
      so remove the redundant call.
      Signed-off-by: Yijing Wang <wangyijing@huawei.com>
      Reviewed-by: Eric Wheeler <bcache@lists.ewheeler.net>
      Acked-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • bcache: rewrite multiple partitions support · 1dbe32ad
      Coly Li authored
      The current partition support in bcache is confusing and buggy. It
      tries to track non-contiguous device minor numbers with an ida bit
      string, and mistakenly mixes up the bcache device index with minor
      numbers. This design produces several bad results:
      - Bcache device names are not indexed consecutively under /dev/. If
        there are 3 bcache devices, their names will be
        /dev/bcache0, /dev/bcache16, /dev/bcache32
        Only bcache indexes device names in such an interesting way.
      - The first minor number of each bcache device is tracked by the ida
        bit string. One bcache device occupies 16 bits, which is not a good
        idea; indeed a single bit is enough.
      - Because minor numbers and bcache device indexes are mixed up, a
        device index is allocated by ida_simple_get(), but a first minor
        number is passed to ida_simple_remove() to release the device. It
        confused the original author too.
      
      The root cause of the above errors is that bcache should not handle
      device minor numbers at all! The standard way to support multiple
      partitions in the Linux kernel is:
      - The device driver provides the major device number and indexes
        multiple device instances.
      - The device driver neither allocates nor tracks device minor numbers;
        it only provides the first minor number of a given device instance,
        and sets how many minor numbers (partitions) the device instance may
        have.
      Everything else is handled by block layer code; most of the details
      can be found in block/{genhd, partition-generic}.c.
      
      This patch rewrites multiple partitions support for bcache. It makes
      the whole thing clearer, and uses the ida bit string in a more
      efficient way:
      - The ida bit string only tracks the bcache device index, not minor
        numbers. For a bcache device with 128 partitions, a single bit in
        the ida bit string is enough.
      - Device minor number and device index are separated in concept. The
        device index is used for /dev node naming and ida bit string
        tracking; the minor number is calculated from the device index and
        only used to initialize the first_minor of a bcache device.
      - No standard mandates 16 partitions per bcache device. This patch
        allows at most 128 partitions on a single bcache device, which is
        the limit of GPT (GUID Partition Table) and is supported by fdisk.
      
      Considering that a typical device minor number is 20 bits wide, with
      128 partitions (7 bits) per bcache device there can be 8192 bcache
      devices on a system. For the most common single-server deployments
      nowadays, that should be enough.
      
      [minor spelling fixes in commit message by Michael Lyle]
      Signed-off-by: Coly Li <colyli@suse.de>
      Cc: Eric Wheeler <bcache@lists.ewheeler.net>
      Cc: Junhui Tang <tang.junhui@zte.com.cn>
      Reviewed-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Michael Lyle <mlyle@lyle.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
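      The arithmetic this scheme implies, sketched; BCACHE_MINORS and the
      helper names are illustrative renderings of the commit message, not
      necessarily the patch's identifiers:
      
        #include <stdio.h>
        
        #define MINORBITS      20            /* typical minor-number width */
        #define BCACHE_MINORS  128           /* partitions per device (7 bits) */
        #define MAX_DEVICES    (1 << (MINORBITS - 7))   /* 8192 indexes */
        
        /* The ida allocates idx; the minor range is derived, never tracked. */
        static int idx_to_first_minor(int idx)
        {
                return idx * BCACHE_MINORS;  /* bcache0 -> 0, bcache1 -> 128 */
        }
        
        int main(void)
        {
                printf("bcache1 first_minor=%d, max devices=%d\n",
                       idx_to_first_minor(1), MAX_DEVICES);
                return 0;
        }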