1. 15 Feb, 2019 6 commits
    • Ming Lei's avatar
      block: use bio_for_each_bvec() to compute multi-page bvec count · dcebd755
      Ming Lei authored
      First it is more efficient to use bio_for_each_bvec() in both
      blk_bio_segment_split() and __blk_recalc_rq_segments() to compute how
      many multi-page bvecs there are in the bio.
      
      Secondly once bio_for_each_bvec() is used, the bvec may need to be
      splitted because its length can be very longer than max segment size,
      so we have to split the big bvec into several segments.
      
      Thirdly when splitting multi-page bvec into segments, the max segment
      limit may be reached, so the bio split need to be considered under
      this situation too.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dcebd755
    • Ming Lei's avatar
      block: introduce bio_for_each_bvec() and rq_for_each_bvec() · d18d9174
      Ming Lei authored
      bio_for_each_bvec() is used for iterating over multi-page bvec for bio
      split & merge code.
      
      rq_for_each_bvec() can be used for drivers which may handle the
      multi-page bvec directly, so far loop is one perfect use case.
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d18d9174
    • Ming Lei's avatar
      block: introduce multi-page bvec helpers · 3d75ca0a
      Ming Lei authored
      This patch introduces helpers of 'mp_bvec_iter_*' for multi-page bvec
      support.
      
      The introduced helpers treate one bvec as real multi-page segment,
      which may include more than one pages.
      
      The existed helpers of bvec_iter_* are interfaces for supporting current
      bvec iterator which is thought as single-page by drivers, fs, dm and
      etc. These introduced helpers will build single-page bvec in flight, so
      this way won't break current bio/bvec users, which needn't any change.
      
      Follows some multi-page bvec background:
      
      - bvecs stored in bio->bi_io_vec is always multi-page style
      
      - bvec(struct bio_vec) represents one physically contiguous I/O
        buffer, now the buffer may include more than one page after
        multi-page bvec is supported, and all these pages represented
        by one bvec is physically contiguous. Before multi-page bvec
        support, at most one page is included in one bvec, we call it
        single-page bvec.
      
      - .bv_page of the bvec points to the 1st page in the multi-page bvec
      
      - .bv_offset of the bvec is the offset of the buffer in the bvec
      
      The effect on the current drivers/filesystem/dm/bcache/...:
      
      - almost everyone supposes that one bvec only includes one single
        page, so we keep the sp interface not changed, for example,
        bio_for_each_segment() still returns single-page bvec
      
      - bio_for_each_segment_all() will return single-page bvec too
      
      - during iterating, iterator variable(struct bvec_iter) is always
        updated in multi-page bvec style, and bvec_iter_advance() is kept
        not changed
      
      - returned(copied) single-page bvec is built in flight by bvec
        helpers from the stored multi-page bvec
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      3d75ca0a
    • Ming Lei's avatar
      block: remove bvec_iter_rewind() · 19d62f6d
      Ming Lei authored
      Commit 7759eb23 ("block: remove bio_rewind_iter()") removes
      bio_rewind_iter(), then no one uses bvec_iter_rewind() any more,
      so remove it.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      19d62f6d
    • Ming Lei's avatar
      block: don't use bio->bi_vcnt to figure out segment number · 1a67356e
      Ming Lei authored
      It is wrong to use bio->bi_vcnt to figure out how many segments
      there are in the bio even though CLONED flag isn't set on this bio,
      because this bio may be splitted or advanced.
      
      So always use bio_segments() in blk_recount_segments(), and it shouldn't
      cause any performance loss now because the physical segment number is figured
      out in blk_queue_split() and BIO_SEG_VALID is set meantime since
      bdced438 ("block: setup bi_phys_segments after splitting").
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarChristoph Hellwig <hch@lst.de>
      Fixes: 76d8137a ("blk-merge: recaculate segment if it isn't less than max segments")
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      1a67356e
    • Christoph Hellwig's avatar
      btrfs: look at bi_size for repair decisions · 8a2ee44a
      Christoph Hellwig authored
      bio_readpage_error currently uses bi_vcnt to decide if it is worth
      retrying an I/O.  But the vector count is mostly an implementation
      artifact - it really should figure out if there is more than a
      single sector worth retrying.  Use bi_size for that and shift by
      PAGE_SHIFT.  This really should be blocks/sectors, but given that
      btrfs doesn't support a sector size different from the PAGE_SIZE
      using the page size keeps the changes to a minimum.
      Reviewed-by: default avatarOmar Sandoval <osandov@fb.com>
      Reviewed-by: default avatarDavid Sterba <dsterba@suse.com>
      Signed-off-by: default avatarChristoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8a2ee44a
  2. 11 Feb, 2019 10 commits
  3. 10 Feb, 2019 2 commits
  4. 09 Feb, 2019 21 commits
    • Jens Axboe's avatar
      block: queue flag cleanup · eca7abf3
      Jens Axboe authored
      We have QUEUE_FLAG_DEFAULT defined, but it's not used anymore since
      the legacy IO stack is gone. Kill it.
      
      Sanitize the queue flags in general, they use spaces (for some
      reason), and the space is pretty sparse. With the flags renumbered,
      we can more clearly see how many we have available.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      eca7abf3
    • Jens Axboe's avatar
      block: kill QUEUE_FLAG_FLUSH_NQ · d11a3998
      Jens Axboe authored
      We have various helpers for setting/clearing this flag, and also
      a helper to check if the queue supports queueable flushes or not.
      But nobody uses them anymore, kill it with fire.
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d11a3998
    • Coly Li's avatar
      bcache: use (REQ_META|REQ_PRIO) to indicate bio for metadata · dc7292a5
      Coly Li authored
      In 'commit 752f66a7 ("bcache: use REQ_PRIO to indicate bio for
      metadata")' REQ_META is replaced by REQ_PRIO to indicate metadata bio.
      This assumption is not always correct, e.g. XFS uses REQ_META to mark
      metadata bio other than REQ_PRIO. This is why Nix noticed that bcache
      does not cache metadata for XFS after the above commit.
      
      Thanks to Dave Chinner, he explains the difference between REQ_META and
      REQ_PRIO from view of file system developer. Here I quote part of his
      explanation from mailing list,
         REQ_META is used for metadata. REQ_PRIO is used to communicate to
         the lower layers that the submitter considers this IO to be more
         important that non REQ_PRIO IO and so dispatch should be expedited.
      
         IOWs, if the filesystem considers metadata IO to be more important
         that user data IO, then it will use REQ_PRIO | REQ_META rather than
         just REQ_META.
      
      Then it seems bios with REQ_META or REQ_PRIO should both be cached for
      performance optimation, because they are all probably low I/O latency
      demand by upper layer (e.g. file system).
      
      So in this patch, when we want to decide whether to bypass the cache,
      REQ_META and REQ_PRIO are both checked. Then both metadata and
      high priority I/O requests will be handled properly.
      Reported-by: default avatarNix <nix@esperi.org.uk>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Reviewed-by: default avatarAndre Noll <maan@tuebingen.mpg.de>
      Tested-by: default avatarNix <nix@esperi.org.uk>
      Cc: stable@vger.kernel.org
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dc7292a5
    • Coly Li's avatar
      bcache: fix input overflow to cache set sysfs file io_error_halflife · a91fbda4
      Coly Li authored
      Cache set sysfs entry io_error_halflife is used to set c->error_decay.
      c->error_decay is in type unsigned int, and it is converted by
      strtoul_or_return(), therefore overflow to c->error_decay is possible
      for a large input value.
      
      This patch fixes the overflow by using strtoul_safe_clamp() to convert
      input string to an unsigned long value in range [0, UINT_MAX], then
      divides by 88 and set it to c->error_decay.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      a91fbda4
    • Coly Li's avatar
      bcache: fix input overflow to cache set io_error_limit · b1500840
      Coly Li authored
      c->error_limit is in type unsigned int, it is set via cache set sysfs
      file io_error_limit. Inside the bcache code, input string is converted
      by strtoul_or_return() and set the converted value to c->error_limit.
      
      Because the converted value is unsigned long, and c->error_limit is
      unsigned int, if the input is large enought, overflow will happen to
      c->error_limit.
      
      This patch uses sysfs_strtoul_clamp() to convert input string, and set
      the range in [0, UINT_MAX] to avoid the potential overflow.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      b1500840
    • Coly Li's avatar
      bcache: fix input overflow to journal_delay_ms · 453745fb
      Coly Li authored
      c->journal_delay_ms is in type unsigned short, it is set via sysfs
      interface and converted by sysfs_strtoul() from input string to
      unsigned short value. Therefore overflow to unsigned short might be
      happen when the converted value exceed USHRT_MAX. e.g. writing
      65536 into sysfs file journal_delay_ms, c->journal_delay_ms is set to
      0.
      
      This patch uses sysfs_strtoul_clamp() to convert the input string and
      limit value range in [0, USHRT_MAX], to avoid the input overflow.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      453745fb
    • Coly Li's avatar
      bcache: fix input overflow to writeback_rate_minimum · dab71b2d
      Coly Li authored
      dc->writeback_rate_minimum is type unsigned integer variable, it is set
      via sysfs interface, and converte from input string to unsigned integer
      by d_strtoul_nonzero(). When the converted input value is larger than
      UINT_MAX, overflow to unsigned integer happens.
      
      This patch fixes the overflow by using sysfs_strotoul_clamp() to
      convert input string and limit the value in range [1, UINT_MAX], then
      the overflow can be avoided.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      dab71b2d
    • Coly Li's avatar
      bcache: fix potential div-zero error of writeback_rate_p_term_inverse · 5b5fd3c9
      Coly Li authored
      Current code already uses d_strtoul_nonzero() to convert input string
      to an unsigned integer, to make sure writeback_rate_p_term_inverse
      won't be zero value. But overflow may happen when converting input
      string to an unsigned integer value by d_strtoul_nonzero(), then
      dc->writeback_rate_p_term_inverse can still be set to 0 even if the
      sysfs file input value is not zero, e.g. 4294967296 (a.k.a UINT_MAX+1).
      
      If dc->writeback_rate_p_term_inverse is set to 0, it might cause a
      dev-zero error in following code from __update_writeback_rate(),
      	int64_t proportional_scaled =
      		div_s64(error, dc->writeback_rate_p_term_inverse);
      
      This patch replaces d_strtoul_nonzero() by sysfs_strtoul_clamp() and
      limit the value range in [1, UINT_MAX]. Then the unsigned integer
      overflow and dev-zero error can be avoided.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      5b5fd3c9
    • Coly Li's avatar
      bcache: fix potential div-zero error of writeback_rate_i_term_inverse · c3b75a21
      Coly Li authored
      dc->writeback_rate_i_term_inverse can be set via sysfs interface. It is
      in type unsigned int, and convert from input string by d_strtoul(). The
      problem is d_strtoul() does not check valid range of the input, if
      4294967296 is written into sysfs file writeback_rate_i_term_inverse,
      an overflow of unsigned integer will happen and value 0 is set to
      dc->writeback_rate_i_term_inverse.
      
      In writeback.c:__update_writeback_rate(), there are following lines of
      code,
            integral_scaled = div_s64(dc->writeback_rate_integral,
                            dc->writeback_rate_i_term_inverse);
      If dc->writeback_rate_i_term_inverse is set to 0 via sysfs interface,
      a div-zero error might be triggered in the above code.
      
      Therefore we need to add a range limitation in the sysfs interface,
      this is what this patch does, use sysfs_stroul_clamp() to replace
      d_strtoul() and restrict the input range in [1, UINT_MAX].
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      c3b75a21
    • Coly Li's avatar
      bcache: fix input overflow to writeback_delay · 369d21a7
      Coly Li authored
      Sysfs file writeback_delay is used to configure dc->writeback_delay
      which is type unsigned int. But bcache code uses sysfs_strtoul() to
      convert the input string, therefore it might be overflowed if the input
      value is too large. E.g. input value is 4294967296 but indeed 0 is
      set to dc->writeback_delay.
      
      This patch uses sysfs_strtoul_clamp() to convert the input string and
      set the result value range in [0, UINT_MAX] to avoid such unsigned
      integer overflow.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      369d21a7
    • Coly Li's avatar
      bcache: use sysfs_strtoul_bool() to set bit-field variables · f5c0b95d
      Coly Li authored
      When setting bcache parameters via sysfs, there are some variables are
      defined as bit-field value. Current bcache code in sysfs.c uses either
      d_strtoul() or sysfs_strtoul() to convert the input string to unsigned
      integer value and set it to the corresponded bit-field value.
      
      The problem is, the bit-field value only takes the lowest bit of the
      converted value. If input is 2, the expected value (like bool value)
      of the bit-field value should be 1, but indeed it is 0.
      
      The following sysfs files for bit-field variables have such problem,
      	bypass_torture_test,	for dc->bypass_torture_test
      	writeback_metadata,	for dc->writeback_metadata
      	writeback_running,	for dc->writeback_running
      	verify,			for c->verify
      	key_merging_disabled,	for c->key_merging_disabled
      	gc_always_rewrite,	for c->gc_always_rewrite
      	btree_shrinker_disabled,for c->shrinker_disabled
      	copy_gc_enabled,	for c->copy_gc_enabled
      
      This patch uses sysfs_strtoul_bool() to set such bit-field variables,
      then if the converted value is non-zero, the bit-field variables will
      be set to 1, like setting a bool value like expensive_debug_checks.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f5c0b95d
    • Coly Li's avatar
      bcache: add sysfs_strtoul_bool() for setting bit-field variables · e4db37fb
      Coly Li authored
      When setting bool values via sysfs interface, e.g. writeback_metadata,
      if writing 1 into writeback_metadata file, dc->writeback_metadata is
      set to 1, but if writing 2 into the file, dc->writeback_metadata is
      0. This is misleading, a better result should be 1 for all non-zero
      input value.
      
      It is because dc->writeback_metadata is a bit-field variable, and
      current code simply use d_strtoul() to convert a string into integer
      and takes the lowest bit value. To fix such error, we need a routine
      to convert the input string into unsigned integer, and set target
      variable to 1 if the converted integer is non-zero.
      
      This patch introduces a new macro called sysfs_strtoul_bool(), it can
      be used to convert input string into bool value, we can use it to set
      bool value for bit-field vairables.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e4db37fb
    • Coly Li's avatar
      bcache: fix input overflow to sequential_cutoff · 8c27a395
      Coly Li authored
      People may set sequential_cutoff of a cached device via sysfs file,
      but current code does not check input value overflow. E.g. if value
      4294967295 (UINT_MAX) is written to file sequential_cutoff, its value
      is 4GB, but if 4294967296 (UINT_MAX + 1) is written into, its value
      will be 0. This is an unexpected behavior.
      
      This patch replaces d_strtoi_h() by sysfs_strtoul_clamp() to convert
      input string to unsigned integer value, and limit its range in
      [0, UINT_MAX]. Then the input overflow can be fixed.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      8c27a395
    • Coly Li's avatar
      bcache: fix input integer overflow of congested threshold · f54478c6
      Coly Li authored
      Cache set congested threshold values congested_read_threshold_us and
      congested_write_threshold_us can be set via sysfs interface. These
      two values are 'unsigned int' type, but sysfs interface uses strtoul
      to convert input string. So if people input a large number like
      9999999999, the value indeed set is 1410065407, which is not expected
      behavior.
      
      This patch replaces sysfs_strtoul() by sysfs_strtoul_clamp() when
      convert input string to unsigned int value, and set value range in
      [0, UINT_MAX], to avoid the above integer overflow errors.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      f54478c6
    • Coly Li's avatar
      bcache: improve sysfs_strtoul_clamp() · 596b5a5d
      Coly Li authored
      Currently sysfs_strtoul_clamp() is defined as,
       82 #define sysfs_strtoul_clamp(file, var, min, max)                   \
       83 do {                                                               \
       84         if (attr == &sysfs_ ## file)                               \
       85                 return strtoul_safe_clamp(buf, var, min, max)      \
       86                         ?: (ssize_t) size;                         \
       87 } while (0)
      
      The problem is, if bit width of var is less then unsigned long, min and
      max may not protect var from integer overflow, because overflow happens
      in strtoul_safe_clamp() before checking min and max.
      
      To fix such overflow in sysfs_strtoul_clamp(), to make min and max take
      effect, this patch adds an unsigned long variable, and uses it to macro
      strtoul_safe_clamp() to convert an unsigned long value in range defined
      by [min, max]. Then assign this value to var. By this method, if bit
      width of var is less than unsigned long, integer overflow won't happen
      before min and max are checking.
      
      Now sysfs_strtoul_clamp() can properly handle smaller data type like
      unsigned int, of cause min and max should be defined in range of
      unsigned int too.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      596b5a5d
    • Tang Junhui's avatar
      bcache: treat stale && dirty keys as bad keys · 58ac3230
      Tang Junhui authored
      Stale && dirty keys can be produced in the follow way:
      After writeback in write_dirty_finish(), dirty keys k1 will
      replace by clean keys k2
      ==>ret = bch_btree_insert(dc->disk.c, &keys, NULL, &w->key);
      ==>btree_insert_fn(struct btree_op *b_op, struct btree *b)
      ==>static int bch_btree_insert_node(struct btree *b,
             struct btree_op *op,
             struct keylist *insert_keys,
             atomic_t *journal_ref,
      Then two steps:
      A) update k1 to k2 in btree node memory;
         bch_btree_insert_keys(b, op, insert_keys, replace_key)
      B) Write the bset(contains k2) to cache disk by a 30s delay work
         bch_btree_leaf_dirty(b, journal_ref).
      But before the 30s delay work write the bset to cache device,
      these things happened:
      A) GC works, and reclaim the bucket k2 point to;
      B) Allocator works, and invalidate the bucket k2 point to,
         and increase the gen of the bucket, and place it into free_inc
         fifo;
      C) Until now, the 30s delay work still does not finish work,
         so in the disk, the key still is k1, it is dirty and stale
         (its gen is smaller than the gen of the bucket). and then the
         machine power off suddenly happens;
      D) When the machine power on again, after the btree reconstruction,
         the stale dirty key appear.
      
      In bch_extent_bad(), when expensive_debug_checks is off, it would
      treat the dirty key as good even it is stale keys, and it would
      cause bellow probelms:
      A) In read_dirty() it would cause machine crash:
         BUG_ON(ptr_stale(dc->disk.c, &w->key, 0));
      B) It could be worse when reads hits stale dirty keys, it would
         read old incorrect data.
      
      This patch tolerate the existence of these stale && dirty keys,
      and treat them as bad key in bch_extent_bad().
      
      (Coly Li: fix indent which was modified by sender's email client)
      Signed-off-by: default avatarTang Junhui <tang.junhui.linux@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      58ac3230
    • Colin Ian King's avatar
      bcache: fix indentation issue, remove tabs on a hunk of code · e8cf978d
      Colin Ian King authored
      There is a hunk of code that is indented one level too deep, fix this
      by removing the extra tabs.
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      e8cf978d
    • Coly Li's avatar
      bcache: export backing_dev_uuid via sysfs · d4610456
      Coly Li authored
      When there are multiple bcache devices, after a reboot the name of
      bcache devices may change (e.g. current /dev/bcache1 was /dev/bcache0
      before reboot). Therefore we need the backing device UUID (sb.uuid) to
      identify each bcache device.
      
      Backing device uuid can be found by program bcache-super-show, but
      directly exporting backing_dev_uuid by sysfs file
      /sys/block/bcache<?>/bcache/backing_dev_uuid is a much simpler method.
      
      With backing_dev_uuid, and partition uuids from /dev/disk/by-partuuid/,
      now we can identify each bcache device and its partitions conveniently.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      d4610456
    • Coly Li's avatar
      bcache: export backing_dev_name via sysfs · 926d1946
      Coly Li authored
      This patch export dc->backing_dev_name to sysfs file
      /sys/block/bcache<?>/bcache/backing_dev_name, then people or user space
      tools may know the backing device name of this bcache device.
      
      Of cause it can be done by parsing sysfs links, but this method can be
      much simpler to find the link between bcache device and backing device.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      926d1946
    • Coly Li's avatar
      bcache: not use hard coded memset size in bch_cache_accounting_clear() · 83ff9318
      Coly Li authored
      In stats.c:bch_cache_accounting_clear(), a hard coded number '7' is
      used in memset(). It is because in struct cache_stats, there are 7
      atomic_t type members. This is not good when new members added into
      struct stats, the hard coded number will only clear part of memory.
      
      This patch replaces 'sizeof(unsigned long) * 7' by more generic
      'sizeof(struct cache_stats))', to avoid potential error if new
      member added into struct cache_stats.
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      83ff9318
    • Daniel Axtens's avatar
      bcache: never writeback a discard operation · 9951379b
      Daniel Axtens authored
      Some users see panics like the following when performing fstrim on a
      bcached volume:
      
      [  529.803060] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      [  530.183928] #PF error: [normal kernel read fault]
      [  530.412392] PGD 8000001f42163067 P4D 8000001f42163067 PUD 1f42168067 PMD 0
      [  530.750887] Oops: 0000 [#1] SMP PTI
      [  530.920869] CPU: 10 PID: 4167 Comm: fstrim Kdump: loaded Not tainted 5.0.0-rc1+ #3
      [  531.290204] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
      [  531.693137] RIP: 0010:blk_queue_split+0x148/0x620
      [  531.922205] Code: 60 38 89 55 a0 45 31 db 45 31 f6 45 31 c9 31 ff 89 4d 98 85 db 0f 84 7f 04 00 00 44 8b 6d 98 4c 89 ee 48 c1 e6 04 49 03 70 78 <8b> 46 08 44 8b 56 0c 48
      8b 16 44 29 e0 39 d8 48 89 55 a8 0f 47 c3
      [  532.838634] RSP: 0018:ffffb9b708df39b0 EFLAGS: 00010246
      [  533.093571] RAX: 00000000ffffffff RBX: 0000000000046000 RCX: 0000000000000000
      [  533.441865] RDX: 0000000000000200 RSI: 0000000000000000 RDI: 0000000000000000
      [  533.789922] RBP: ffffb9b708df3a48 R08: ffff940d3b3fdd20 R09: 0000000000000000
      [  534.137512] R10: ffffb9b708df3958 R11: 0000000000000000 R12: 0000000000000000
      [  534.485329] R13: 0000000000000000 R14: 0000000000000000 R15: ffff940d39212020
      [  534.833319] FS:  00007efec26e3840(0000) GS:ffff940d1f480000(0000) knlGS:0000000000000000
      [  535.224098] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  535.504318] CR2: 0000000000000008 CR3: 0000001f4e256004 CR4: 00000000001606e0
      [  535.851759] Call Trace:
      [  535.970308]  ? mempool_alloc_slab+0x15/0x20
      [  536.174152]  ? bch_data_insert+0x42/0xd0 [bcache]
      [  536.403399]  blk_mq_make_request+0x97/0x4f0
      [  536.607036]  generic_make_request+0x1e2/0x410
      [  536.819164]  submit_bio+0x73/0x150
      [  536.980168]  ? submit_bio+0x73/0x150
      [  537.149731]  ? bio_associate_blkg_from_css+0x3b/0x60
      [  537.391595]  ? _cond_resched+0x1a/0x50
      [  537.573774]  submit_bio_wait+0x59/0x90
      [  537.756105]  blkdev_issue_discard+0x80/0xd0
      [  537.959590]  ext4_trim_fs+0x4a9/0x9e0
      [  538.137636]  ? ext4_trim_fs+0x4a9/0x9e0
      [  538.324087]  ext4_ioctl+0xea4/0x1530
      [  538.497712]  ? _copy_to_user+0x2a/0x40
      [  538.679632]  do_vfs_ioctl+0xa6/0x600
      [  538.853127]  ? __do_sys_newfstat+0x44/0x70
      [  539.051951]  ksys_ioctl+0x6d/0x80
      [  539.212785]  __x64_sys_ioctl+0x1a/0x20
      [  539.394918]  do_syscall_64+0x5a/0x110
      [  539.568674]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      We have observed it where both:
      1) LVM/devmapper is involved (bcache backing device is LVM volume) and
      2) writeback cache is involved (bcache cache_mode is writeback)
      
      On one machine, we can reliably reproduce it with:
      
       # echo writeback > /sys/block/bcache0/bcache/cache_mode
         (not sure whether above line is required)
       # mount /dev/bcache0 /test
       # for i in {0..10}; do
      	file="$(mktemp /test/zero.XXX)"
      	dd if=/dev/zero of="$file" bs=1M count=256
      	sync
      	rm $file
          done
        # fstrim -v /test
      
      Observing this with tracepoints on, we see the following writes:
      
      fstrim-18019 [022] .... 91107.302026: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4260112 + 196352 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302050: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4456464 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302075: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 4718608 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302094: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5324816 + 180224 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302121: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5505040 + 262144 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.302145: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 5767184 + 81920 hit 0 bypass 1
      fstrim-18019 [022] .... 91107.308777: bcache_write: 73f95583-561c-408f-a93a-4cbd2498f5c8 inode 0  DS 6373392 + 180224 hit 1 bypass 0
      <crash>
      
      Note the final one has different hit/bypass flags.
      
      This is because in should_writeback(), we were hitting a case where
      the partial stripe condition was returning true and so
      should_writeback() was returning true early.
      
      If that hadn't been the case, it would have hit the would_skip test, and
      as would_skip == s->iop.bypass == true, should_writeback() would have
      returned false.
      
      Looking at the git history from 'commit 72c27061 ("bcache: Write out
      full stripes")', it looks like the idea was to optimise for raid5/6:
      
             * If a stripe is already dirty, force writes to that stripe to
      	 writeback mode - to help build up full stripes of dirty data
      
      To fix this issue, make sure that should_writeback() on a discard op
      never returns true.
      
      More details of debugging:
      https://www.spinics.net/lists/linux-bcache/msg06996.html
      
      Previous reports:
       - https://bugzilla.kernel.org/show_bug.cgi?id=201051
       - https://bugzilla.kernel.org/show_bug.cgi?id=196103
       - https://www.spinics.net/lists/linux-bcache/msg06885.html
      
      (Coly Li: minor modification to follow maximum 75 chars per line rule)
      
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 72c27061 ("bcache: Write out full stripes")
      Signed-off-by: default avatarDaniel Axtens <dja@axtens.net>
      Signed-off-by: default avatarColy Li <colyli@suse.de>
      Signed-off-by: default avatarJens Axboe <axboe@kernel.dk>
      9951379b
  5. 08 Feb, 2019 1 commit