1. 11 Sep, 2019 1 commit
    • Ming Lei's avatar
      dm raid: fix updating of max_discard_sectors limit · c8156fc7
      Ming Lei authored
      Unit of 'chunk_size' is byte, instead of sector, so fix it by setting
      the queue_limits' max_discard_sectors to rs->md.chunk_sectors.  Also,
      rename chunk_size to chunk_size_bytes.
      
      Without this fix, too big max_discard_sectors is applied on the request
      queue of dm-raid, finally raid code has to split the bio again.
      
      This re-split done by raid causes the following nested clone_endio:
      
      1) one big bio 'A' is submitted to dm queue, and served as the original
      bio
      
      2) one new bio 'B' is cloned from the original bio 'A', and .map()
      is run on this bio of 'B', and B's original bio points to 'A'
      
      3) raid code sees that 'B' is too big, and split 'B' and re-submit
      the remainded part of 'B' to dm-raid queue via generic_make_request().
      
      4) now dm will handle 'B' as new original bio, then allocate a new
      clone bio of 'C' and run .map() on 'C'. Meantime C's original bio
      points to 'B'.
      
      5) suppose now 'C' is completed by raid directly, then the following
      clone_endio() is called recursively:
      
      	clone_endio(C)
      		->clone_endio(B)		#B is original bio of 'C'
      			->bio_endio(A)
      
      'A' can be big enough to make hundreds of nested clone_endio(), then
      stack can be corrupted easily.
      
      Fixes: 61697a6a ("dm: eliminate 'split_discard_bios' flag from DM target interface")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMing Lei <ming.lei@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      c8156fc7
  2. 05 Sep, 2019 1 commit
  3. 04 Sep, 2019 1 commit
    • Gustavo A. R. Silva's avatar
      dm stats: use struct_size() helper · fb16c799
      Gustavo A. R. Silva authored
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct dm_stat {
      	...
              struct dm_stat_shared stat_shared[0];
      };
      
      Make use of the struct_size() helper instead of an open-coded version
      in order to avoid any potential type mistakes.
      
      So, replace the following form:
      
      sizeof(struct dm_stat) + (size_t)n_entries * sizeof(struct dm_stat_shared)
      
      with:
      
      struct_size(s, stat_shared, n_entries)
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      fb16c799
  4. 03 Sep, 2019 3 commits
  5. 26 Aug, 2019 6 commits
    • ZhangXiaoxu's avatar
      dm space map common: remove check for impossible sm_find_free() return value · c1499a04
      ZhangXiaoxu authored
      The function sm_find_free() just returns -ENOSPC and 0.
      So remove lone caller's check for some other error.
      Signed-off-by: default avatarZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      c1499a04
    • Gustavo A. R. Silva's avatar
      dm raid1: use struct_size() with kzalloc() · bcd67654
      Gustavo A. R. Silva authored
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct mirror_set {
      	...
              struct mirror mirror[0];
      };
      
      size = sizeof(struct mirror_set) + count * sizeof(struct mirror);
      instance = kzalloc(size, GFP_KERNEL)
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kzalloc(struct_size(instance, mirror, count), GFP_KERNEL)
      
      Notice that, in this case, variable len is not necessary, hence it
      is removed.
      
      This code was detected with the help of Coccinelle.
      Signed-off-by: default avatarGustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      bcd67654
    • Huaisheng Ye's avatar
      dm writecache: optimize performance by sorting the blocks for writeback_all · 5229b489
      Huaisheng Ye authored
      During the process of writeback, the blocks, which have been placed in wbl.list
      for writeback soon, are partially ordered for the contiguous ones.
      
      When writeback_all has been set, for most cases, also by default, there will be
      a lot of blocks in pmem need to writeback at the same time.
      For this case, we could optimize the performance by sorting all blocks in
      wbl.list. writecache_writeback doesn't need to get blocks from the tail of
      wc->lru, whereas from the first rb_node from the rb_tree.
      
      The benefit is that, writecache_writeback doesn't need to have any cost to sort
      the blocks, because of all blocks are incremental originally in rb_tree.
      There will be a writecache_flush when writeback_all begins to work, that will
      eliminate duplicate blocks in cache by committed/uncommitted.
      
      Testing platform: Thinksystem SR630 with persistent memory.
      The cache comes from pmem, which has 1006MB size. The origin device is HDD, 2GB
      of which for using.
      
      Testing steps:
       1) dmsetup create mycache --table '0 4194304 writecache p /dev/sdb1 /dev/pmem4  4096 0'
       2) fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
       -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
       3) time dmsetup message /dev/mapper/mycache 0 flush
      
      Here is the results below,
      With the patch:
       # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
       -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
         iops        : min= 1582, max=199470, avg=5305.94, stdev=21273.44, samples=197
       # time dmsetup message /dev/mapper/mycache 0 flush
      real	0m44.020s
      user	0m0.002s
      sys	0m0.003s
      
      Without the patch:
       # fio -filename=/dev/mapper/mycache -direct=1 -iodepth=20 -rw=randwrite
       -ioengine=libaio -bs=4k -loops=1  -size=2g -group_reporting -name=mytest1
         iops        : min= 1202, max=197650, avg=4968.67, stdev=20480.17, samples=211
       # time dmsetup message /dev/mapper/mycache 0 flush
      real	1m39.221s
      user	0m0.001s
      sys	0m0.003s
      
      I also have checked the data accuracy with this patch by making EXT4 filesystem
      on mycache, then mount it for checking md5 of files on that.
      The test result is positive, with this patch it could save more than half of time
      when writeback_all.
      Signed-off-by: default avatarHuaisheng Ye <yehs1@lenovo.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      5229b489
    • Huaisheng Ye's avatar
      dm writecache: add unlikely for getting two block with same LBA · 62421b38
      Huaisheng Ye authored
      In function writecache_writeback, entries g and f has same original
      sector only happens at entry f has been committed, but entry g has
      NOT yet.
      
      The probability of this happening is very low in the following
      256 blocks at most of entry e.
      Signed-off-by: default avatarHuaisheng Ye <yehs1@lenovo.com>
      Acked-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      62421b38
    • Huaisheng Ye's avatar
      dm writecache: remove unused member pointer in writeback_struct · 58912dbc
      Huaisheng Ye authored
      The stucture member pointer page in writeback_struct never has been
      used actually. Remove it.
      Signed-off-by: default avatarHuaisheng Ye <yehs1@lenovo.com>
      Acked-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      58912dbc
    • Mikulas Patocka's avatar
      dm zoned: fix invalid memory access · 0c8e9c2d
      Mikulas Patocka authored
      Commit 75d66ffb ("dm zoned: properly
      handle backing device failure") triggers a coverity warning:
      
      *** CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
      /drivers/md/dm-zoned-target.c: 137 in dmz_submit_bio()
      131             clone->bi_private = bioctx;
      132
      133             bio_advance(bio, clone->bi_iter.bi_size);
      134
      135             refcount_inc(&bioctx->ref);
      136             generic_make_request(clone);
      >>>     CID 1452808:  Memory - illegal accesses  (USE_AFTER_FREE)
      >>>     Dereferencing freed pointer "clone".
      137             if (clone->bi_status == BLK_STS_IOERR)
      138                     return -EIO;
      139
      140             if (bio_op(bio) == REQ_OP_WRITE && dmz_is_seq(zone))
      141                     zone->wp_block += nr_blocks;
      142
      
      The "clone" bio may be processed and freed before the check
      "clone->bi_status == BLK_STS_IOERR" - so this check can access invalid
      memory.
      
      Fixes: 75d66ffb ("dm zoned: properly handle backing device failure")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      0c8e9c2d
  6. 23 Aug, 2019 4 commits
    • Jaskaran Khurana's avatar
      dm verity: add root hash pkcs#7 signature verification · 88cd3e6c
      Jaskaran Khurana authored
      The verification is to support cases where the root hash is not secured
      by Trusted Boot, UEFI Secureboot or similar technologies.
      
      One of the use cases for this is for dm-verity volumes mounted after
      boot, the root hash provided during the creation of the dm-verity volume
      has to be secure and thus in-kernel validation implemented here will be
      used before we trust the root hash and allow the block device to be
      created.
      
      The signature being provided for verification must verify the root hash
      and must be trusted by the builtin keyring for verification to succeed.
      
      The hash is added as a key of type "user" and the description is passed
      to the kernel so it can look it up and use it for verification.
      
      Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root
      hash verification is needed.
      
      Kernel commandline dm_verity module parameter 'require_signatures' will
      indicate whether to force root hash signature verification (for all dm
      verity volumes).
      Signed-off-by: default avatarJaskaran Khurana <jaskarankhurana@linux.microsoft.com>
      Tested-and-Reviewed-by: default avatarMilan Broz <gmazyland@gmail.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      88cd3e6c
    • Ard Biesheuvel's avatar
      dm crypt: reuse eboiv skcipher for IV generation · 39d13a1a
      Ard Biesheuvel authored
      Instead of instantiating a separate cipher to perform the encryption
      needed to produce the IV, reuse the skcipher used for the block data
      and invoke it one additional time for each block to encrypt a zero
      vector and use the output as the IV.
      
      For CBC mode, this is equivalent to using the bare block cipher, but
      without the risk of ending up with a non-time invariant implementation
      of AES when the skcipher itself is time variant (e.g., arm64 without
      Crypto Extensions has a NEON based time invariant implementation of
      cbc(aes) but no time invariant implementation of the core cipher other
      than aes-ti, which is not enabled by default).
      
      This approach is a compromise between dm-crypt API flexibility and
      reducing dependence on parts of the crypto API that should not usually
      be exposed to other subsystems, such as the bare cipher API.
      Signed-off-by: default avatarArd Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: default avatarMilan Broz <gmazyland@gmail.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      39d13a1a
    • Mikulas Patocka's avatar
      dm: make dm_table_find_target return NULL · 123d87d5
      Mikulas Patocka authored
      Currently, if we pass too high sector number to dm_table_find_target, it
      returns zeroed dm_target structure and callers test if the structure is
      zeroed with the macro dm_target_is_valid.
      
      However, returning NULL is common practice to indicate errors.
      
      This patch refactors the dm code, so that dm_table_find_target returns
      NULL and its callers test the returned value for NULL. The macro
      dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an
      extra zeroed target.
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      123d87d5
    • Mikulas Patocka's avatar
      dm table: fix invalid memory accesses with too high sector number · 1cfd5d33
      Mikulas Patocka authored
      If the sector number is too high, dm_table_find_target() should return a
      pointer to a zeroed dm_target structure (the caller should test it with
      dm_target_is_valid).
      
      However, for some table sizes, the code in dm_table_find_target() that
      performs btree lookup will access out of bound memory structures.
      
      Fix this bug by testing the sector number at the beginning of
      dm_table_find_target(). Also, add an "inline" keyword to the function
      dm_table_get_size() because this is a hot path.
      
      Fixes: 512875bd ("dm: table detect io beyond device")
      Cc: stable@vger.kernel.org
      Reported-by: default avatarZhang Tao <kontais@zoho.com>
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      1cfd5d33
  7. 22 Aug, 2019 2 commits
    • ZhangXiaoxu's avatar
      dm space map metadata: fix missing store of apply_bops() return value · ae148243
      ZhangXiaoxu authored
      In commit 6096d91a ("dm space map metadata: fix occasional leak
      of a metadata block on resize"), we refactor the commit logic to a new
      function 'apply_bops'.  But when that logic was replaced in out() the
      return value was not stored.  This may lead out() returning a wrong
      value to the caller.
      
      Fixes: 6096d91a ("dm space map metadata: fix occasional leak of a metadata block on resize")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      ae148243
    • ZhangXiaoxu's avatar
      dm btree: fix order of block initialization in btree_split_beneath · e4f9d601
      ZhangXiaoxu authored
      When btree_split_beneath() splits a node to two new children, it will
      allocate two blocks: left and right.  If right block's allocation
      failed, the left block will be unlocked and marked dirty.  If this
      happened, the left block'ss content is zero, because it wasn't
      initialized with the btree struct before the attempot to allocate the
      right block.  Upon return, when flushing the left block to disk, the
      validator will fail when check this block.  Then a BUG_ON is raised.
      
      Fix this by completely initializing the left block before allocating and
      initializing the right block.
      
      Fixes: 4dcb8b57 ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      e4f9d601
  8. 21 Aug, 2019 3 commits
  9. 15 Aug, 2019 8 commits
    • Mikulas Patocka's avatar
      dm integrity: fix a crash due to BUG_ON in __journal_read_write() · 5729b6e5
      Mikulas Patocka authored
      Fix a crash that was introduced by the commit 724376a0. The crash is
      reported here: https://gitlab.com/cryptsetup/cryptsetup/issues/468
      
      When reading from the integrity device, the function
      dm_integrity_map_continue calls find_journal_node to find out if the
      location to read is present in the journal. Then, it calculates how many
      sectors are consecutively stored in the journal. Then, it locks the range
      with add_new_range and wait_and_add_new_range.
      
      The problem is that during wait_and_add_new_range, we hold no locks (we
      don't hold ic->endio_wait.lock and we don't hold a range lock), so the
      journal may change arbitrarily while wait_and_add_new_range sleeps.
      
      The code then goes to __journal_read_write and hits
      BUG_ON(journal_entry_get_sector(je) != logical_sector); because the
      journal has changed.
      
      In order to fix this bug, we need to re-check the journal location after
      wait_and_add_new_range. We restrict the length to one block in order to
      not complicate the code too much.
      
      Fixes: 724376a0 ("dm integrity: implement fair range locks")
      Cc: stable@vger.kernel.org # v4.19+
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      5729b6e5
    • Dmitry Fomichev's avatar
      ad1bd578
    • Dmitry Fomichev's avatar
    • Dmitry Fomichev's avatar
      dm zoned: properly handle backing device failure · 75d66ffb
      Dmitry Fomichev authored
      dm-zoned is observed to lock up or livelock in case of hardware
      failure or some misconfiguration of the backing zoned device.
      
      This patch adds a new dm-zoned target function that checks the status of
      the backing device. If the request queue of the backing device is found
      to be in dying state or the SCSI backing device enters offline state,
      the health check code sets a dm-zoned target flag prompting all further
      incoming I/O to be rejected. In order to detect backing device failures
      timely, this new function is called in the request mapping path, at the
      beginning of every reclaim run and before performing any metadata I/O.
      
      The proper way out of this situation is to do
      
      dmsetup remove <dm-zoned target>
      
      and recreate the target when the problem with the backing device
      is resolved.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      75d66ffb
    • Dmitry Fomichev's avatar
      dm zoned: improve error handling in i/o map code · d7428c50
      Dmitry Fomichev authored
      Some errors are ignored in the I/O path during queueing chunks
      for processing by chunk works. Since at least these errors are
      transient in nature, it should be possible to retry the failed
      incoming commands.
      
      The fix -
      
      Errors that can happen while queueing chunks are carried upwards
      to the main mapping function and it now returns DM_MAPIO_REQUEUE
      for any incoming requests that can not be properly queued.
      
      Error logging/debug messages are added where needed.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      d7428c50
    • Dmitry Fomichev's avatar
      dm zoned: improve error handling in reclaim · b234c6d7
      Dmitry Fomichev authored
      There are several places in reclaim code where errors are not
      propagated to the main function, dmz_reclaim(). This function
      is responsible for unlocking zones that might be still locked
      at the end of any failed reclaim iterations. As the result,
      some device zones may be left permanently locked for reclaim,
      degrading target's capability to reclaim zones.
      
      This patch fixes these issues as follows -
      
      Make sure that dmz_reclaim_buf(), dmz_reclaim_seq_data() and
      dmz_reclaim_rnd_data() return error codes to the caller.
      
      dmz_reclaim() function is renamed to dmz_do_reclaim() to avoid
      clashing with "struct dmz_reclaim" and is modified to return the
      error to the caller.
      
      dmz_get_zone_for_reclaim() now returns an error instead of NULL
      pointer and reclaim code checks for that error.
      
      Error logging/debug messages are added where necessary.
      
      Fixes: 3b1a94c8 ("dm zoned: drive-managed zoned block device target")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      b234c6d7
    • Dmitry Fomichev's avatar
      dm kcopyd: always complete failed jobs · d1fef414
      Dmitry Fomichev authored
      This patch fixes a problem in dm-kcopyd that may leave jobs in
      complete queue indefinitely in the event of backing storage failure.
      
      This behavior has been observed while running 100% write file fio
      workload against an XFS volume created on top of a dm-zoned target
      device. If the underlying storage of dm-zoned goes to offline state
      under I/O, kcopyd sometimes never issues the end copy callback and
      dm-zoned reclaim work hangs indefinitely waiting for that completion.
      
      This behavior was traced down to the error handling code in
      process_jobs() function that places the failed job to complete_jobs
      queue, but doesn't wake up the job handler. In case of backing device
      failure, all outstanding jobs may end up going to complete_jobs queue
      via this code path and then stay there forever because there are no
      more successful I/O jobs to wake up the job handler.
      
      This patch adds a wake() call to always wake up kcopyd job wait queue
      for all I/O jobs that fail before dm_io() gets called for that job.
      
      The patch also sets the write error status in all sub jobs that are
      failed because their master job has failed.
      
      Fixes: b73c67c2 ("dm kcopyd: add sequential write feature")
      Cc: stable@vger.kernel.org
      Signed-off-by: default avatarDmitry Fomichev <dmitry.fomichev@wdc.com>
      Reviewed-by: default avatarDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      d1fef414
    • Mikulas Patocka's avatar
      Revert "dm bufio: fix deadlock with loop device" · cf3591ef
      Mikulas Patocka authored
      Revert the commit bd293d07. The proper
      fix has been made available with commit d0a255e7 ("loop: set
      PF_MEMALLOC_NOIO for the worker thread").
      
      Note that the fix offered by commit bd293d07 doesn't really prevent
      the deadlock from occuring - if we look at the stacktrace reported by
      Junxiao Bi, we see that it hangs in bit_wait_io and not on the mutex -
      i.e. it has already successfully taken the mutex. Changing the mutex
      from mutex_lock to mutex_trylock won't help with deadlocks that happen
      afterwards.
      
      PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
         #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
         #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
         #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
         #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
         #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
         #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
         #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
         #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
         #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
         #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
        #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
        #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
        #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
        #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
        #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
      Signed-off-by: default avatarMikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: bd293d07 ("dm bufio: fix deadlock with loop device")
      Depends-on: d0a255e7 ("loop: set PF_MEMALLOC_NOIO for the worker thread")
      Signed-off-by: default avatarMike Snitzer <snitzer@redhat.com>
      cf3591ef
  10. 11 Aug, 2019 3 commits
  11. 10 Aug, 2019 8 commits