1. 16 Aug, 2018 1 commit
  2. 13 Aug, 2018 1 commit
  3. 09 Aug, 2018 2 commits
    • dm cache metadata: set dirty on all cache blocks after a crash · 5b1fe7be
      Ilya Dryomov authored
      Quoting Documentation/device-mapper/cache.txt:
      
        The 'dirty' state for a cache block changes far too frequently for us
        to keep updating it on the fly.  So we treat it as a hint.  In normal
        operation it will be written when the dm device is suspended.  If the
        system crashes all cache blocks will be assumed dirty when restarted.
      
      This got broken in commit f177940a ("dm cache metadata: switch to
      using the new cursor api for loading metadata") in 4.9, which removed
      the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
      flag) when loading cache blocks.  This results in data corruption on an
      unclean shutdown with dirty cache blocks on the fast device.  After the
      crash those blocks are considered clean and may get evicted from the
      cache at any time.  This can be demonstrated by doing a lot of reads
      to trigger individual evictions, but uncache is more predictable:
      
        ### Disable auto-activation in lvm.conf to be able to do uncache in
        ### time (i.e. see uncache doing flushing) when the fix is applied.
      
        # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
        # vgcreate vg_cache /dev/vdb /dev/vdc
        # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
        # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
        # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
        # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
        # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
        # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # dmsetup status vg_cache-lv_slowdev
        0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
                                                                  ^^^^
                                      7065 * 64k = 441M yet to be written to the slow device
        # echo b >/proc/sysrq-trigger
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 0 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
        0fe00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
      
      This is the case with both v1 and v2 cache pool metadata formats.
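The "441M yet to be written" figure annotated under the `dmsetup status` output above can be checked with a short script. This is a minimal sketch that assumes the field order documented for the cache target's status line (cache block size in 512-byte sectors as the sixth field, dirty block count as the fourteenth); the helper name `dirty_bytes` is hypothetical.

```python
SECTOR_BYTES = 512  # device-mapper sizes are expressed in 512-byte sectors


def dirty_bytes(status_line):
    """Return the number of dirty bytes reported by a dm-cache status line."""
    fields = status_line.split()
    # fields[0:3] are "<start> <length> cache"; the remaining positions follow
    # the layout described in cache.txt (assumed here, not derived from code).
    cache_block_sectors = int(fields[5])   # 128 sectors = 64k per cache block
    nr_dirty = int(fields[13])             # number of dirty cache blocks
    return nr_dirty * cache_block_sectors * SECTOR_BYTES


status = ("0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 "
          "2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -")
print(dirty_bytes(status) // (1024 * 1024), "MiB dirty")  # prints "441 MiB dirty"
```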
      
      After applying this patch:
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 3724 blocks for cache vg_cache/lv_slowdev.
        ...
        Flushing 71 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
      
      Cc: stable@vger.kernel.org
      Fixes: f177940a ("dm cache metadata: switch to using the new cursor api for loading metadata")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
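The rule quoted from cache.txt in the commit above (dirty bits are only a hint, and every block is assumed dirty after a crash) can be modelled in a few lines. This is a user-space sketch, not the kernel code: `load_dirty_flags` and its arguments are hypothetical names, with `clean_shutdown` standing in for the CLEAN_SHUTDOWN on-disk flag.

```python
def load_dirty_flags(on_disk_dirty, clean_shutdown):
    """Return the dirty flag to use for each cache block at load time."""
    if not clean_shutdown:
        # The stored bits were last written at suspend time and may be stale,
        # so every block must be treated as dirty after a crash.
        return [True] * len(on_disk_dirty)
    # After a clean shutdown the stored bits can be trusted as written.
    return list(on_disk_dirty)


stale_hints = [False, True, False, False]
print(load_dirty_flags(stale_hints, clean_shutdown=True))
print(load_dirty_flags(stale_hints, clean_shutdown=False))
```

With `clean_shutdown=False` every block comes back dirty, which is why the patched kernel flushes blocks during `lvconvert --uncache` instead of reporting 0 blocks to flush.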
    • dm snapshot: remove stale FIXME in snapshot_map() · c9a5e6a9
      Mike Snitzer authored
      Commit ae1093be ("dm snapshot: use mutex instead of rw_semaphore")
      eliminated the need to worry about read vs write locking.  So remove a
      FIXME in snapshot_map() that is concerned about selectively taking a
      write lock.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 08 Aug, 2018 2 commits
    • dm snapshot: improve performance by switching out_of_order_list to rbtree · 3db2776d
      David Jeffery authored
      copy_complete()'s processing of out_of_order_list can result in
      quadratic complexity in the worst case.  As such, it consumed
      excessive CPU time and caused a significant loss in performance.
      
      Fix this by converting out_of_order_list to an rbtree.  This improved
      a dm-snapshot test copy workload from 32 seconds to 4 seconds.
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Brett Hull <bhull@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
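The commit above swaps a sorted list for an ordered tree so that each out-of-order completion costs logarithmic rather than linear time. The sketch below models the same idea in user space, assuming completions carry sequence numbers and must be committed in sequence order. It uses a binary heap from Python's standard library as a stand-in for the kernel rbtree, and the class and attribute names are hypothetical.

```python
import heapq


class InOrderCompleter:
    """Commit completions in sequence order, parking early arrivals."""

    def __init__(self):
        self.next_seq = 0     # next sequence number allowed to commit
        self.pending = []     # min-heap of sequence numbers that arrived early
        self.committed = []   # commit order, for inspection

    def complete(self, seq):
        if seq != self.next_seq:
            # Out of order: O(log n) insertion instead of walking a list.
            heapq.heappush(self.pending, seq)
            return
        self.committed.append(seq)
        self.next_seq += 1
        # Drain any parked completions that are now next in line.
        while self.pending and self.pending[0] == self.next_seq:
            self.committed.append(heapq.heappop(self.pending))
            self.next_seq += 1


completer = InOrderCompleter()
for seq in (3, 1, 2, 0):
    completer.complete(seq)
print(completer.committed)  # prints [0, 1, 2, 3]
```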
    • dm kcopyd: avoid softlockup in run_complete_job · 784c9a29
      John Pittman authored
      It was reported that softlockups occur when using dm-snapshot on top of
      slow (rbd) storage.  E.g.:
      
      [ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
      ...
      [ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
      [ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
      ...
      [ 4048.034190] Call Trace:
      [ 4048.034196]  ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
      [ 4048.034200]  run_complete_job+0x5f/0xb0 [dm_mod]
      [ 4048.034205]  process_jobs+0x91/0x220 [dm_mod]
      [ 4048.034210]  ? kcopyd_put_pages+0x40/0x40 [dm_mod]
      [ 4048.034214]  do_work+0x46/0xa0 [dm_mod]
      [ 4048.034219]  process_one_work+0x171/0x370
      [ 4048.034221]  worker_thread+0x1fc/0x3f0
      [ 4048.034224]  kthread+0xf8/0x130
      [ 4048.034226]  ? max_active_store+0x80/0x80
      [ 4048.034227]  ? kthread_bind+0x10/0x10
      [ 4048.034231]  ret_from_fork+0x35/0x40
      [ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks
      
      Fix this by calling cond_resched() after run_complete_job()'s callout to
      the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
      trace).
      Signed-off-by: John Pittman <jpittman@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
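The shape of the fix above, yielding after every completion callback so that a long batch cannot monopolize a CPU, can be illustrated with a user-space analogue. This is only a sketch: `run_complete_jobs` and the `resched` hook are hypothetical names, and the hook merely marks where the kernel would call cond_resched().

```python
def run_complete_jobs(callbacks, resched):
    """Run each completion callback, offering to yield after every one."""
    for notify in callbacks:
        notify()    # the per-job completion callback
        resched()   # yield point after each callback, not after the batch


log = []
callbacks = [lambda i=i: log.append(f"job {i}") for i in range(3)]
run_complete_jobs(callbacks, resched=lambda: log.append("resched"))
print(log)
```

In a threaded user-space program the hook could be `lambda: time.sleep(0)`, which gives other threads a chance to run between callbacks.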
  5. 07 Aug, 2018 2 commits
    • dm cache metadata: save in-core policy_hint_size to on-disk superblock · fd2fa954
      Mike Snitzer authored
      policy_hint_size starts as 0 during __write_initial_superblock().  It
      isn't until the policy is loaded that policy_hint_size is set in-core
      (cmd->policy_hint_size).  But it never got recorded in the on-disk
      superblock because __commit_transaction() didn't deal with transferring
      the in-core cmd->policy_hint_size to the on-disk superblock.
      
      The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
      __begin_transaction_flags() which re-reads all superblock fields.
      Because the superblock's policy_hint_size was never properly stored
      when the cache was created, hints_array_available() would always return
      false when re-activating a previously created cache.  This means
      __load_mappings() always considered the hints invalid and never made use
      of them (the hints exist purely as an optimization).
      
      Another detrimental side-effect of this oversight is the cache_check
      utility would fail with: "invalid hint width: 0"
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
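The failure described above is a round-trip bug: a value set in memory is never copied into the superblock at commit time, so it reads back as 0 on the next activation. The following user-space model reproduces that under simplifying assumptions. The class, the `save_hint_size` switch, and `hints_usable` are hypothetical, the hint size of 4 is an arbitrary example value, and the real hints_array_available() checks more than this single field.

```python
class CacheMetadata:
    """Toy model of in-core metadata backed by an on-disk superblock."""

    def __init__(self):
        self.policy_hint_size = 0                  # in-core value
        self.superblock = {"policy_hint_size": 0}  # initial on-disk value

    def load_policy(self, hint_size):
        self.policy_hint_size = hint_size          # set in-core only

    def commit(self, save_hint_size):
        if save_hint_size:                         # the step the fix adds
            self.superblock["policy_hint_size"] = self.policy_hint_size

    def reopen(self):
        fresh = CacheMetadata()
        fresh.superblock = dict(self.superblock)
        # Re-activation re-reads every field from the superblock.
        fresh.policy_hint_size = fresh.superblock["policy_hint_size"]
        return fresh


def hints_usable(cmd, policy_hint_size):
    """Simplified check: stored hint size must match the policy's."""
    return cmd.policy_hint_size == policy_hint_size


def roundtrip(save_hint_size):
    cmd = CacheMetadata()
    cmd.load_policy(hint_size=4)
    cmd.commit(save_hint_size)
    return hints_usable(cmd.reopen(), policy_hint_size=4)


print(roundtrip(False), roundtrip(True))  # prints "False True"
```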
    • dm thin: stop no_space_timeout worker when switching to write-mode · 75294442
      Hou Tao authored
      Now both check_for_space() and do_no_space_timeout() will read & write
      pool->pf.error_if_no_space.  If these functions run concurrently, as
      shown in the following case, the default setting of "queue_if_no_space"
      can get lost.
      
      precondition:
          * error_if_no_space = false (aka "queue_if_no_space")
          * pool is in Out-of-Data-Space (OODS) mode
          * no_space_timeout worker has been queued
      
      CPU 0:                          CPU 1:
      // delete a thin device
      process_delete_mesg()
      // check_for_space() invoked by commit()
      set_pool_mode(pool, PM_WRITE)
          pool->pf.error_if_no_space = \
           pt->requested_pf.error_if_no_space
      
      				// timeout, pool is still in OODS mode
      				do_no_space_timeout
      				    // "queue_if_no_space" config is lost
      				    pool->pf.error_if_no_space = true
          pool->pf.mode = new_mode
      
      Fix it by stopping no_space_timeout worker when switching to write mode.
      
      Fixes: bcc696fa ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
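The interleaving in the CPU 0 / CPU 1 diagram above can be replayed deterministically in user space. The sketch below is a simplified model, not the kernel code: the `Pool` class, the `between_stores` hook (marking the point where CPU 1 runs), and the `cancel_timeout` switch (standing in for stopping the no_space_timeout worker) are all hypothetical.

```python
class Pool:
    """Toy model of the thin-pool fields involved in the race."""

    def __init__(self, requested_error_if_no_space=False):
        self.requested_error_if_no_space = requested_error_if_no_space
        self.error_if_no_space = requested_error_if_no_space
        self.mode = "OODS"           # out-of-data-space mode
        self.timeout_queued = True   # no_space_timeout worker is pending

    def do_no_space_timeout(self):
        if not self.timeout_queued:
            return                   # worker was stopped, nothing runs
        self.timeout_queued = False
        if self.mode == "OODS" and not self.error_if_no_space:
            self.error_if_no_space = True   # "queue_if_no_space" is lost

    def set_write_mode(self, cancel_timeout, between_stores=lambda: None):
        if cancel_timeout:
            self.timeout_queued = False     # the fix: stop the worker first
        self.error_if_no_space = self.requested_error_if_no_space
        between_stores()                    # CPU 1 runs here in the diagram
        self.mode = "WRITE"


def race(cancel_timeout):
    pool = Pool()
    pool.set_write_mode(cancel_timeout, between_stores=pool.do_no_space_timeout)
    return pool.error_if_no_space


print(race(cancel_timeout=False), race(cancel_timeout=True))  # prints "True False"
```

Without the cancellation the pool ends up in write mode with `error_if_no_space` set, even though "queue_if_no_space" was requested.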
  6. 31 Jul, 2018 1 commit
  7. 30 Jul, 2018 1 commit
    • dm thin: include metadata_low_watermark threshold in pool status · 63c8ecb6
      Andy Grover authored
      The metadata low watermark threshold is set by the kernel.  But the
      kernel depends on userspace to extend the thinpool metadata device when
      the threshold is crossed.
      
      Since the metadata low watermark threshold is not visible to userspace,
      upon receiving an event, userspace cannot tell that the kernel wants the
      metadata device extended, instead of some other eventing condition.
      Making it visible (but not settable) enables userspace to affirmatively
      know the kernel is asking for a metadata device extension, by comparing
      metadata_low_watermark against nr_free_blocks_metadata, also reported in
      status.
      
      Current solutions like dmeventd have their own thresholds for extending
      the data and metadata devices, and both devices are checked against
      their thresholds on each event.  This lessens the value of the kernel-set
      threshold, since userspace will either extend the metadata device sooner,
      when receiving another event; or will receive the metadata lowater event
      and do nothing, if dmeventd's threshold is less than the kernel's.
      (This second case is dangerous. The metadata lowater event will not be
      re-sent, so no further event will be generated before the metadata
      device is out of space, unless some other event causes userspace to
      recheck its thresholds.)
      Signed-off-by: Andy Grover <agrover@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
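The comparison that the commit above makes possible for userspace can be sketched as a single predicate. The function name and example numbers are hypothetical, the free-block count is derived here as total minus used metadata blocks, and treating "free blocks at or below the watermark" as the trigger is an assumption rather than something the commit message states.

```python
def kernel_wants_metadata_extension(used_blocks, total_blocks, low_watermark):
    """Return True if free metadata blocks have reached the low watermark."""
    nr_free_blocks_metadata = total_blocks - used_blocks
    # Assumption: the event corresponds to free <= watermark.
    return nr_free_blocks_metadata <= low_watermark


# Hypothetical values, in metadata blocks.
print(kernel_wants_metadata_extension(used_blocks=900, total_blocks=1024,
                                      low_watermark=128))
print(kernel_wants_metadata_extension(used_blocks=100, total_blocks=1024,
                                      low_watermark=128))
```

A monitoring daemon could run a check like this on each event to tell a metadata low-water event apart from other event sources, instead of relying only on its own thresholds.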
  8. 27 Jul, 2018 16 commits
  9. 22 Jul, 2018 8 commits
  10. 21 Jul, 2018 6 commits