1. 09 Aug, 2018 2 commits
    • dm cache metadata: set dirty on all cache blocks after a crash · 5b1fe7be
      Ilya Dryomov authored
      Quoting Documentation/device-mapper/cache.txt:
      
        The 'dirty' state for a cache block changes far too frequently for us
        to keep updating it on the fly.  So we treat it as a hint.  In normal
        operation it will be written when the dm device is suspended.  If the
        system crashes all cache blocks will be assumed dirty when restarted.
      
      This got broken in commit f177940a ("dm cache metadata: switch to
      using the new cursor api for loading metadata") in 4.9, which removed
      the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
      flag) when loading cache blocks.  This results in data corruption on an
      unclean shutdown with dirty cache blocks on the fast device.  After the
      crash those blocks are considered clean and may get evicted from the
      cache at any time.  This can be demonstrated by doing a lot of reads
      to trigger individual evictions, but uncache is more predictable:
      
        ### Disable auto-activation in lvm.conf to be able to do uncache in
        ### time (i.e. see uncache doing flushing) when the fix is applied.
      
        # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
        # vgcreate vg_cache /dev/vdb /dev/vdc
        # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
        # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
        # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
        # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
        # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
        # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # dmsetup status vg_cache-lv_slowdev
        0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
                                                                  ^^^^
                                      7065 * 64k = 441M yet to be written to the slow device
        # echo b >/proc/sysrq-trigger
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 0 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
        0fe00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
      
      This is the case with both v1 and v2 cache pool metadata formats.
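The 441M figure annotated in the status output above follows from two fields of that line: the cache block size in 512-byte sectors (128) and the dirty block count (7065). A minimal sketch of the arithmetic; `dirty_mib()` is a name made up for illustration and is not part of any dm tool:

```c
#include <assert.h>
#include <stdint.h>

/* Dirty data in whole MiB, given the cache block size in 512-byte sectors
 * and the number of dirty blocks, both as printed by `dmsetup status`.
 * dirty_mib() is a hypothetical helper used only for this illustration. */
static uint64_t dirty_mib(uint64_t block_sectors, uint64_t dirty_blocks)
{
	assert(block_sectors > 0);
	uint64_t bytes = block_sectors * 512 * dirty_blocks;
	return bytes / (1024 * 1024);   /* integer division rounds down */
}
```

With the values from the status line, `dirty_mib(128, 7065)` gives 441, matching the annotation.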
      
      After applying this patch:
      
        # vgchange -ay vg_cache
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        # lvconvert --uncache vg_cache/lv_slowdev
        Flushing 3724 blocks for cache vg_cache/lv_slowdev.
        ...
        Flushing 71 blocks for cache vg_cache/lv_slowdev.
        Logical volume "lv_cachedev" successfully removed
        Logical volume vg_cache/lv_slowdev is not cached.
        # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
        0fe00000:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
        0fe00010:  bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
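The rule restored by this patch can be sketched roughly as follows. This is a simplified userspace model, not the kernel code: `struct mapping`, `load_dirty_flag()` and the `clean_shutdown` parameter are illustrative names standing in for the per-block on-disk dirty hint and the CLEAN_SHUTDOWN superblock flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one cache block mapping read from metadata. */
struct mapping {
	bool dirty_on_disk;   /* dirty hint as last written to the metadata */
};

/* Decide whether a cache block is treated as dirty when it is loaded.
 * After an unclean shutdown the on-disk hint cannot be trusted, so every
 * block is assumed dirty; otherwise the stored hint is used. */
static bool load_dirty_flag(const struct mapping *m, bool clean_shutdown)
{
	assert(m != NULL);
	if (!clean_shutdown)
		return true;
	return m->dirty_on_disk;
}
```

In this model a block whose hint says "clean" is still loaded as dirty after a crash, which is why the uncache above flushes blocks instead of discarding them.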
      
      Cc: stable@vger.kernel.org
      Fixes: f177940a ("dm cache metadata: switch to using the new cursor api for loading metadata")
      Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm snapshot: remove stale FIXME in snapshot_map() · c9a5e6a9
      Mike Snitzer authored
      Commit ae1093be ("dm snapshot: use mutex instead of rw_semaphore")
      eliminated the need to worry about read vs write locking.  So remove a
      FIXME in snapshot_map() that is concerned about selectively taking a
      write lock.
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  2. 08 Aug, 2018 2 commits
    • dm snapshot: improve performance by switching out_of_order_list to rbtree · 3db2776d
      David Jeffery authored
      copy_complete()'s processing of out_of_order_list can result in
      quadratic complexity in the worst case.  As such it consumed too much
      CPU and caused a significant loss in performance.
      
      Fix this by converting out_of_order_list to an rbtree.  This improved
      a dm-snapshot test copy workload from 32 seconds to 4 seconds.
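Where the quadratic cost comes from can be illustrated with a small model. It assumes the out-of-order entries are kept ordered by sequence number and that each insertion scans linearly for its position; the array, the function names and the 1024-entry cap are all made up for the sketch and do not reproduce the kernel's list handling.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CAP 1024   /* arbitrary cap for this illustration */

/* Insert seq into the ascending array a[0..n), scanning backwards from the
 * tail. Returns the number of comparisons made. The caller must provide
 * room for n + 1 elements. */
static size_t sorted_insert(unsigned long *a, size_t n, unsigned long seq)
{
	size_t i = n, cmps = 0;

	while (i > 0) {
		cmps++;
		if (a[i - 1] < seq)
			break;
		a[i] = a[i - 1];   /* shift the larger element up */
		i--;
	}
	a[i] = seq;
	return cmps;
}

/* Total comparisons when n entries arrive in strictly descending order,
 * which forces every insertion to scan the whole existing sequence. */
static size_t worst_case_comparisons(size_t n)
{
	unsigned long buf[MODEL_CAP];
	size_t total = 0;

	assert(n <= MODEL_CAP);
	for (size_t k = 0; k < n; k++)
		total += sorted_insert(buf, k, (unsigned long)(n - k));
	return total;
}
```

In this model n entries cost n(n-1)/2 comparisons, so 1000 entries need 499500. A balanced tree such as an rbtree bounds each insertion by O(log n) instead.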
      Signed-off-by: David Jeffery <djeffery@redhat.com>
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Brett Hull <bhull@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm kcopyd: avoid softlockup in run_complete_job · 784c9a29
      John Pittman authored
      It was reported that softlockups occur when using dm-snapshot on top of
      slow (rbd) storage.  E.g.:
      
      [ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
      ...
      [ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
      [ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
      ...
      [ 4048.034190] Call Trace:
      [ 4048.034196]  ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
      [ 4048.034200]  run_complete_job+0x5f/0xb0 [dm_mod]
      [ 4048.034205]  process_jobs+0x91/0x220 [dm_mod]
      [ 4048.034210]  ? kcopyd_put_pages+0x40/0x40 [dm_mod]
      [ 4048.034214]  do_work+0x46/0xa0 [dm_mod]
      [ 4048.034219]  process_one_work+0x171/0x370
      [ 4048.034221]  worker_thread+0x1fc/0x3f0
      [ 4048.034224]  kthread+0xf8/0x130
      [ 4048.034226]  ? max_active_store+0x80/0x80
      [ 4048.034227]  ? kthread_bind+0x10/0x10
      [ 4048.034231]  ret_from_fork+0x35/0x40
      [ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks
      
      Fix this by calling cond_resched() after run_complete_job()'s callout to
      the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
      trace).
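The shape of the fix can be sketched in userspace, with `sched_yield()` standing in for the kernel's cond_resched(). Everything here (`struct job`, `run_complete_jobs()`, the demo helpers) is a hypothetical analogue, not the kcopyd code:

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

typedef void (*notify_fn)(void *context);

/* Hypothetical completed job carrying its completion callback. */
struct job {
	notify_fn fn;
	void *context;
};

/* Run every job's completion callback and give the scheduler a chance to
 * run something else after each one, so that a long run of slow callbacks
 * cannot monopolize the CPU. Returns the number of callbacks invoked. */
static size_t run_complete_jobs(struct job *jobs, size_t n)
{
	size_t done = 0;

	for (size_t i = 0; i < n; i++) {
		assert(jobs[i].fn != NULL);
		jobs[i].fn(jobs[i].context);
		sched_yield();   /* userspace stand-in for cond_resched() */
		done++;
	}
	return done;
}

/* Demo callback: counts how often it was called. */
static void count_callback(void *context)
{
	(*(int *)context)++;
}

/* Demo: run three jobs and report how many callbacks fired. */
static int demo_callbacks_run(void)
{
	int hits = 0;
	struct job jobs[3] = {
		{ count_callback, &hits },
		{ count_callback, &hits },
		{ count_callback, &hits },
	};

	run_complete_jobs(jobs, 3);
	return hits;
}
```

The yield sits after the callback rather than before it, mirroring the description above: the expensive part is the callout, so that is where the loop offers to reschedule.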
      Signed-off-by: John Pittman <jpittman@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  3. 07 Aug, 2018 2 commits
    • dm cache metadata: save in-core policy_hint_size to on-disk superblock · fd2fa954
      Mike Snitzer authored
      policy_hint_size starts as 0 during __write_initial_superblock().  It
      isn't until the policy is loaded that policy_hint_size is set in-core
      (cmd->policy_hint_size).  But it never got recorded in the on-disk
      superblock because __commit_transaction() didn't deal with transferring
      the in-core cmd->policy_hint_size to the on-disk superblock.
      
      The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
      __begin_transaction_flags() which re-reads all superblock fields.
      Because the superblock's policy_hint_size was never properly stored, when
      the cache was created, hints_array_available() would always return false
      when re-activating a previously created cache.  This means
      __load_mappings() always considered the hints invalid and never made use
      of the hints (these hints served to optimize).
      
      Another detrimental side-effect of this oversight is the cache_check
      utility would fail with: "invalid hint width: 0"
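The life cycle described above can be modelled in a few lines. This is a deliberately simplified sketch: the structs, the `save_hint_size` switch (which toggles between the old and the fixed behaviour) and the hint width of 4 are assumptions for illustration, and the real hints_array_available() checks more than the width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct disk_super { uint32_t policy_hint_size; };   /* on-disk superblock */
struct core_md    { uint32_t policy_hint_size; };   /* in-core metadata   */

/* Commit: copy in-core fields to the superblock. With save_hint_size set
 * to false this models the old behaviour, where the field was skipped. */
static void commit(const struct core_md *cmd, struct disk_super *sb,
		   bool save_hint_size)
{
	if (save_hint_size)
		sb->policy_hint_size = cmd->policy_hint_size;
}

/* Reopen: the in-core state is re-read from the superblock. */
static void reopen(struct core_md *cmd, const struct disk_super *sb)
{
	cmd->policy_hint_size = sb->policy_hint_size;
}

/* Simplified check: hints are usable only if the stored width matches the
 * width the policy expects. */
static bool hints_usable(const struct core_md *cmd, uint32_t wanted)
{
	return cmd->policy_hint_size == wanted;
}

/* Create a cache, load a policy, commit, then reactivate it. */
static bool hints_survive_reactivation(bool save_hint_size)
{
	struct disk_super sb = { 0 };    /* initial superblock: width 0 */
	struct core_md cmd = { 0 };
	struct core_md reopened = { 0 };

	cmd.policy_hint_size = 4;        /* set once the policy is loaded */
	commit(&cmd, &sb, save_hint_size);
	reopen(&reopened, &sb);
	return hints_usable(&reopened, 4);
}
```

With the field skipped at commit time the superblock keeps its initial 0, so the reopened cache never sees a matching width; with the field saved, the hints remain usable.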
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
    • dm thin: stop no_space_timeout worker when switching to write-mode · 75294442
      Hou Tao authored
      Now both check_for_space() and do_no_space_timeout() will read & write
      pool->pf.error_if_no_space.  If these functions run concurrently, as
      shown in the following case, the default setting of "queue_if_no_space"
      can get lost.
      
      precondition:
          * error_if_no_space = false (aka "queue_if_no_space")
          * pool is in Out-of-Data-Space (OODS) mode
          * no_space_timeout worker has been queued
      
      CPU 0:                          CPU 1:
      // delete a thin device
      process_delete_mesg()
      // check_for_space() invoked by commit()
      set_pool_mode(pool, PM_WRITE)
          pool->pf.error_if_no_space = \
           pt->requested_pf.error_if_no_space
      
      				// timeout, pool is still in OODS mode
      				do_no_space_timeout
      				    // "queue_if_no_space" config is lost
      				    pool->pf.error_if_no_space = true
          pool->pf.mode = new_mode
      
      Fix it by stopping no_space_timeout worker when switching to write mode.
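The interleaving in the diagram can be replayed deterministically in a single-threaded model. The sketch below is not the dm-thin code: `struct pool` is cut down to three fields, and the `stop_worker_first` switch is a made-up parameter that toggles between the racy ordering and the fixed one.

```c
#include <assert.h>
#include <stdbool.h>

enum pool_mode { PM_WRITE, PM_OUT_OF_DATA_SPACE };

/* Hypothetical, cut-down pool state. */
struct pool {
	enum pool_mode mode;
	bool error_if_no_space;
	bool timeout_queued;     /* no_space_timeout worker is pending */
};

/* Timeout worker body: if the pool is still out of data space, start
 * erroring I/O instead of queueing it. */
static void do_no_space_timeout(struct pool *p)
{
	if (p->mode == PM_OUT_OF_DATA_SPACE)
		p->error_if_no_space = true;
}

/* Switch to write mode, replaying the interleaving from the diagram: the
 * timeout fires after the requested setting has been restored but before
 * the mode is updated. stop_worker_first models the fix. */
static void switch_to_write_mode(struct pool *p, bool requested_error,
				 bool stop_worker_first)
{
	if (stop_worker_first)
		p->timeout_queued = false;   /* the fix: cancel the worker */

	p->error_if_no_space = requested_error;

	if (p->timeout_queued) {             /* "CPU 1" runs here */
		do_no_space_timeout(p);
		p->timeout_queued = false;
	}

	p->mode = PM_WRITE;
}

/* Start from the stated precondition and return the final setting. */
static bool setting_after_switch(bool stop_worker_first)
{
	struct pool p = { PM_OUT_OF_DATA_SPACE, false, true };

	switch_to_write_mode(&p, false, stop_worker_first);
	return p.error_if_no_space;
}
```

Without the cancellation the pool ends up in write mode with error_if_no_space set, so "queue_if_no_space" is lost; with it, the requested setting survives.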
      
      Fixes: bcc696fa ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
      Cc: stable@vger.kernel.org
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  4. 31 Jul, 2018 1 commit
  5. 30 Jul, 2018 1 commit
    • dm thin: include metadata_low_watermark threshold in pool status · 63c8ecb6
      Andy Grover authored
      The metadata low watermark threshold is set by the kernel.  But the
      kernel depends on userspace to extend the thinpool metadata device when
      the threshold is crossed.
      
      Since the metadata low watermark threshold is not visible to userspace,
      upon receiving an event, userspace cannot tell that the kernel wants the
      metadata device extended, instead of some other eventing condition.
      Making it visible (but not settable) enables userspace to affirmatively
      know the kernel is asking for a metadata device extension, by comparing
      metadata_low_watermark against nr_free_blocks_metadata, also reported in
      status.
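The comparison userspace could make on receiving an event might look like the sketch below. It assumes the free metadata block count is derived as total minus used from the pool status, and that reaching the watermark exactly already counts as crossed; both the function name and the non-strict comparison are assumptions, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace check: does the pool status indicate that the
 * kernel's metadata low watermark has been reached? All values are in
 * metadata blocks. */
static bool kernel_wants_metadata_extension(uint64_t used_blocks,
					    uint64_t total_blocks,
					    uint64_t low_watermark)
{
	assert(used_blocks <= total_blocks);
	uint64_t nr_free = total_blocks - used_blocks;

	/* Assumption: "at or below" the watermark counts as crossed. */
	return nr_free <= low_watermark;
}
```

A caller would extend the metadata device when this returns true, and otherwise treat the event as having some other cause.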
      
      Current solutions like dmeventd have their own thresholds for extending
      the data and metadata devices, and both devices are checked against
      their thresholds on each event.  This lessens the value of the kernel-set
      threshold, since userspace will either extend the metadata device sooner,
      when receiving another event; or will receive the metadata lowater event
      and do nothing, if dmeventd's threshold is less than the kernel's.
      (This second case is dangerous. The metadata lowater event will not be
      re-sent, so no further event will be generated before the metadata
      device is out of space, unless some other event causes userspace to
      recheck its thresholds.)
      Signed-off-by: Andy Grover <agrover@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  6. 27 Jul, 2018 16 commits
  7. 22 Jul, 2018 8 commits
  8. 21 Jul, 2018 8 commits
    • Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · ea75a2c7
      Linus Torvalds authored
      Pull core kernel fixes from Ingo Molnar:
       "This is mostly the copy_to_user_mcsafe() related fixes from Dan
        Williams, and an ORC fix for Clang"
      
      * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling
        lib/iov_iter: Fix pipe handling in _copy_to_iter_mcsafe()
        lib/iov_iter: Document _copy_to_iter_flushcache()
        lib/iov_iter: Document _copy_to_iter_mcsafe()
        objtool: Use '.strtab' if '.shstrtab' doesn't exist, to support ORC tables on Clang
    • Merge tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · ffb48e79
      Linus Torvalds authored
      Pull powerpc fixes from Michael Ellerman:
       "Two regression fixes, one for xmon disassembly formatting and the
        other to fix the E500 build.
      
        Two commits to fix a potential security issue in the VFIO code under
        obscure circumstances.
      
        And finally a fix to the Power9 idle code to restore SPRG3, which is
        user visible and used for sched_getcpu().
      
        Thanks to: Alexey Kardashevskiy, David Gibson, Gautham R. Shenoy,
        James Clarke"
      
      * tag 'powerpc-4.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
        powerpc/powernv: Fix save/restore of SPRG3 on entry/exit from stop (idle)
        powerpc/Makefile: Assemble with -me500 when building for E500
        KVM: PPC: Check if IOMMU page is contained in the pinned physical page
        vfio/spapr: Use IOMMU pageshift rather than pagesize
        powerpc/xmon: Fix disassembly since printf changes
    • Merge tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 55b636b4
      Linus Torvalds authored
      Pull btrfs fix from David Sterba:
       "A fix of a corruption regarding fsync and clone, under some very
        specific conditions explained in the patch.
      
        The fix is marked for stable 3.16+ so I'd like to get it merged now
        given the impact"
      
      * tag 'for-4.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
        Btrfs: fix file data corruption after cloning a range and fsync
    • mm: make vm_area_alloc() initialize core fields · 490fc053
      Linus Torvalds authored
      Like vm_area_dup(), it initializes the anon_vma_chain head, and the
      basic mm pointer.
      
      The rest of the fields end up being different for different users,
      although the plan is to also initialize the 'vm_ops' field to a dummy
      entry.
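A userspace analogue of the three helpers, with the initialization described here, might look like this. The struct layout, the `vma_*` names and the use of calloc/malloc in place of the kmem_cache calls are all stand-ins chosen for the sketch; it is not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

struct list_head { struct list_head *next, *prev; };
struct mm;   /* opaque stand-in for the owning address space */

/* Hypothetical, heavily reduced vma. */
struct vma {
	unsigned long start, end;
	struct mm *mm;
	struct list_head anon_vma_chain;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* New vma: zeroed, with the mm pointer and list head initialized. */
static struct vma *vma_alloc(struct mm *mm)
{
	struct vma *v = calloc(1, sizeof(*v));

	if (v) {
		v->mm = mm;
		init_list_head(&v->anon_vma_chain);
	}
	return v;
}

/* Copy of an existing vma: every field is copied, then the list head is
 * re-initialized so the copy does not point into the original's chain.
 * No zeroing is needed because the whole struct is overwritten. */
static struct vma *vma_dup(const struct vma *orig)
{
	struct vma *v = malloc(sizeof(*v));

	if (v) {
		*v = *orig;
		init_list_head(&v->anon_vma_chain);
	}
	return v;
}

static void vma_free(struct vma *v)
{
	free(v);
}

/* Demo: a duplicate carries the data but owns an empty list head.
 * Returns 1 on success, 0 on failure. */
static int demo_dup(void)
{
	struct vma *a = vma_alloc(NULL);
	struct vma *b;
	int ok;

	if (!a)
		return 0;
	a->start = 0x1000;
	a->end = 0x2000;
	b = vma_dup(a);
	ok = b && b->start == a->start && b->end == a->end &&
	     b->anon_vma_chain.next == &b->anon_vma_chain &&
	     b->anon_vma_chain.prev == &b->anon_vma_chain;
	vma_free(b);
	vma_free(a);
	return ok;
}
```

Centralizing the list-head and mm setup in the helpers is what lets callers drop their own copies of that boilerplate.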
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make vm_area_dup() actually copy the old vma data · 95faf699
      Linus Torvalds authored
      .. and re-initialize the anon_vma_chain head.
      
      This removes some boiler-plate from the users, and also makes it clear
      why it didn't need to use the 'zalloc()' version.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use helper functions for allocating and freeing vm_area structs · 3928d4f5
      Linus Torvalds authored
      The vm_area_struct is one of the most fundamental memory management
      objects, but the management of it is entirely open-coded everywhere,
      ranging from allocation and freeing (using kmem_cache_[z]alloc and
      kmem_cache_free) to initializing all the fields.
      
      We want to unify this in order to end up having some unified
      initialization of the vmas, and the first step to this is to at least
      have basic allocation functions.
      
      Right now those functions are literally just wrappers around the
      kmem_cache_*() calls.  This is a purely mechanical conversion:
      
          # new vma:
          kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL) -> vm_area_alloc()
      
          # copy old vma
          kmem_cache_alloc(vm_area_cachep, GFP_KERNEL) -> vm_area_dup(old)
      
          # free vma
          kmem_cache_free(vm_area_cachep, vma) -> vm_area_free(vma)
      
      to the point where the old vma passed in to the vm_area_dup() function
      isn't even used yet (because I've left all the old manual initialization
      alone).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge branch 'akpm' (patches from Andrew) · 191a3afa
      Linus Torvalds authored
      Merge fixes from Andrew Morton:
       "5 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm: memcg: fix use after free in mem_cgroup_iter()
        mm/huge_memory.c: fix data loss when splitting a file pmd
        fat: fix memory allocation failure handling of match_strdup()
        MAINTAINERS: Peter has moved
        mm/memblock: add missing include <linux/bootmem.h>
    • mm: memcg: fix use after free in mem_cgroup_iter() · 9f15bde6
      Jing Xia authored
      It was reported that a kernel crash happened in mem_cgroup_iter(), which
      can be triggered if the legacy cgroup-v1 non-hierarchical mode is used.
      
      Unable to handle kernel paging request at virtual address 6b6b6b6b6b6b8f
      ......
      Call trace:
        mem_cgroup_iter+0x2e0/0x6d4
        shrink_zone+0x8c/0x324
        balance_pgdat+0x450/0x640
        kswapd+0x130/0x4b8
        kthread+0xe8/0xfc
        ret_from_fork+0x10/0x20
      
        mem_cgroup_iter():
            ......
            if (css_tryget(css))    <-- crash here
                break;
            ......
      
      The crash occurs because mem_cgroup_iter() uses the memcg object whose
      pointer is stored in iter->position, which has already been freed and
      filled with POISON_FREE (0x6b).
      
      And the root cause of the use-after-free issue is that
      invalidate_reclaim_iterators() fails to reset the value of iter->position
      to NULL when the css of the memcg is released in non-hierarchical mode.
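The invariant that has to hold can be shown with a small model: once a memcg is released, no iterator may keep its address in `position`. The types and `invalidate_iters()` below are hypothetical; how the kernel locates the affected iterators (per node, per priority, per hierarchy) is not modelled.

```c
#include <assert.h>
#include <stddef.h>

struct memcg { int id; };                         /* stand-in object     */
struct reclaim_iter { struct memcg *position; };  /* cached walk position */

/* Clear every cached position that refers to the memcg being released, so
 * a later walk restarts instead of dereferencing freed memory. Returns
 * the number of iterators that were reset. */
static size_t invalidate_iters(struct reclaim_iter *iters, size_t n,
			       const struct memcg *dead)
{
	size_t cleared = 0;

	for (size_t i = 0; i < n; i++) {
		if (iters[i].position == dead) {
			iters[i].position = NULL;
			cleared++;
		}
	}
	return cleared;
}

/* Demo: two of three iterators point at the memcg being released.
 * Returns 1 if exactly those two were reset, 0 otherwise. */
static int demo_no_stale_position(void)
{
	struct memcg a = { 1 }, b = { 2 };
	struct reclaim_iter iters[3] = { { &a }, { &b }, { &a } };
	size_t cleared = invalidate_iters(iters, 3, &a);

	return cleared == 2 &&
	       iters[0].position == NULL &&
	       iters[1].position == &b &&
	       iters[2].position == NULL;
}
```

The bug described above corresponds to this reset being skipped for some iterators, leaving a position that still holds the freed address.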
      
      Link: http://lkml.kernel.org/r/1531994807-25639-1-git-send-email-jing.xia@unisoc.com
      Fixes: 6df38689 ("mm: memcontrol: fix possible memcg leak due to interrupted reclaim")
      Signed-off-by: Jing Xia <jing.xia.mail@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <chunyan.zhang@unisoc.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>