1. 12 Aug, 2017 1 commit
  2. 11 Aug, 2017 17 commits
  3. 10 Aug, 2017 22 commits
    • Linus Torvalds's avatar
      Merge branch 'akpm' (patches from Andrew) · 27df704d
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "21 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (21 commits)
        userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage
        zram: rework copy of compressor name in comp_algorithm_store()
        rmap: do not call mmu_notifier_invalidate_page() under ptl
        mm: fix list corruptions on shmem shrinklist
        mm/balloon_compaction.c: don't zero ballooned pages
        MAINTAINERS: copy virtio on balloon_compaction.c
        mm: fix KSM data corruption
        mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem
        mm: make tlb_flush_pending global
        mm: refactor TLB gathering API
        Revert "mm: numa: defer TLB flush for THP migration as long as possible"
        mm: migrate: fix barriers around tlb_flush_pending
        mm: migrate: prevent racy access to tlb_flush_pending
        fault-inject: fix wrong should_fail() decision in task context
        test_kmod: fix small memory leak on filesystem tests
        test_kmod: fix the lock in register_test_dev_kmod()
        test_kmod: fix bug which allows negative values on two config options
        test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY"
        userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case
        mm: ratelimit PFNs busy info message
        ...
      27df704d
    • Mike Rapoport's avatar
      userfaultfd: replace ENOSPC with ESRCH in case mm has gone during copy/zeropage · e86b298b
      Mike Rapoport authored
      When the process exit races with outstanding mcopy_atomic, it would be
      better to return ESRCH error.  When such race occurs the process and
      it's mm are going away and returning "no such process" to the uffd
      monitor seems better fit than ENOSPC.
      
      Link: http://lkml.kernel.org/r/1502111545-32305-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: default avatarMike Rapoport <rppt@linux.vnet.ibm.com>
      Suggested-by: default avatarMichal Hocko <mhocko@suse.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e86b298b
    • Matthias Kaehlcke's avatar
      zram: rework copy of compressor name in comp_algorithm_store() · f357e345
      Matthias Kaehlcke authored
      comp_algorithm_store() passes the size of the source buffer to strlcpy()
      instead of the destination buffer size.  Make it explicit that the two
      buffers have the same size and use strcpy() instead of strlcpy().  The
      latter can be done safely since the function ensures that the string in
      the source buffer is terminated.
      
      Link: http://lkml.kernel.org/r/20170803163350.45245-1-mka@chromium.orgSigned-off-by: default avatarMatthias Kaehlcke <mka@chromium.org>
      Reviewed-by: default avatarDouglas Anderson <dianders@chromium.org>
      Reviewed-by: default avatarSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f357e345
    • Kirill A. Shutemov's avatar
      rmap: do not call mmu_notifier_invalidate_page() under ptl · aac2fea9
      Kirill A. Shutemov authored
      MMU notifiers can sleep, but in page_mkclean_one() we call
      mmu_notifier_invalidate_page() under page table lock.
      
      Let's instead use mmu_notifier_invalidate_range() outside
      page_vma_mapped_walk() loop.
      
      [jglisse@redhat.com: try_to_unmap_one() do not call mmu_notifier under ptl]
        Link: http://lkml.kernel.org/r/20170809204333.27485-1-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170804134928.l4klfcnqatni7vsc@black.fi.intel.com
      Fixes: c7ab0d2f ("mm: convert try_to_unmap_one() to use page_vma_mapped_walk()")
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: default avatarJérôme Glisse <jglisse@redhat.com>
      Reported-by: default avataraxie <axie@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: "Writer, Tim" <Tim.Writer@amd.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      aac2fea9
    • Cong Wang's avatar
      mm: fix list corruptions on shmem shrinklist · d041353d
      Cong Wang authored
      We saw many list corruption warnings on shmem shrinklist:
      
        WARNING: CPU: 18 PID: 177 at lib/list_debug.c:59 __list_del_entry+0x9e/0xc0
        list_del corruption. prev->next should be ffff9ae5694b82d8, but was ffff9ae5699ba960
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 18 PID: 177 Comm: kswapd1 Not tainted 4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
          dump_stack+0x4d/0x66
          __warn+0xcb/0xf0
          warn_slowpath_fmt+0x4f/0x60
          __list_del_entry+0x9e/0xc0
          shmem_unused_huge_shrink+0xfa/0x2e0
          shmem_unused_huge_scan+0x20/0x30
          super_cache_scan+0x193/0x1a0
          shrink_slab.part.41+0x1e3/0x3f0
          shrink_slab+0x29/0x30
          shrink_node+0xf9/0x2f0
          kswapd+0x2d8/0x6c0
          kthread+0xd7/0xf0
          ret_from_fork+0x22/0x30
      
        WARNING: CPU: 23 PID: 639 at lib/list_debug.c:33 __list_add+0x89/0xb0
        list_add corruption. prev->next should be next (ffff9ae5699ba960), but was ffff9ae5694b82d8. (prev=ffff9ae5694b82d8).
        Modules linked in: intel_rapl sb_edac edac_core x86_pkg_temp_thermal coretemp iTCO_wdt iTCO_vendor_support crct10dif_pclmul crc32_pclmul ghash_clmulni_intel raid0 dcdbas shpchp wmi hed i2c_i801 ioatdma lpc_ich i2c_smbus acpi_cpufreq tcp_diag inet_diag sch_fq_codel ipmi_si ipmi_devintf ipmi_msghandler igb ptp crc32c_intel pps_core i2c_algo_bit i2c_core dca ipv6 crc_ccitt
        CPU: 23 PID: 639 Comm: systemd-udevd Tainted: G        W       4.9.34-t3.el7.twitter.x86_64 #1
        Hardware name: Dell Inc. PowerEdge C6220/0W6W6G, BIOS 2.2.3 11/07/2013
        Call Trace:
          dump_stack+0x4d/0x66
          __warn+0xcb/0xf0
          warn_slowpath_fmt+0x4f/0x60
          __list_add+0x89/0xb0
          shmem_setattr+0x204/0x230
          notify_change+0x2ef/0x440
          do_truncate+0x5d/0x90
          path_openat+0x331/0x1190
          do_filp_open+0x7e/0xe0
          do_sys_open+0x123/0x200
          SyS_open+0x1e/0x20
          do_syscall_64+0x61/0x170
          entry_SYSCALL64_slow_path+0x25/0x25
      
      The problem is that shmem_unused_huge_shrink() moves entries from the
      global sbinfo->shrinklist to its local lists and then releases the
      spinlock.  However, a parallel shmem_setattr() could access one of these
      entries directly and add it back to the global shrinklist if it is
      removed, with the spinlock held.
      
      The logic itself looks solid since an entry could be either in a local
      list or the global list, otherwise it is removed from one of them by
      list_del_init().  So probably the race condition is that, one CPU is in
      the middle of INIT_LIST_HEAD() but the other CPU calls list_empty()
      which returns true too early then the following list_add_tail() sees a
      corrupted entry.
      
      list_empty_careful() is designed to fix this situation.
      
      [akpm@linux-foundation.org: add comments]
      Link: http://lkml.kernel.org/r/20170803054630.18775-1-xiyou.wangcong@gmail.com
      Fixes: 779750d2 ("shmem: split huge pages beyond i_size under memory pressure")
      Signed-off-by: default avatarCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d041353d
    • Wei Wang's avatar
      mm/balloon_compaction.c: don't zero ballooned pages · af54aed9
      Wei Wang authored
      Revert commit bb01b64c ("mm/balloon_compaction.c: enqueue zero page
      to balloon device")'
      
      Zeroing ballon pages is rather time consuming, especially when a lot of
      pages are in flight. E.g. 7GB worth of ballooned memory takes 2.8s with
      __GFP_ZERO while it takes ~491ms without it.
      
      The original commit argued that zeroing will help ksmd to merge these
      pages on the host but this argument is assuming that the host actually
      marks balloon pages for ksm which is not universally true.  So we pay
      performance penalty for something that even might not be used in the end
      which is wrong.  The host can zero out pages on its own when there is a
      need.
      
      [mhocko@kernel.org: new changelog text]
      Link: http://lkml.kernel.org/r/1501761557-9758-1-git-send-email-wei.w.wang@intel.com
      Fixes: bb01b64c ("mm/balloon_compaction.c: enqueue zero page to balloon device")
      Signed-off-by: default avatarWei Wang <wei.w.wang@intel.com>
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: zhenwei.pi <zhenwei.pi@youruncloud.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      af54aed9
    • Michael S. Tsirkin's avatar
      MAINTAINERS: copy virtio on balloon_compaction.c · c0a6a5ae
      Michael S. Tsirkin authored
      Changes to mm/balloon_compaction.c can easily break virtio, and virtio
      is the only user of that interface.  Add a line to MAINTAINERS so
      whoever changes that file remembers to copy us.
      
      Link: http://lkml.kernel.org/r/1501764010-24456-1-git-send-email-mst@redhat.comSigned-off-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Acked-by: default avatarRafael Aquini <aquini@redhat.com>
      Acked-by: default avatarWei Wang <wei.w.wang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c0a6a5ae
    • Minchan Kim's avatar
      mm: fix KSM data corruption · b3a81d08
      Minchan Kim authored
      Nadav reported KSM can corrupt the user data by the TLB batching
      race[1].  That means data user written can be lost.
      
      Quote from Nadav Amit:
       "For this race we need 4 CPUs:
      
        CPU0: Caches a writable and dirty PTE entry, and uses the stale value
        for write later.
      
        CPU1: Runs madvise_free on the range that includes the PTE. It would
        clear the dirty-bit. It batches TLB flushes.
      
        CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
        We care about the fact that it clears the PTE write-bit, and of
        course, batches TLB flushes.
      
        CPU3: Runs KSM. Our purpose is to pass the following test in
        write_protect_page():
      
      	if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
      	    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))
      
        Since it will avoid TLB flush. And we want to do it while the PTE is
        stale. Later, and before replacing the page, we would be able to
        change the page.
      
        Note that all the operations the CPU1-3 perform canhappen in parallel
        since they only acquire mmap_sem for read.
      
        We start with two identical pages. Everything below regards the same
        page/PTE.
      
        CPU0        CPU1        CPU2        CPU3
        ----        ----        ----        ----
        Write the same
        value on page
      
        [cache PTE as
         dirty in TLB]
      
                    MADV_FREE
                    pte_mkclean()
      
                                4 > clear_refs
                                pte_wrprotect()
      
                                            write_protect_page()
                                            [ success, no flush ]
      
                                            pages_indentical()
                                            [ ok ]
      
        Write to page
        different value
      
        [Ok, using stale
         PTE]
      
                                            replace_page()
      
        Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
        CPU0 already wrote on the page, but KSM ignored this write, and it got
        lost"
      
      In above scenario, MADV_FREE is fixed by changing TLB batching API
      including [set|clear]_tlb_flush_pending.  Remained thing is soft-dirty
      part.
      
      This patch changes soft-dirty uses TLB batching API instead of
      flush_tlb_mm and KSM checks pending TLB flush by using
      mm_tlb_flush_pending so that it will flush TLB to avoid data lost if
      there are other parallel threads pending TLB flush.
      
      [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarNadav Amit <namit@vmware.com>
      Tested-by: default avatarNadav Amit <namit@vmware.com>
      Reviewed-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b3a81d08
    • Minchan Kim's avatar
      mm: fix MADV_[FREE|DONTNEED] TLB flush miss problem · 99baac21
      Minchan Kim authored
      Nadav reported parallel MADV_DONTNEED on same range has a stale TLB
      problem and Mel fixed it[1] and found same problem on MADV_FREE[2].
      
      Quote from Mel Gorman:
       "The race in question is CPU 0 running madv_free and updating some PTEs
        while CPU 1 is also running madv_free and looking at the same PTEs.
        CPU 1 may have writable TLB entries for a page but fail the pte_dirty
        check (because CPU 0 has updated it already) and potentially fail to
        flush.
      
        Hence, when madv_free on CPU 1 returns, there are still potentially
        writable TLB entries and the underlying PTE is still present so that a
        subsequent write does not necessarily propagate the dirty bit to the
        underlying PTE any more. Reclaim at some unknown time at the future
        may then see that the PTE is still clean and discard the page even
        though a write has happened in the meantime. I think this is possible
        but I could have missed some protection in madv_free that prevents it
        happening."
      
      This patch aims for solving both problems all at once and is ready for
      other problem with KSM, MADV_FREE and soft-dirty story[3].
      
      TLB batch API(tlb_[gather|finish]_mmu] uses [inc|dec]_tlb_flush_pending
      and mmu_tlb_flush_pending so that when tlb_finish_mmu is called, we can
      catch there are parallel threads going on.  In that case, forcefully,
      flush TLB to prevent for user to access memory via stale TLB entry
      although it fail to gather page table entry.
      
      I confirmed this patch works with [4] test program Nadav gave so this
      patch supersedes "mm: Always flush VMA ranges affected by zap_page_range
      v2" in current mmotm.
      
      NOTE:
      
      This patch modifies arch-specific TLB gathering interface(x86, ia64,
      s390, sh, um).  It seems most of architecture are straightforward but
      s390 need to be careful because tlb_flush_mmu works only if
      mm->context.flush_mm is set to non-zero which happens only a pte entry
      really is cleared by ptep_get_and_clear and friends.  However, this
      problem never changes the pte entries but need to flush to prevent
      memory access from stale tlb.
      
      [1] http://lkml.kernel.org/r/20170725101230.5v7gvnjmcnkzzql3@techsingularity.net
      [2] http://lkml.kernel.org/r/20170725100722.2dxnmgypmwnrfawp@suse.de
      [3] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com
      [4] https://patchwork.kernel.org/patch/9861621/
      
      [minchan@kernel.org: decrease tlb flush pending count in tlb_finish_mmu]
        Link: http://lkml.kernel.org/r/20170808080821.GA31730@bbox
      Link: http://lkml.kernel.org/r/20170802000818.4760-7-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarNadav Amit <namit@vmware.com>
      Reported-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      99baac21
    • Minchan Kim's avatar
      mm: make tlb_flush_pending global · 0a2dd266
      Minchan Kim authored
      Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
      COMPACTION] but upcoming patches to solve subtle TLB flush batching
      problem will use it regardless of compaction/NUMA so this patch doesn't
      remove the dependency.
      
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
      Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a2dd266
    • Minchan Kim's avatar
      mm: refactor TLB gathering API · 56236a59
      Minchan Kim authored
      This patch is a preparatory patch for solving race problems caused by
      TLB batch.  For that, we will increase/decrease TLB flush pending count
      of mm_struct whenever tlb_[gather|finish]_mmu is called.
      
      Before making it simple, this patch separates architecture specific part
      and rename it to arch_tlb_[gather|finish]_mmu and generic part just
      calls it.
      
      It shouldn't change any behavior.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-5-namit@vmware.comSigned-off-by: default avatarMinchan Kim <minchan@kernel.org>
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@techsingularity.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56236a59
    • Nadav Amit's avatar
      Revert "mm: numa: defer TLB flush for THP migration as long as possible" · a9b80250
      Nadav Amit authored
      While deferring TLB flushes is a good practice, the reverted patch
      caused pending TLB flushes to be checked while the page-table lock is
      not taken.  As a result, in architectures with weak memory model (PPC),
      Linux may miss a memory-barrier, miss the fact TLB flushes are pending,
      and cause (in theory) a memory corruption.
      
      Since the alternative of using smp_mb__after_unlock_lock() was
      considered a bit open-coded, and the performance impact is expected to
      be small, the previous patch is reverted.
      
      This reverts b0943d61 ("mm: numa: defer TLB flush for THP migration
      as long as possible").
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.comSigned-off-by: default avatarNadav Amit <namit@vmware.com>
      Suggested-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a9b80250
    • Nadav Amit's avatar
      mm: migrate: fix barriers around tlb_flush_pending · 0a2c4048
      Nadav Amit authored
      Reading tlb_flush_pending while the page-table lock is taken does not
      require a barrier, since the lock/unlock already acts as a barrier.
      Removing the barrier in mm_tlb_flush_pending() to address this issue.
      
      However, migrate_misplaced_transhuge_page() calls mm_tlb_flush_pending()
      while the page-table lock is already released, which may present a
      problem on architectures with weak memory model (PPC).  To deal with
      this case, a new parameter is added to mm_tlb_flush_pending() to
      indicate if it is read without the page-table lock taken, and calling
      smp_mb__after_unlock_lock() in this case.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-3-namit@vmware.comSigned-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a2c4048
    • Nadav Amit's avatar
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit authored
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that Linux TLB batching mechanism suffers from various
      races.  Races that are caused due to batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was was
      confirmed by adding assertion to check tlb_flush_pending is not set by
      two threads, adding artificial latency in change_protection_range() and
      using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and
      change_protection_range")
      Signed-off-by: default avatarNadav Amit <namit@vmware.com>
      Acked-by: default avatarMel Gorman <mgorman@suse.de>
      Acked-by: default avatarRik van Riel <riel@redhat.com>
      Acked-by: default avatarMinchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      16af97dc
    • Akinobu Mita's avatar
      fault-inject: fix wrong should_fail() decision in task context · 9eeb52ae
      Akinobu Mita authored
      Commit 1203c8e6 ("fault-inject: simplify access check for fail-nth")
      unintentionally broke a conditional statement in should_fail().  Any
      faults are not injected in the task context by the change when the
      systematic fault injection is not used.
      
      This change restores to the previous correct behaviour.
      
      Link: http://lkml.kernel.org/r/1501633700-3488-1-git-send-email-akinobu.mita@gmail.com
      Fixes: 1203c8e6 ("fault-inject: simplify access check for fail-nth")
      Signed-off-by: default avatarAkinobu Mita <akinobu.mita@gmail.com>
      Reported-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Tested-by: default avatarLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9eeb52ae
    • Dan Carpenter's avatar
      test_kmod: fix small memory leak on filesystem tests · 4e98ebe5
      Dan Carpenter authored
      The break was in the wrong place so file system tests don't work as
      intended, leaking memory at each test switch.
      
      [mcgrof@kernel.org: massaged commit subject, noted memory leak issue without the fix]
      Link: http://lkml.kernel.org/r/20170802211450.27928-6-mcgrof@kernel.org
      Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Reported-by: default avatarDavid Binderman <dcb314@hotmail.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4e98ebe5
    • Dan Carpenter's avatar
      test_kmod: fix the lock in register_test_dev_kmod() · 9c567713
      Dan Carpenter authored
      We accidentally just drop the lock twice instead of taking it and then
      releasing it.  This isn't a big issue unless you are adding more than
      one device to test on, and the kmod.sh doesn't do that yet, however this
      obviously is the correct thing to do.
      
      [mcgrof@kernel.org: massaged subject, explain what happens]
      Link: http://lkml.kernel.org/r/20170802211450.27928-5-mcgrof@kernel.org
      Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader")
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      9c567713
    • Luis R. Rodriguez's avatar
      test_kmod: fix bug which allows negative values on two config options · 434b06ae
      Luis R. Rodriguez authored
      Parsing with kstrtol() enables values to be negative, and we failed to
      check for negative values when parsing with test_dev_config_update_uint_sync()
      or test_dev_config_update_uint_range().
      
      test_dev_config_update_uint_range() has a minimum check though so an
      issue is not present there.  test_dev_config_update_uint_sync() is only
      used for the number of threads to use (config_num_threads_store()), and
      indeed this would fail with an attempt for a large allocation.
      
      Although the issue is only present in practice with the first fix both
      by using kstrtoul() instead of kstrtol().
      
      Link: http://lkml.kernel.org/r/20170802211450.27928-4-mcgrof@kernel.org
      Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader")
      Signed-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Reported-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      434b06ae
    • Colin Ian King's avatar
      test_kmod: fix spelling mistake: "EMTPY" -> "EMPTY" · a4afe8cd
      Colin Ian King authored
      Trivial fix to spelling mistake in snprintf text
      
      [mcgrof@kernel.org: massaged commit message]
      Link: http://lkml.kernel.org/r/20170802211450.27928-3-mcgrof@kernel.org
      Fixes: 39258f448d71 ("kmod: add test driver to stress test the module loader")
      Signed-off-by: default avatarColin Ian King <colin.king@canonical.com>
      Signed-off-by: default avatarLuis R. Rodriguez <mcgrof@kernel.org>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      a4afe8cd
    • Andrea Arcangeli's avatar
      userfaultfd: hugetlbfs: remove superfluous page unlock in VM_SHARED case · 5af10dfd
      Andrea Arcangeli authored
      huge_add_to_page_cache->add_to_page_cache implicitly unlocks the page
      before returning in case of errors.
      
      The error returned was -EEXIST by running UFFDIO_COPY on a non-hole
      offset of a VM_SHARED hugetlbfs mapping.  It was an userland bug that
      triggered it and the kernel must cope with it returning -EEXIST from
      ioctl(UFFDIO_COPY) as expected.
      
        page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
        kernel BUG at mm/filemap.c:964!
        invalid opcode: 0000 [#1] SMP
        CPU: 1 PID: 22582 Comm: qemu-system-x86 Not tainted 4.11.11-300.fc26.x86_64 #1
        RIP: unlock_page+0x4a/0x50
        Call Trace:
          hugetlb_mcopy_atomic_pte+0xc0/0x320
          mcopy_atomic+0x96f/0xbe0
          userfaultfd_ioctl+0x218/0xe90
          do_vfs_ioctl+0xa5/0x600
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x1a/0xa9
      
      Link: http://lkml.kernel.org/r/20170802165145.22628-2-aarcange@redhat.comSigned-off-by: default avatarAndrea Arcangeli <aarcange@redhat.com>
      Tested-by: default avatarMaxime Coquelin <maxime.coquelin@redhat.com>
      Reviewed-by: default avatarMike Kravetz <mike.kravetz@oracle.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexey Perevalov <a.perevalov@samsung.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5af10dfd
    • Jonathan Toppins's avatar
      mm: ratelimit PFNs busy info message · 75dddef3
      Jonathan Toppins authored
      The RDMA subsystem can generate several thousand of these messages per
      second eventually leading to a kernel crash.  Ratelimit these messages
      to prevent this crash.
      
      Doug said:
       "I've been carrying a version of this for several kernel versions. I
        don't remember when they started, but we have one (and only one) class
        of machines: Dell PE R730xd, that generate these errors. When it
        happens, without a rate limit, we get rcu timeouts and kernel oopses.
        With the rate limit, we just get a lot of annoying kernel messages but
        the machine continues on, recovers, and eventually the memory
        operations all succeed"
      
      And:
       "> Well... why are all these EBUSY's occurring? It sounds inefficient
        > (at least) but if it is expected, normal and unavoidable then
        > perhaps we should just remove that message altogether?
      
        I don't have an answer to that question. To be honest, I haven't
        looked real hard. We never had this at all, then it started out of the
        blue, but only on our Dell 730xd machines (and it hits all of them),
        but no other classes or brands of machines. And we have our 730xd
        machines loaded up with different brands and models of cards (for
        instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
        ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
        meant it wasn't tied to any particular brand/model of RDMA hardware.
        To me, it always smelled of a hardware oddity specific to maybe the
        CPUs or mainboard chipsets in these machines, so given that I'm not an
        mm expert anyway, I never chased it down.
      
        A few other relevant details: it showed up somewhere around 4.8/4.9 or
        thereabouts. It never happened before, but the prinkt has been there
        since the 3.18 days, so possibly the test to trigger this message was
        changed, or something else in the allocator changed such that the
        situation started happening on these machines?
      
        And, like I said, it is specific to our 730xd machines (but they are
        all identical, so that could mean it's something like their specific
        ram configuration is causing the allocator to hit this on these
        machine but not on other machines in the cluster, I don't want to say
        it's necessarily the model of chipset or CPU, there are other bits of
        identicalness between these machines)"
      
      Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.comSigned-off-by: default avatarJonathan Toppins <jtoppins@redhat.com>
      Reviewed-by: default avatarDoug Ledford <dledford@redhat.com>
      Tested-by: default avatarDoug Ledford <dledford@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      75dddef3
    • Johannes Weiner's avatar
      mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Johannes Weiner authored
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters OOM/allocation failure info dumps, can cause early -ENOMEM from
      overcommit protection, and miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.orgSigned-off-by: default avatarJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: default avatarTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: default avatarMichal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d507e2eb