1. 26 Sep, 2020 9 commits
    • mm: replace memmap_context by meminit_context · c1d0da83
      Laurent Dufour authored
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies to the node directories: a memory21 link is present in
      both the node1 and node2 directories.
      
      This is wrong, but it doesn't prevent the system from running.  However,
      when one of these memory blocks is later hot-unplugged and then
      hot-plugged, the system detects an inconsistency in the sysfs layout and
      a BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when a node's memory is registered,
      the range used can overlap another node's range; thus the memory block
      gets registered under multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a), register_mem_sect_under_node() should not rely on the
      system state to detect whether the link operation is triggered by a
      hotplug operation or not.  This is addressed by patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it generic to the whole hotplug operation and rename it to
      meminit_context.
      
      There is no functional change introduced by this patch.
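      
      In effect, the rename amounts to the following (a sketch of the enum
      only; the users in mm/ are updated mechanically and are not shown):
      
        /* Before: named after the memmap initialisation path. */
        enum memmap_context {
                MEMMAP_EARLY,
                MEMMAP_HOTPLUG,
        };
      
        /* After: generic to any memory-initialisation context. */
        enum meminit_context {
                MEMINIT_EARLY,
                MEMINIT_HOTPLUG,
        };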
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback · a1cd6c2a
      Mikulas Patocka authored
      If we copy fewer than 8 bytes and the destination crosses a cache line,
      __copy_user_flushcache() would invalidate only the first cache line.
      
      This patch makes it invalidate the second cache line as well.
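      
      As an illustration of why both lines need writeback (a standalone
      sketch, not the kernel code; clean_cache_range_sketch() is a
      hypothetical stand-in for the arch-specific writeback primitive):
      
        #include <stdint.h>
        #include <stddef.h>
        #include <stdio.h>
      
        #define CACHELINE 64
      
        /* Write back every cache line touched by [addr, addr + size). */
        static void clean_cache_range_sketch(const void *addr, size_t size)
        {
                uintptr_t line = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
                uintptr_t end  = (uintptr_t)addr + size;    /* exclusive */
      
                for (; line < end; line += CACHELINE)
                        printf("writeback line at %#lx\n", (unsigned long)line);
        }
      
        int main(void)
        {
                /* A 6-byte copy that straddles a 64-byte boundary touches two
                 * cache lines (0x1000 and 0x1040) even though it copies fewer
                 * than 8 bytes. */
                clean_cache_range_sketch((void *)0x103e, 6);
                return 0;
        }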
      
      Fixes: 0aed55af ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations")
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/memregion.c: include memregion.h · ffa550cd
      Jason Yan authored
      This addresses the following sparse warning:
      
        lib/memregion.c:8:5: warning: symbol 'memregion_alloc' was not declared. Should it be static?
        lib/memregion.c:14:6: warning: symbol 'memregion_free' was not declared. Should it be static?
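      
      The warnings go away once the declarations are in scope; roughly:
      
        /* lib/memregion.c */
        #include <linux/memregion.h>  /* declares memregion_alloc()/memregion_free() */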
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20200921142852.875312-1-yanaijie@huawei.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • lib/string.c: implement stpcpy · 1e1b6d63
      Nick Desaulniers authored
      LLVM implemented a recent "libcall optimization" that lowers calls to
      `sprintf(dest, "%s", str)` whose return value is used into
      `stpcpy(dest, str) - dest`.
      
      This generally avoids the machinery involved in parsing format strings.
      `stpcpy` is just like `strcpy` except it returns the pointer to the new
      tail of `dest`.  This optimization was introduced into clang-12.
      
      Implement this so that we don't observe linkage failures due to missing
      symbol definitions for `stpcpy`.
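      
      A minimal implementation along these lines looks like the following
      (sketch only; kernel specifics such as EXPORT_SYMBOL and fortify
      handling are omitted):
      
        /* Copy src to dest and return a pointer to dest's terminating NUL. */
        char *stpcpy(char *dest, const char *src)
        {
                while ((*dest++ = *src++) != '\0')
                        /* nothing */;
                return --dest;
        }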
      
      Similar to last year's fire drill with: commit 5f074f3e
      ("lib/string.c: implement a basic bcmp")
      
      The kernel is somewhere between a "freestanding" environment (no full
      libc) and "hosted" environment (many symbols from libc exist with the
      same type, function signature, and semantics).
      
      As Peter Anvin notes, there's not really a great way to inform the
      compiler that you're targeting a freestanding environment but would like
      to opt-in to some libcall optimizations (see pr/47280 below), rather
      than opt-out.
      
      As Arvind notes, -fno-builtin-* behaves slightly differently between GCC
      and Clang, and Clang is missing many __builtin_* definitions, which I
      consider a bug in Clang and am working on fixing.
      
      Masahiro summarizes the subtle distinction between compilers justly:
        To prevent transformation from foo() into bar(), there are two ways in
        Clang to do that; -fno-builtin-foo, and -fno-builtin-bar.  There is
        only one in GCC; -fno-builtin-foo.
      
      (Any difference in that behavior in Clang is likely a bug from a missing
      __builtin_* definition.)
      
      Masahiro also notes:
        We want to disable optimization from foo() to bar(),
        but we may still benefit from the optimization from
        foo() into something else. If GCC implements the same transform, we
        would run into a problem because it is not -fno-builtin-bar, but
        -fno-builtin-foo that disables that optimization.
      
        In this regard, -fno-builtin-foo would be more future-proof than
        -fno-builtin-bar, but -fno-builtin-foo is still potentially overkill. We
        may want to prevent calls from foo() being optimized into calls to
        bar(), but we still may want other optimization on calls to foo().
      
      It seems that compilers today don't quite provide the fine grain control
      over which libcall optimizations pseudo-freestanding environments would
      prefer.
      
      Finally, Kees notes that this interface is unsafe, so we should not
      encourage its use.  As such, I've removed the declaration from any
      header, but it still needs to be exported to avoid linkage errors in
      modules.
      Reported-by: Sami Tolvanen <samitolvanen@google.com>
      Suggested-by: Andy Lavr <andy.lavr@gmail.com>
      Suggested-by: Arvind Sankar <nivedita@alum.mit.edu>
      Suggested-by: Joe Perches <joe@perches.com>
      Suggested-by: Kees Cook <keescook@chromium.org>
      Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
      Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Nathan Chancellor <natechancellor@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200914161643.938408-1-ndesaulniers@google.com
      Link: https://bugs.llvm.org/show_bug.cgi?id=47162
      Link: https://bugs.llvm.org/show_bug.cgi?id=47280
      Link: https://github.com/ClangBuiltLinux/linux/issues/1126
      Link: https://man7.org/linux/man-pages/man3/stpcpy.3.html
      Link: https://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html
      Link: https://reviews.llvm.org/D85963
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: correct thp migration stats · 6c5c7b9f
      Zi Yan authored
      PageTransHuge() returns true for both THP and hugetlb pages, so the THP
      stats were counting both THP and hugetlb migrations.  Exclude hugetlb
      migrations by setting the is_thp variable correctly.
      
      Also clean up the THP handling code while we are at it.
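      
      Conceptually, the classification becomes the following (sketch; the
      helper name is hypothetical, the actual change sets the local is_thp
      variable accordingly):
      
        /* A hugetlb page also satisfies PageTransHuge(), so exclude it
         * explicitly when counting a migration as a THP migration. */
        static inline bool migration_is_thp(struct page *page)
        {
                return PageTransHuge(page) && !PageHuge(page);
        }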
      
      Fixes: 1a5bae25 ("mm/vmstat: add events for THP migration without split")
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Link: https://lkml.kernel.org/r/20200917210413.1462975-1-zi.yan@sent.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/gup: fix gup_fast with dynamic page table folding · d3f7b1bb
      Vasily Gorbik authored
      Currently, to make sure that every page table entry is read just once,
      the gup_fast walks perform READ_ONCE() and pass the pXd value down to
      the next gup_pXd_range() function by value, e.g.:
      
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        ...
                pudp = pud_offset(&p4d, addr);
      
      This function passes a reference to that local value copy to
      pXd_offset(), and might get the very same pointer in return.  This
      happens when the level is folded (on most architectures), and such a
      pointer must not be iterated over.
      
      On s390, because each task might use a different (5-, 4- or 3-level)
      address translation and hence have different levels folded, the logic
      is more complex, and a non-iterable pointer to a local copy leads to
      severe problems.
      
      Here is an example of what happens with gup_fast on s390, for a task
      with 3-level paging, crossing a 2 GB pud boundary:
      
        // addr = 0x1007ffff000, end = 0x10080001000
        static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end,
                                 unsigned int flags, struct page **pages, int *nr)
        {
              unsigned long next;
              pud_t *pudp;
      
              // pud_offset returns &p4d itself (a pointer to a value on stack)
              pudp = pud_offset(&p4d, addr);
              do {
                      // on second iteration reading "random" stack value
                      pud_t pud = READ_ONCE(*pudp);
      
                      // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390
                      next = pud_addr_end(addr, end);
                      ...
              } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack
      
              return 1;
        }
      
      This happens since s390 moved to common gup code with commit
      d1874a0c ("s390/mm: make the pxd_offset functions more robust") and
      commit 1a42010c ("s390/mm: convert to the generic
      get_user_pages_fast code").
      
      s390 tried to mimic static level folding by changing the pXd_offset
      primitives to always calculate the top-level page table offset in
      pgd_offset() and to just return the value passed in when pXd_offset()
      has to act as folded.
      
      What is crucial for gup_fast, and what has been overlooked, is that
      PxD_SIZE/MASK and thus pXd_addr_end() should also change
      correspondingly, and the latter is not possible with dynamic folding.
      
      To fix the issue, in addition to the pXd values, pass the original pXdp
      pointers down to the gup_pXd_range() functions, and introduce
      pXd_offset_lockless() helpers which take an additional pXd entry value
      parameter.  This has already been discussed in
      
        https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1
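      
      For architectures with static folding, the new helpers can simply fall
      back to the existing primitives; a sketch of what a generic fallback
      can look like (s390 provides its own variants that use the passed-in
      pXdp):
      
        #ifndef pud_offset_lockless
        #define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address)
        #endif
      
      so that gup_pud_range() can do
      
        pudp = pud_offset_lockless(p4dp, p4d, addr);
      
      without ever iterating over a pointer into its own stack frame.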
      
      Fixes: 1a42010c ("s390/mm: convert to the generic get_user_pages_fast code")
      Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcontrol: fix missing suffix of workingset_restore · 8d3fe09d
      Muchun Song authored
      We forgot to add the suffix to the workingset_restore string, so fix it.
      
      Also update the cgroup-v2.rst documentation accordingly.
      
      Fixes: 170b04b7 ("mm/workingset: prepare the workingset detection infrastructure for anon LRU")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Link: https://lkml.kernel.org/r/20200916100030.71698-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, THP, swap: fix allocating cluster for swapfile by mistake · 41663430
      Gao Xiang authored
      SWP_FS is used to make swap_{read,write}page() go through the
      filesystem, and it's only used for swap files over NFS.  So !SWP_FS
      means non-NFS for now; the swap area could be either file-backed or
      device-backed.  Something similar applied to the legacy SWP_FILE flag.
      
      So in order to achieve the goal of the original patch, SWP_BLKDEV should
      be used instead.
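      
      A sketch of the intended condition (the helper name is hypothetical;
      the actual change adjusts the existing check in the swap code):
      
        /* THP-sized cluster allocation is only safe when the swap area sits
         * directly on a block device; "not NFS" (!SWP_FS) is not enough,
         * because a plain swapfile on a local filesystem also clears SWP_FS. */
        static inline bool thp_swap_cluster_ok(struct swap_info_struct *si)
        {
                return IS_ENABLED(CONFIG_THP_SWAP) && (si->flags & SWP_BLKDEV);
        }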
      
      FS corruption can be observed with SSD device + XFS + fragmented
      swapfile due to CONFIG_THP_SWAP=y.
      
      I reproduced the issue with the following details:
      
      Environment:
      
        QEMU + upstream kernel + buildroot + NVMe (2 GB)
      
      Kernel config:
      
        CONFIG_BLK_DEV_NVME=y
        CONFIG_THP_SWAP=y
      
      Some reproducible steps:
      
        mkfs.xfs -f /dev/nvme0n1
        mkdir /tmp/mnt
        mount /dev/nvme0n1 /tmp/mnt
        bs="32k"
        sz="1024m"    # doesn't matter too much, I also tried 16m
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw
        xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw
      
        mkswap /tmp/mnt/sw
        swapon /tmp/mnt/sw
      
        stress --vm 2 --vm-bytes 600M   # doesn't matter too much as well
      
      Symptoms:
       - FS corruption (e.g. checksum failure)
       - memory corruption at: 0xd2808010
       - segfault
      
      Fixes: f0eea189 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device")
      Fixes: 38d8b4e6 ("mm, THP, swap: delay splitting THP during swap out")
      Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Eric Sandeen <esandeen@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · 7c7ec322
      Linus Torvalds authored
      Pull more kvm fixes from Paolo Bonzini:
       "Five small fixes.
      
        The nested migration bug will be fixed with a better API in 5.10 or
        5.11, for now this is a fix that works with existing userspace but
        keeps the current ugly API"
      
      * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
        KVM: SVM: Add a dedicated INVD intercept routine
        KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE
        KVM: x86: fix MSR_IA32_TSC read for nested migration
        selftests: kvm: Fix assert failure in single-step test
        KVM: x86: VMX: Make smaller physical guest address space support user-configurable
  2. 25 Sep, 2020 15 commits
  3. 24 Sep, 2020 6 commits
  4. 23 Sep, 2020 10 commits