1. 15 Dec, 2020 40 commits
    • Mike Rapoport's avatar
      ia64: make SPARSEMEM default and disable DISCONTIGMEM · 214496cb
      Mike Rapoport authored
      SPARSEMEM memory model suitable for systems with large holes in their
      phyiscal memory layout. With SPARSEMEM_VMEMMAP enabled it provides
      pfn_to_page() and page_to_pfn() as fast as FLATMEM.
      
      Make it the default memory model for IA-64 and disable DISCONTIGMEM which
      is considered obsolete for quite some time.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-8-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      214496cb
    • Mike Rapoport's avatar
      ia64: forbid using VIRTUAL_MEM_MAP with FLATMEM · ea34f78f
      Mike Rapoport authored
      Virtual memory map was intended to avoid wasting memory on the memory map
      on systems with large holes in the physical memory layout. Long ago it been
      superseded first by DISCONTIGMEM and then by SPARSEMEM. Moreover,
      SPARSEMEM_VMEMMAP provide the same functionality in much more portable way.
      
      As the first step to removing the VIRTUAL_MEM_MAP forbid it's usage with
      FLATMEM and panic on systems with large holes in the physical memory
      layout that try to run FLATMEM kernels.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-7-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ea34f78f
    • Mike Rapoport's avatar
      ia64: split virtual map initialization out of paging_init() · 1f112129
      Mike Rapoport authored
      For both FLATMEM and DISCONTIGMEM/SPARSEMEM the virtual map initialization
      is spread over paging_init() for no good reason.
      
      Split out the bits related to virtual map initialization to a helper
      functions, one for FLATMEM and another for !FLATMEM configurations.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-6-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      1f112129
    • Mike Rapoport's avatar
      ia64: discontig: paging_init(): remove local max_pfn calculation · b90b5547
      Mike Rapoport authored
      The maximal PFN in the system is calculated during find_memory() time and
      it is stored at max_low_pfn then.
      
      Use this value in paging_init() and remove the redundant detection of
      max_pfn in that function.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-5-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b90b5547
    • Mike Rapoport's avatar
      ia64: remove 'ifdef CONFIG_ZONE_DMA32' statements · 5d37fc0b
      Mike Rapoport authored
      After the removal of SN2 platform (commit cf07cb1f ("ia64: remove
      support for the SGI SN2 platform") IA-64 always has ZONE_DMA32 and there is
      no point to guard code with this configuration option.
      
      Remove ifdefery associated with CONFIG_ZONE_DMA32
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-4-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      5d37fc0b
    • Mike Rapoport's avatar
      ia64: remove custom __early_pfn_to_nid() · 03e92a5e
      Mike Rapoport authored
      The ia64 implementation of __early_pfn_to_nid() essentially relies on the
      same data as the generic implementation.
      
      The correspondence between memory ranges and nodes is set in memblock
      during early memory initialization in register_active_ranges() function.
      
      The initialization of sparsemem that requires early_pfn_to_nid() happens
      later and it can use the memblock information like the other architectures.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      03e92a5e
    • Mike Rapoport's avatar
      alpha: switch from DISCONTIGMEM to SPARSEMEM · 36d40290
      Mike Rapoport authored
      Patch series "arch, mm: deprecate DISCONTIGMEM", v2.
      
      It's been a while since DISCONTIGMEM is generally considered deprecated,
      but it is still used by four architectures.  This set replaces
      DISCONTIGMEM with a different way to handle holes in the memory map and
      marks DISCONTIGMEM configuration as BROKEN in Kconfigs of these
      architectures with the intention to completely remove it in several
      releases.
      
      While for 64-bit alpha and ia64 the switch to SPARSEMEM is quite obvious
      and was a matter of moving some bits around, for smaller 32-bit arc and
      m68k SPARSEMEM is not necessarily the best thing to do.
      
      On 32-bit machines SPARSEMEM would require large sections to make section
      index fit in the page flags, but larger sections mean that more memory is
      wasted for unused memory map.
      
      Besides, pfn_to_page() and page_to_pfn() become less efficient, at least
      on arc.
      
      So I've decided to generalize arm's approach for freeing of unused parts
      of the memory map with FLATMEM and enable it for both arc and m68k.  The
      details are in the description of patches 10 (arc) and 13 (m68k).
      
      This patch (of 13):
      
      Enable SPARSEMEM support on alpha and deprecate DISCONTIGMEM.
      
      The required changes are mostly around moving duplicated definitions of
      page access and address conversion macros to a common place and making sure
      they are available for all memory models.
      
      The DISCONTINGMEM support is marked as BROKEN an will be removed in a
      couple of releases.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201101170454.9567-2-rppt@kernel.orgSigned-off-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      36d40290
    • Marco Elver's avatar
      lkdtm: disable KASAN for rodata.o · 6d5a88cd
      Marco Elver authored
      Building lkdtm with KASAN and Clang 11 or later results in the following
      error when attempting to load the module:
      
        kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
        BUG: unable to handle page fault for address: ffffffffc019cd70
        #PF: supervisor instruction fetch in kernel mode
        #PF: error_code(0x0011) - permissions violation
        ...
        RIP: 0010:asan.module_ctor+0x0/0xffffffffffffa290 [lkdtm]
        ...
        Call Trace:
         do_init_module+0x17c/0x570
         load_module+0xadee/0xd0b0
         __x64_sys_finit_module+0x16c/0x1a0
         do_syscall_64+0x34/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that rodata.o generates a dummy function that lives in
      .rodata to validate that .rodata can't be executed; however, Clang 11 adds
      KASAN globals support by generating module constructors to initialize
      globals redzones.  When Clang 11 adds a module constructor to rodata.o, it
      is also added to .rodata: any attempt to call it on initialization results
      in the above error.
      
      Therefore, disable KASAN instrumentation for rodata.o.
      
      Link: https://lkml.kernel.org/r/20201214191413.3164796-1-elver@google.comSigned-off-by: default avatarMarco Elver <elver@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      6d5a88cd
    • Walter Wu's avatar
      kasan: update documentation for generic kasan · 4784be28
      Walter Wu authored
      Generic KASAN also supports to record the last two workqueue stacks and
      print them in KASAN report.  So that need to update documentation.
      
      Link: https://lkml.kernel.org/r/20201203023037.30792-1-walter-zh.wu@mediatek.comSigned-off-by: default avatarWalter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarMarco Elver <elver@google.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      4784be28
    • Walter Wu's avatar
      lib/test_kasan.c: add workqueue test case · 214c783d
      Walter Wu authored
      Adds a test to verify workqueue stack recording and print it in
      KASAN report.
      
      The KASAN report was as follows(cleaned up slightly):
      
       BUG: KASAN: use-after-free in kasan_workqueue_uaf
      
       Freed by task 54:
        kasan_save_stack+0x24/0x50
        kasan_set_track+0x24/0x38
        kasan_set_free_info+0x20/0x40
        __kasan_slab_free+0x10c/0x170
        kasan_slab_free+0x10/0x18
        kfree+0x98/0x270
        kasan_workqueue_work+0xc/0x18
      
       Last potentially related work creation:
        kasan_save_stack+0x24/0x50
        kasan_record_wq_stack+0xa8/0xb8
        insert_work+0x48/0x288
        __queue_work+0x3e8/0xc40
        queue_work_on+0xf4/0x118
        kasan_workqueue_uaf+0xfc/0x190
      
      Link: https://lkml.kernel.org/r/20201203022748.30681-1-walter-zh.wu@mediatek.comSigned-off-by: default avatarWalter Wu <walter-zh.wu@mediatek.com>
      Acked-by: default avatarMarco Elver <elver@google.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      214c783d
    • Walter Wu's avatar
      kasan: print workqueue stack · ef133461
      Walter Wu authored
      The aux_stack[2] is reused to record the call_rcu() call stack and
      enqueuing work call stacks.  So that we need to change the auxiliary stack
      title for common title, print them in KASAN report.
      
      Link: https://lkml.kernel.org/r/20201203022715.30635-1-walter-zh.wu@mediatek.comSigned-off-by: default avatarWalter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarMarco Elver <elver@google.com>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ef133461
    • Walter Wu's avatar
      workqueue: kasan: record workqueue stack · e89a85d6
      Walter Wu authored
      Patch series "kasan: add workqueue stack for generic KASAN", v5.
      
      Syzbot reports many UAF issues for workqueue, see [1].
      
      In some of these access/allocation happened in process_one_work(), we
      see the free stack is useless in KASAN report, it doesn't help
      programmers to solve UAF for workqueue issue.
      
      This patchset improves KASAN reports by making them to have workqueue
      queueing stack.  It is useful for programmers to solve use-after-free or
      double-free memory issue.
      
      Generic KASAN also records the last two workqueue stacks and prints them
      in KASAN report.  It is only suitable for generic KASAN.
      
      [1] https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work
      [2] https://bugzilla.kernel.org/show_bug.cgi?id=198437
      
      This patch (of 4):
      
      When analyzing use-after-free or double-free issue, recording the
      enqueuing work stacks is helpful to preserve usage history which
      potentially gives a hint about the affected code.
      
      For workqueue it has turned out to be useful to record the enqueuing work
      call stacks.  Because user can see KASAN report to determine whether it is
      root cause.  They don't need to enable debugobjects, but they have a
      chance to find out the root cause.
      
      Link: https://lkml.kernel.org/r/20201203022148.29754-1-walter-zh.wu@mediatek.com
      Link: https://lkml.kernel.org/r/20201203022442.30006-1-walter-zh.wu@mediatek.comSigned-off-by: default avatarWalter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarMarco Elver <elver@google.com>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Reviewed-by: default avatarDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: default avatarAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      e89a85d6
    • Vincenzo Frascino's avatar
      mm/vmalloc.c: fix kasan shadow poisoning size · c041098c
      Vincenzo Frascino authored
      The size of vm area can be affected by the presence or not of the guard
      page.  In particular when VM_NO_GUARD is present, the actual accessible
      size has to be considered like the real size minus the guard page.
      
      Currently kasan does not keep into account this information during the
      poison operation and in particular tries to poison the guard page as well.
      
      This approach, even if incorrect, does not cause an issue because the tags
      for the guard page are written in the shadow memory.  With the future
      introduction of the Tag-Based KASAN, being the guard page inaccessible by
      nature, the write tag operation on this page triggers a fault.
      
      Fix kasan shadow poisoning size invoking get_vm_area_size() instead of
      accessing directly the field in the data structure to detect the correct
      value.
      
      Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com
      Fixes: d98c9e83 ("kasan: fix crashes on access to memory mapped by vm_map_ram()")
      Signed-off-by: default avatarVincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c041098c
    • Alex Shi's avatar
      docs/vm: remove unused 3 items explanation for /proc/vmstat · 56db19fe
      Alex Shi authored
      Commit 5647bc29 ("mm: compaction: Move migration fail/success
      stats to migrate.c"), removed 3 items in /proc/vmstat. but the docs
      still has their explanation. let's remove them.
      
      "compact_blocks_moved",
      "compact_pages_moved",
      "compact_pagemigrate_failed",
      
      Link: https://lkml.kernel.org/r/1605520282-51993-1-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: default avatarAlex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: default avatarZi Yan <ziy@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      56db19fe
    • Waiman Long's avatar
      mm/vmalloc: Fix unlock order in s_stop() · 0a7dd4e9
      Waiman Long authored
      When multiple locks are acquired, they should be released in reverse
      order. For s_start() and s_stop() in mm/vmalloc.c, that is not the
      case.
      
        s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock);
        s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock);
      
      This unlock sequence, though allowed, is not optimal. If a waiter is
      present, mutex_unlock() will need to go through the slowpath of waking
      up the waiter with preemption disabled. Fix that by releasing the
      spinlock first before the mutex.
      
      Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com
      Fixes: e36176be ("mm/vmalloc: rework vmap_area_lock")
      Signed-off-by: default avatarWaiman Long <longman@redhat.com>
      Reviewed-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0a7dd4e9
    • Baolin Wang's avatar
    • Alex Shi's avatar
      mm/vmalloc: add 'align' parameter explanation for pvm_determine_end_from_reverse · 799fa85d
      Alex Shi authored
      Kernel-doc markup has a issue on pvm_determine_end_from_reverse:
      
        mm/vmalloc.c:3145: warning: Function parameter or member 'align' not described in 'pvm_determine_end_from_reverse'
      
      Add a explanation for it to remove the warning.
      
      Link: https://lkml.kernel.org/r/1605605088-30668-3-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: default avatarAlex Shi <alex.shi@linux.alibaba.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      799fa85d
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: rework the drain logic · 96e2db45
      Uladzislau Rezki (Sony) authored
      A current "lazy drain" model suffers from at least two issues.
      
      First one is related to the unsorted list of vmap areas, thus in order to
      identify the [min:max] range of areas to be drained, it requires a full
      list scan.  What is a time consuming if the list is too long.
      
      Second one and as a next step is about merging all fragments with a free
      space.  What is also a time consuming because it has to iterate over
      entire list which holds outstanding lazy areas.
      
      See below the "preemptirqsoff" tracer that illustrates a high latency.  It
      is ~24676us.  Our workloads like audio and video are effected by such long
      latency:
      
      <snip>
        tracer: preemptirqsoff
      
        preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
        --------------------------------------------------------------------
        latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
           -----------------
           | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
           -----------------
         => started at: __purge_vmap_area_lazy
         => ended at:   __purge_vmap_area_lazy
      
                         _------=> CPU#
                        / _-----=> irqs-off
                       | / _----=> need-resched
                       || / _---=> hardirq/softirq
                       ||| / _--=> preempt-depth
                       |||| /     delay
         cmd     pid   ||||| time  |   caller
            \   /      |||||  \    |   /
      crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
      [...]
      crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24683us : <stack trace>
       => free_vmap_area_noflush
       => remove_vm_area
       => __vunmap
       => vfree
       => drm_property_free_blob
       => drm_mode_object_unreference
       => drm_property_unreference_blob
       => __drm_atomic_helper_crtc_destroy_state
       => sde_crtc_destroy_state
       => drm_atomic_state_default_clear
       => drm_atomic_state_clear
       => drm_atomic_state_free
       => complete_commit
       => _msm_drm_commit_work_cb
       => kthread_worker_fn
       => kthread
       => ret_from_fork
      <snip>
      
      To address those two issues we can redesign a purging of the outstanding
      lazy areas.  Instead of queuing vmap areas to the list, we replace it by
      the separate rb-tree.  In hat case an area is located in the tree/list in
      ascending order.  It will give us below advantages:
      
      a) Outstanding vmap areas are merged creating bigger coalesced blocks,
         thus it becomes less fragmented.
      
      b) It is possible to calculate a flush range [min:max] without scanning
         all elements.  It is O(1) access time or complexity;
      
      c) The final merge of areas with the rb-tree that represents a free
         space is faster because of (a).  As a result the lock contention is
         also reduced.
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      96e2db45
    • Uladzislau Rezki (Sony)'s avatar
      mm/vmalloc: use free_vm_area() if an allocation fails · 8945a723
      Uladzislau Rezki (Sony) authored
      There is a dedicated and separate function that finds and removes a
      continuous kernel virtual area.  As a final step it also releases the
      "area", a descriptor of corresponding vm_struct.
      
      Use free_vmap_area() in the __vmalloc_node_range() instead of open coded
      steps which are exactly the same, to perform a cleanup.
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-1-urezki@gmail.comSigned-off-by: default avatarUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8945a723
    • Andrew Morton's avatar
      mm/vmalloc.c:__vmalloc_area_node(): avoid 32-bit overflow · 34fe6537
      Andrew Morton authored
      With a machine with 3 TB (more than 2 TB memory).  If you use vmalloc to
      allocate > 2 TB memory, the array_size below will be overflowed.
      
      The array_size is an unsigned int and can only be used to allocate less
      than 2 TB memory.  If you pass 2*1028*1028*1024*1024 = 2 * 2^40 in the
      argument of vmalloc.  The array_size will become 2*2^31 = 2^32.  The 2^32
      cannot be store with a 32 bit integer.
      
      The fix is to change the type of array_size to unsigned long.
      
      [akpm@linux-foundation.org: rework for current mainline]
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=210023
      Reported-by: <hsinhuiwu@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      34fe6537
    • Daniel Vetter's avatar
      locking/selftests: add testcases for fs_reclaim · d5037d1d
      Daniel Vetter authored
      Since I butchered this I figured better to make sure we have testcases for
      this now.  Since we only have a locking context for __GFP_FS that's the
      only thing we're testing right now.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-4-daniel.vetter@ffwll.chSigned-off-by: default avatarDaniel Vetter <daniel.vetter@intel.com>
      Acked-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d5037d1d
    • Daniel Vetter's avatar
      mm: extract might_alloc() debug check · 95d6c701
      Daniel Vetter authored
      Extracted from slab.h, which seems to have the most complete version
      including the correct might_sleep() check.  Roll it out to slob.c.
      
      Motivated by a discussion with Paul about possibly changing call_rcu
      behaviour to allocate memory, but only roughly every 500th call.
      
      There are a lot fewer places in the kernel that care about whether
      allocating memory is allowed or not (due to deadlocks with reclaim code)
      than places that care whether sleeping is allowed.  But debugging these
      also tends to be a lot harder, so nice descriptive checks could come in
      handy.  I might have some use eventually for annotations in drivers/gpu.
      
      Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does
      not consult the PF_MEMALLOC flags.  But there is no flag equivalent for
      GFP_NOWAIT, hence this check can't go wrong due to
      memalloc_no*_save/restore contexts.  Willy is working on a patch series
      which might change this:
      
      https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/
      
      I think best would be if that updates gfpflags_allow_blocking(), since
      there's a ton of callers all over the place for that already.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.chSigned-off-by: default avatarDaniel Vetter <daniel.vetter@intel.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Acked-by: default avatarPaul E. McKenney <paulmck@kernel.org>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      95d6c701
    • Daniel Vetter's avatar
      mm: track mmu notifiers in fs_reclaim_acquire/release · f920e413
      Daniel Vetter authored
      fs_reclaim_acquire/release nicely catch recursion issues when allocating
      GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep
      the excessive caches in check).  For mmu notifier recursions we do have
      lockdep annotations since 23b68395 ("mm/mmu_notifiers: add a lockdep
      map for invalidate_range_start/end").
      
      But these only fire if a path actually results in some pte invalidation -
      for most small allocations that's very rarely the case.  The other trouble
      is that pte invalidation can happen any time when __GFP_RECLAIM is set.
      Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good
      enough to avoid potential mmu notifier recursion.
      
      I was pondering whether we should just do the general annotation, but
      there's always the risk for false positives.  Plus I'm assuming that the
      core fs and io code is a lot better reviewed and tested than random mmu
      notifier code in drivers.  Hence why I decide to only annotate for that
      specific case.
      
      Furthermore even if we'd create a lockdep map for direct reclaim, we'd
      still need to explicit pull in the mmu notifier map - there's a lot more
      places that do pte invalidation than just direct reclaim, these two
      contexts arent the same.
      
      Note that the mmu notifiers needing their own independent lockdep map is
      also the reason we can't hold them from fs_reclaim_acquire to
      fs_reclaim_release - it would nest with the acquistion in the pte
      invalidation code, causing a lockdep splat.  And we can't remove the
      annotations from pte invalidation and all the other places since they're
      called from many other places than page reclaim.  Hence we can only do the
      equivalent of might_lock, but on the raw lockdep map.
      
      With this we can also remove the lockdep priming added in 66204f1d
      ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly
      more powerful.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-2-daniel.vetter@ffwll.chSigned-off-by: default avatarDaniel Vetter <daniel.vetter@intel.com>
      Reviewed-by: default avatarJason Gunthorpe <jgg@nvidia.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f920e413
    • Dmitry Safonov's avatar
      mm: forbid splitting special mappings · 871402e0
      Dmitry Safonov authored
      Don't allow splitting of vm_special_mapping's.  It affects vdso/vvar
      areas.  Uprobes have only one page in xol_area so they aren't affected.
      
      Those restrictions were enforced by checks in .mremap() callbacks.
      Restrict resizing with generic .split() callback.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-7-dima@arista.comSigned-off-by: default avatarDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      871402e0
    • Dmitry Safonov's avatar
      mremap: check if it's possible to split original vma · 73d5e062
      Dmitry Safonov authored
      If original VMA can't be split at the desired address, do_munmap() will
      fail and leave both new-copied VMA and old VMA.  De-facto it's
      MREMAP_DONTUNMAP behaviour, which is unexpected.
      
      Currently, it may fail such way for hugetlbfs and dax device mappings.
      
      Minimize such unpleasant situations to OOM by checking .may_split() before
      attempting to create a VMA copy.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-6-dima@arista.comSigned-off-by: default avatarDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      73d5e062
    • Dmitry Safonov's avatar
      vm_ops: rename .split() callback to .may_split() · dd3b614f
      Dmitry Safonov authored
      Rename the callback to reflect that it's not called *on* or *after* split,
      but rather some time before the splitting to check if it's possible.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.comSigned-off-by: default avatarDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dd3b614f
    • Dmitry Safonov's avatar
      mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio · cd544fd1
      Dmitry Safonov authored
      As kernel expect to see only one of such mappings, any further operations
      on the VMA-copy may be unexpected by the kernel.  Maybe it's being on the
      safe side, but there doesn't seem to be any expected use-case for this, so
      restrict it now.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: default avatarDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      cd544fd1
    • Dmitry Safonov's avatar
      mm/mremap: for MREMAP_DONTUNMAP check security_vm_enough_memory_mm() · ad8ee77e
      Dmitry Safonov authored
      Currently memory is accounted post-mremap() with MREMAP_DONTUNMAP, which
      may break overcommit policy.  So, check if there's enough memory before
      doing actual VMA copy.
      
      Don't unset VM_ACCOUNT on MREMAP_DONTUNMAP.  By semantics, such mremap()
      is actually a memory allocation.  That also simplifies the error-path a
      little.
      
      Also, as it's memory allocation on success don't reset hiwater_vm value.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-3-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: default avatarDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      ad8ee77e
    • Dmitry Safonov's avatar
      mm/mremap: account memory on do_munmap() failure · 51df7bcb
      Dmitry Safonov authored
      Patch series "mremap: move_vma() fixes".
      
      This patch (of 6):
      
      move_vma() copies VMA without adding it to account, then unmaps old part
      of VMA.  On failure it unmaps the new VMA.  With hacks accounting in
      munmap is disabled as it's a copy of existing VMA.
      
      Account the memory on munmap() failure which was previously copied into
      a new VMA.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-1-dima@arista.com
      Link: https://lkml.kernel.org/r/20201013013416.390574-2-dima@arista.com
      Fixes: commit e2ea8374 ("[PATCH] mremap: move_vma fixes and cleanup")
      Signed-off-by: default avatarDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      51df7bcb
    • Matthew Wilcox (Oracle)'s avatar
      mm: move free_unref_page to mm/internal.h · 0966aeb4
      Matthew Wilcox (Oracle) authored
      Code outside mm/ should not be calling free_unref_page().  Also move
      free_unref_page_list().
      
      Link: https://lkml.kernel.org/r/20201125034655.27687-2-willy@infradead.orgSigned-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      0966aeb4
    • Matthew Wilcox (Oracle)'s avatar
      sparc: fix handling of page table constructor failure · 06517c9a
      Matthew Wilcox (Oracle) authored
      The page has just been allocated, so its refcount is 1.  free_unref_page()
      is for use on pages which have a zero refcount.  Use __free_page() like
      the other implementations of pte_alloc_one().
      
      Link: https://lkml.kernel.org/r/20201125034655.27687-1-willy@infradead.org
      Fixes: 1ae9ae5f ("sparc: handle pgtable_page_ctor() fail")
      Signed-off-by: default avatarMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: default avatarDavid Hildenbrand <david@redhat.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      06517c9a
    • Axel Rasmussen's avatar
      mm: mmap_lock: add tracepoints around lock acquisition · 2b5067a8
      Axel Rasmussen authored
      The goal of these tracepoints is to be able to debug lock contention
      issues.  This lock is acquired on most (all?) mmap / munmap / page fault
      operations, so a multi-threaded process which does a lot of these can
      experience significant contention.
      
      We trace just before we start acquisition, when the acquisition returns
      (whether it succeeded or not), and when the lock is released (or
      downgraded).  The events are broken out by lock type (read / write).
      
      The events are also broken out by memcg path.  For container-based
      workloads, users often think of several processes in a memcg as a single
      logical "task", so collecting statistics at this level is useful.
      
      The end goal is to get latency information.  This isn't directly included
      in the trace events.  Instead, users are expected to compute the time
      between "start locking" and "acquire returned", using e.g.  synthetic
      events or BPF.  The benefit we get from this is simpler code.
      
      Because we use tracepoint_enabled() to decide whether or not to trace,
      this patch has effectively no overhead unless tracepoints are enabled at
      runtime.  If tracepoints are enabled, there is a performance impact, but
      how much depends on exactly what e.g.  the BPF program does.
      
      [axelrasmussen@google.com: fix use-after-free race and css ref leak in tracepoints]
        Link: https://lkml.kernel.org/r/20201130233504.3725241-1-axelrasmussen@google.com
      [axelrasmussen@google.com: v3]
        Link: https://lkml.kernel.org/r/20201207213358.573750-1-axelrasmussen@google.com
      [rostedt@goodmis.org: in-depth examples of tracepoint_enabled() usage, and per-cpu-per-context buffer design]
      
      Link: https://lkml.kernel.org/r/20201105211739.568279-2-axelrasmussen@google.comSigned-off-by: default avatarAxel Rasmussen <axelrasmussen@google.com>
      Acked-by: default avatarVlastimil Babka <vbabka@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2b5067a8
    • Alex Shi's avatar
      mm/page_vma_mapped.c: add colon to fix kernel-doc markups error for check_pte · 777f303c
      Alex Shi authored
      check_pte() needs a correct colon for kernel-doc markup, otherwise, gcc
      has the following warning for W=1, mm/page_vma_mapped.c:86: warning:
      Function parameter or member 'pvmw' not described in 'check_pte'
      
      Link: https://lkml.kernel.org/r/1605597167-25145-1-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: default avatarAlex Shi <alex.shi@linux.alibaba.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      777f303c
    • Alex Shi's avatar
      mm/mapping_dirty_helpers: enhance the kernel-doc markups · f5b7e739
      Alex Shi authored
      Add and change parameter explanation for wp_pte and clean_record_pte, to
      avoid W1 warning:
      
        mm/mapping_dirty_helpers.c:34: warning: Function parameter or member 'end' not described in 'wp_pte'
        mm/mapping_dirty_helpers.c:88: warning: Function parameter or member 'end' not described in 'clean_record_pte'
      
      Link: https://lkml.kernel.org/r/1605605088-30668-2-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: default avatarAlex Shi <alex.shi@linux.alibaba.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f5b7e739
    • John Hubbard's avatar
      mm: cleanup: remove unused tsk arg from __access_remote_vm · d3f5ffca
      John Hubbard authored
      Despite a comment that said that page fault accounting would be charged to
      whatever task_struct* was passed into __access_remote_vm(), the tsk
      argument was actually unused.
      
      Making page fault accounting actually use this task struct is quite a
      project, so there is no point in keeping the tsk argument.
      
      Delete both the comment, and the argument.
      
      [rppt@linux.ibm.com: changelog addition]
      
      Link: https://lkml.kernel.org/r/20201026074137.4147787-1-jhubbard@nvidia.comSigned-off-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Reviewed-by: default avatarMike Rapoport <rppt@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      d3f5ffca
    • Kalesh Singh's avatar
      x86: mremap speedup - Enable HAVE_MOVE_PUD · be37c98d
      Kalesh Singh authored
      HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
      source and destination addresses are PUD-aligned.
      
      With HAVE_MOVE_PUD enabled it can be inferred that there is
      approximately a 13x improvement in performance on x86.  (See data
      below).
      
      ------- Test Results ---------
      
      The following results were obtained using a 5.4 kernel, by remapping
      a PUD-aligned, 1GB sized region to a PUD-aligned destination.
      The results from 10 iterations of the test are given below:
      
      Total mremap times for 1GB data on x86. All times are in nanoseconds.
      
        Control        HAVE_MOVE_PUD
      
        180394         15089
        235728         14056
        238931         25741
        187330         13838
        241742         14187
        177925         14778
        182758         14728
        160872         14418
        205813         15107
        245722         13998
      
        205721.5       15594    <-- Mean time in nanoseconds
      
      A 1GB mremap completion time drops from ~205 microseconds
      to ~15 microseconds on x86. (~13x speed up).
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-6-kaleshsingh@google.comSigned-off-by: default avatarKalesh Singh <kaleshsingh@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: default avatarIngo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      be37c98d
    • Kalesh Singh's avatar
      arm64: mremap speedup - enable HAVE_MOVE_PUD · f5308c89
      Kalesh Singh authored
      HAVE_MOVE_PUD enables remapping pages at the PUD level if both the source
      and destination addresses are PUD-aligned.
      
      With HAVE_MOVE_PUD enabled it can be inferred that there is approximately
      a 19x improvement in performance on arm64.  (See data below).
      
      ------- Test Results ---------
      
      The following results were obtained using a 5.4 kernel, by remapping a
      PUD-aligned, 1GB sized region to a PUD-aligned destination.  The results
      from 10 iterations of the test are given below:
      
      Total mremap times for 1GB data on arm64. All times are in nanoseconds.
      
        Control          HAVE_MOVE_PUD
      
        1247761          74271
        1219896          46771
        1094792          59687
        1227760          48385
        1043698          76666
        1101771          50365
        1159896          52500
        1143594          75261
        1025833          61354
        1078125          48697
      
        1134312.6        59395.7    <-- Mean time in nanoseconds
      
      A 1GB mremap completion time drops from ~1.1 milliseconds to ~59
      microseconds on arm64.  (~19x speed up).
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-5-kaleshsingh@google.comSigned-off-by: default avatarKalesh Singh <kaleshsingh@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f5308c89
    • Kalesh Singh's avatar
      mm: speedup mremap on 1GB or larger regions · c49dd340
      Kalesh Singh authored
      Android needs to move large memory regions for garbage collection.  The GC
      requires moving physical pages of multi-gigabyte heap using mremap.
      During this move, the application threads have to be paused for
      correctness.  It is critical to keep this pause as short as possible to
      avoid jitters during user interaction.
      
      Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if
      the source and destination addresses are PUD-aligned.  For
      CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD
      entries, since the PUD entry is “folded back” onto the PGD entry.  Add
      HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't
      supported/tested can turn this off by not selecting the config.
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.comSigned-off-by: default avatarKalesh Singh <kaleshsingh@google.com>
      Acked-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c49dd340
    • Kalesh Singh's avatar
      kselftests: vm: add mremap tests · 7df66625
      Kalesh Singh authored
      Patch series "Speed up mremap on large regions", v4.
      
      mremap time can be optimized by moving entries at the PMD/PUD level if the
      source and destination addresses are PMD/PUD-aligned and PMD/PUD-sized.
      Enable moving at the PMD and PUD levels on arm64 and x86.  Other
      architectures where this type of move is supported and known to be safe
      can also opt-in to these optimizations by enabling HAVE_MOVE_PMD and
      HAVE_MOVE_PUD.
      
      Observed Performance Improvements for remapping a PUD-aligned 1GB-sized
      region on x86 and arm64:
      
          - HAVE_MOVE_PMD is already enabled on x86 : N/A
          - Enabling HAVE_MOVE_PUD on x86   : ~13x speed up
      
          - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up
          - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up
      
                Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD
                give a total of ~150x speed up on arm64.
      
      This patch (of 4):
      
      Test mremap on regions of various sizes and alignments and validate data
      after remapping.  Also provide total time for remapping the region which
      is useful for performance comparison of the mremap optimizations that move
      pages at the PMD/PUD levels if HAVE_MOVE_PMD and/or HAVE_MOVE_PUD are
      enabled.
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-1-kaleshsingh@google.com
      Link: https://lkml.kernel.org/r/20201014005320.2233162-2-kaleshsingh@google.comSigned-off-by: default avatarKalesh Singh <kaleshsingh@google.com>
      Reviewed-by: default avatarJohn Hubbard <jhubbard@nvidia.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: SeongJae Park <sjpark@amazon.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      7df66625
    • Dan Williams's avatar
      xen/unpopulated-alloc: consolidate pgmap manipulation · 3a250629
      Dan Williams authored
      Cleanup fill_list() to keep all the pgmap manipulations in a single
      location of the function.  Update the exit unwind path accordingly.
      
      Link: http://lore.kernel.org/r/6186fa28-d123-12db-6171-a75cb6e615a5@oracle.com
      Link: https://lkml.kernel.org/r/160272253442.3136502.16683842453317773487.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: default avatarDan Williams <dan.j.williams@intel.com>
      Reported-by: default avatarBoris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      3a250629