  1. 13 Oct, 2022 14 commits
    • hmm-tests: add test for migrate_device_range() · ad4c3652
      Alistair Popple authored
      Link: https://lkml.kernel.org/r/a73cf109de0224cfd118d22be58ddebac3ae2897.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nouveau/dmem: evict device private memory during release · 24988123
      Alistair Popple authored
      When the module is unloaded or a GPU is unbound from the module it is
      possible for device private pages to still be mapped in currently running
      processes.  This can lead to hangs and RCU stall warnings when unbinding
      the device as memunmap_pages() will wait in an uninterruptible state until
      all device pages have been freed, which may never happen.
      
      Fix this by migrating device mappings back to normal CPU memory prior to
      freeing the GPU memory chunks and associated device private pages.
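
      A rough sketch of this eviction flow, using the migrate_device_*() helpers
      added later in this series.  The chunk layout, the helper name and the
      elided copy step are illustrative assumptions, not the actual nouveau code:

        /* Hypothetical: migrate every device private page in a chunk back to
         * system memory before the chunk and its pages are freed.  Allocation
         * error handling is elided for brevity. */
        static void drv_evict_chunk(unsigned long first_pfn, unsigned long npages)
        {
        	unsigned long *src = kvcalloc(npages, sizeof(*src), GFP_KERNEL);
        	unsigned long *dst = kvcalloc(npages, sizeof(*dst), GFP_KERNEL);
        	unsigned long i;

        	/* Collect and lock the device private pages in the pfn range. */
        	migrate_device_range(src, first_pfn, npages);

        	for (i = 0; i < npages; i++) {
        		struct page *dpage;

        		if (!(src[i] & MIGRATE_PFN_MIGRATE))
        			continue;
        		/* Allocate a system page; the device-specific DMA copy of
        		 * the GPU data into it is elided here. */
        		dpage = alloc_page(GFP_HIGHUSER);
        		dst[i] = migrate_pfn(page_to_pfn(dpage));
        	}

        	migrate_device_pages(src, dst, npages);
        	migrate_device_finalize(src, dst, npages);
        	kvfree(src);
        	kvfree(dst);
        }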
      
      Link: https://lkml.kernel.org/r/66277601fb8fda9af408b33da9887192bf895bda.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nouveau/dmem: refactor nouveau_dmem_fault_copy_one() · d9b71939
      Alistair Popple authored
      nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
      the migrate_to_ram() callback to copy data from GPU to CPU memory.  It is
      currently specific to fault handling; however, a future patch implementing
      eviction of data during teardown needs similar functionality.
      
      Refactor out the core functionality so that it is not specific to fault
      handling.
      
      Link: https://lkml.kernel.org/r/20573d7b4e641a78fde9935f948e64e71c9e709e.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Lyude Paul <lyude@redhat.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate_device.c: add migrate_device_range() · e778406b
      Alistair Popple authored
      Device drivers can use the migrate_vma family of functions to migrate
      existing private anonymous mappings to device private pages.  These pages
      are backed by memory on the device with drivers being responsible for
      copying data to and from device memory.
      
      Device private pages are freed via the pgmap->page_free() callback when
      they are unmapped and their refcount drops to zero.  Alternatively they
      may be freed indirectly via migration back to CPU memory in response to a
      pgmap->migrate_to_ram() callback called whenever the CPU accesses an
      address mapped to a device private page.
      
      In other words, drivers cannot control the lifetime of data allocated on
      the devices and must wait until these pages are freed from userspace.
      This causes issues when memory needs to be reclaimed on the device, either
      because the device is going away due to a ->release() callback or because
      another user needs to use the memory.
      
      Drivers could use the existing migrate_vma functions to migrate data off
      the device.  However this would require them to track the mappings of each
      page which is both complicated and not always possible.  Instead drivers
      need to be able to migrate device pages directly so they can free up
      device memory.
      
      To allow that, this patch introduces the migrate_device family of
      functions, which are functionally similar to migrate_vma but skip the
      initial lookup based on mapping.
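
      The shape of the new entry point, per the description above (a sketch of
      the prototype; the authoritative one lives in include/linux/migrate.h):

        /* Like the migrate_vma_*() flow, but collects the device private pages
         * backing [start, start + npages) directly, with no vma or mapping
         * lookup.  Migration then proceeds with migrate_device_pages() and
         * migrate_device_finalize() as usual. */
        int migrate_device_range(unsigned long *src_pfns, unsigned long start,
        			 unsigned long npages);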
      
      Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/migrate_device.c: refactor migrate_vma and migrate_device_coherent_page() · 241f6885
      Alistair Popple authored
      migrate_device_coherent_page() reuses the existing migrate_vma family of
      functions to migrate a specific page without providing a valid mapping or
      vma.  This looks a bit odd because it means we are calling migrate_vma_*()
      without setting a valid vma; however, it was considered acceptable at the
      time because the details were internal to migrate_device.c and there was
      only a single user.
      
      One of the reasons the details could be kept internal was that this was
      strictly for migrating device coherent memory.  Such memory can be copied
      directly by the CPU without intervention from a driver.  However this
      isn't true for device private memory, and a future change requires similar
      functionality for device private memory.  So refactor the code into
      something more sensible for migrating device memory without a vma.
      
      Link: https://lkml.kernel.org/r/c7b2ff84e9b33d022cf4a40f87d051f281a16d8f.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memremap.c: take a pgmap reference on page allocation · 0dc45ca1
      Alistair Popple authored
      ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a
      driver.  When the struct page is first allocated by the kernel in
      memremap_pages() a reference is taken on the associated pagemap to ensure
      it is not freed prior to the pages being freed.
      
      Prior to 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
      refcount") pages were considered free and returned to the driver when the
      reference count dropped to one.  However the pagemap reference was not
      dropped until the page reference count hit zero.  This would occur as part
      of the final put_page() in memunmap_pages() which would wait for all pages
      to be freed prior to returning.
      
      When the extra refcount was removed the pagemap reference was no longer
      being dropped in put_page().  Instead memunmap_pages() was changed to
      explicitly drop the pagemap references.  This means that memunmap_pages()
      can complete even though pages are still mapped by the kernel, which can
      lead to kernel crashes, particularly if a driver frees the pagemap.
      
      To fix this drivers should take a pagemap reference when allocating the
      page.  This reference can then be returned when the page is freed.
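
      A sketch of what that pairing looks like; the helper names are
      hypothetical and the exact placement in a driver will differ:

        /* On page allocation: pin the pagemap so memunmap_pages() cannot
         * complete while this page is live. */
        static void drv_page_pin_pgmap(struct page *page)
        {
        	/* Drivers must not allocate pages after memunmap_pages(). */
        	WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
        }

        /* On page free (e.g. from the pgmap->page_free() path): return it. */
        static void drv_page_unpin_pgmap(struct page *page)
        {
        	put_dev_pagemap(page->pgmap);
        }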
      
      Link: https://lkml.kernel.org/r/12d155ec727935ebfbb4d639a03ab374917ea51b.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Fixes: 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page refcount")
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: free device private pages have zero refcount · ef233450
      Alistair Popple authored
      Since 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
      refcount") device private pages have no longer had an extra reference
      count when the page is in use.  However before handing them back to the
      owning device driver we add an extra reference count such that free pages
      have a reference count of one.
      
      This makes it difficult to tell if a page is free or not because both free
      and in use pages will have a non-zero refcount.  Instead we should return
      pages to the driver's page allocator with a zero reference count.  Kernel
      code can then safely use kernel functions such as get_page_unless_zero().
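
      A sketch of the allocation-time initialisation this implies: free device
      private pages now sit at refcount zero, and the driver re-initialises a
      page only when handing it out again (treat the body as illustrative):

        void zone_device_page_init(struct page *page)
        {
        	set_page_count(page, 1);	/* page leaves the free state */
        	lock_page(page);
        }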
      
      Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory.c: fix race when faulting a device private page · 16ce101d
      Alistair Popple authored
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.  However
      without these fixes it is possible to crash the kernel from userspace. 
      These crashes can be triggered either by unloading the kernel module or
      unbinding the device from the driver prior to a userspace task exiting. 
      In modules such as Nouveau it is also possible to trigger some of these
      issues by explicitly closing the device file-descriptor prior to the task
      exiting and then accessing device private memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code. 
      Unfortunately I lack hardware to test either of those so any help there
      would be appreciated.  The changes mimic what is done for both Nouveau
      and hmm-tests though, so I doubt they will cause problems.
      
      
      This patch (of 8):
      
      When the CPU tries to access a device private page the migrate_to_ram()
      callback associated with the pgmap for the page is called.  However no
      reference is taken on the faulting page.  Therefore a concurrent migration
      of the device private page can free the page and possibly the underlying
      pgmap.  This results in a race which can crash the kernel due to the
      migrate_to_ram() function pointer becoming invalid.  It also means drivers
      can't reliably read the zone_device_data field because the page may have
      been freed with memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
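
      A hedged sketch of the fixed fault path in do_swap_page(); the reference
      is taken while the page table lock guarantees the page has not been freed
      (surrounding code abridged):

        vmf->page = pfn_swap_entry_to_page(entry);
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
        			       vmf->address, &vmf->ptl);
        if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
        	goto unlock;		/* the entry changed under us */

        /* Get a reference while the ptl keeps the page alive. */
        get_page(vmf->page);
        pte_unmap_unlock(vmf->pte, vmf->ptl);
        ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
        put_page(vmf->page);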
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: use damon_sz_region() in appropriate place · ab63f63f
      Xin Hao authored
      In many places we can use damon_sz_region() instead of "r->ar.end -
      r->ar.start".
      
      Link: https://lkml.kernel.org/r/20220927001946.85375-2-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Suggested-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/damon: move sz_damon_region to damon_sz_region · 652e0446
      Xin Hao authored
      Rename sz_damon_region() to damon_sz_region(), and move it to
      "include/linux/damon.h", because it can be used in many places.
      
      Link: https://lkml.kernel.org/r/20220927001946.85375-1-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Suggested-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • lib/test_meminit: add checks for the allocation functions · ea091fa5
      Xiaoke Wang authored
      alloc_pages(), kmalloc() and vmalloc() are all memory allocation functions
      which can return NULL when internal memory failures happen.  So it is
      better to check their return values to catch failures in time and make the
      tests more robust.
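
      The pattern being added is the usual allocation check; a representative
      sketch (the helper name and the exact test plumbing are illustrative):

        static void *checked_alloc(size_t size)
        {
        	void *buf = kmalloc(size, GFP_KERNEL);

        	if (!buf)
        		pr_err("test_meminit: allocation failed\n");
        	return buf;	/* caller skips the test case on NULL */
        }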
      
      Link: https://lkml.kernel.org/r/tencent_D44A49FFB420EDCCBFB9221C8D14DFE12908@qq.com
      Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kmsan: unpoison @tlb in arch_tlb_gather_mmu() · ac801e7e
      Alexander Potapenko authored
      This is an optimization to reduce stackdepot pressure.
      
      struct mmu_gather contains 7 1-bit fields packed into a 32-bit unsigned
      int value.  The remaining 25 bits remain uninitialized and are never used,
      but KMSAN updates the origin for them in zap_pXX_range() in mm/memory.c,
      thus creating very long origin chains.  This is technically correct, but
      consumes too much memory.
      
      Unpoisoning the whole structure will prevent creating such chains.
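
      The change itself is a single unpoison call at gather setup; a minimal
      sketch (the function name is simplified here; the real hunk lives in
      mm/mmu_gather.c):

        static void tlb_gather_init(struct mmu_gather *tlb, struct mm_struct *mm)
        {
        	/* The 25 padding bits of the packed flags word are never
        	 * written; mark the whole struct initialized for KMSAN. */
        	kmsan_unpoison_memory(tlb, sizeof(*tlb));
        	tlb->mm = mm;
        }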
      
      Link: https://lkml.kernel.org/r/20220905122452.2258262-20-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Liu Shixin <liushixin2@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ext4,f2fs: fix readahead of verity data · 4fa0e3ff
      Matthew Wilcox (Oracle) authored
      The recent change of page_cache_ra_unbounded() arguments was buggy in the
      two callers, causing us to readahead the wrong pages.  Move the definition
      of ractl down to after the index is set correctly.  This affected
      performance on configurations that use fs-verity.
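
      A sketch of the ordering bug: DEFINE_READAHEAD() snapshots the index at
      definition time, so it must come after the index is rebased to the verity
      data area (the index computation here is a hypothetical stand-in):

        /* Before (buggy): ractl captured the unadjusted index. */
        DEFINE_READAHEAD(ractl, NULL, NULL, inode->i_mapping, index);
        index = verity_data_index(inode, index);	/* hypothetical */
        page_cache_ra_unbounded(&ractl, num_ra_pages, 0);

        /* After (fixed): compute the final index first. */
        index = verity_data_index(inode, index);	/* hypothetical */
        DEFINE_READAHEAD(ractl, NULL, NULL, inode->i_mapping, index);
        page_cache_ra_unbounded(&ractl, num_ra_pages, 0);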
      
      Link: https://lkml.kernel.org/r/20221012193419.1453558-1-willy@infradead.org
      Fixes: 73bb49da ("mm/readahead: make page_cache_ra_unbounded take a readahead_control")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Jintao Yin <nicememory@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mmap: undo ->mmap() when arch_validate_flags() fails · deb0f656
      Carlos Llamas authored
      Commit c462ac28 ("mm: Introduce arch_validate_flags()") added a late
      check in mmap_region() to let architectures validate vm_flags.  The check
      needs to happen after calling ->mmap() as the flags can potentially be
      modified during this callback.
      
      If the arch_validate_flags() check fails we unmap and free the vma.
      However, the error path fails to undo the ->mmap() call that previously
      succeeded and, depending on the specific ->mmap() implementation, this
      translates to reference increments, memory allocations and other
      operations that will not be cleaned up.
      
      There are several places (mainly device drivers) where this is an issue.
      However, one specific example is bpf_map_mmap() which keeps count of the
      mappings in map->writecnt.  The count is incremented on ->mmap() and then
      decremented on vm_ops->close().  When arch_validate_flags() fails this
      count is off since bpf_map_mmap_close() is never called.
      
      One can reproduce this issue in arm64 devices with MTE support.  Here the
      vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been set
      previously.  From userspace it is then enough to pass the PROT_MTE flag to
      the mmap() syscall to trigger the arch_validate_flags() failure.
      
      The following program reproduces this issue:
      
        #include <stdio.h>
        #include <unistd.h>
        #include <linux/unistd.h>
        #include <linux/bpf.h>
        #include <sys/mman.h>
      
        int main(void)
        {
      	union bpf_attr attr = {
      		.map_type = BPF_MAP_TYPE_ARRAY,
      		.key_size = sizeof(int),
      		.value_size = sizeof(long long),
      		.max_entries = 256,
      		.map_flags = BPF_F_MMAPABLE,
      	};
      	int fd;
      
      	fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
      	mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);
      
      	return 0;
        }
      
      By manually adding some log statements to the vm_ops callbacks we can
      confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
      ->release():
      
      With PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  111.263874] bpf_map_write_active_inc: map=9 writecnt=1
        [  111.288763] bpf_map_release: map=9 writecnt=1
      
      Without PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  157.816912] bpf_map_write_active_inc: map=10 writecnt=1
        [  157.830442] bpf_map_write_active_dec: map=10 writecnt=0
        [  157.832396] bpf_map_release: map=10 writecnt=0
      
      This patch fixes the above issue by calling vm_ops->close() when the
      arch_validate_flags() check fails; after this we can proceed to unmap and
      free the vma on the error path.
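
      A hedged sketch of the error path this adds to mmap_region(); the label
      name is illustrative:

        if (!arch_validate_flags(vma->vm_flags)) {
        	error = -EINVAL;
        	/* Undo any state set up by the ->mmap() callback above. */
        	if (vma->vm_ops && vma->vm_ops->close)
        		vma->vm_ops->close(vma);
        	goto free_vma;		/* then unmap and free as before */
        }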
      
      Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
      Fixes: c462ac28 ("mm: Introduce arch_validate_flags()")
      Signed-off-by: Carlos Llamas <cmllamas@google.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 12 Oct, 2022 6 commits
  3. 07 Oct, 2022 5 commits
    • hugetlb: allocate vma lock for all sharable vmas · bbff39cc
      Mike Kravetz authored
      The hugetlb vma lock was originally designed to synchronize pmd sharing. 
      As such, it was only necessary to allocate the lock for vmas that were
      capable of pmd sharing.  Later in the development cycle, it was discovered
      that it could also be used to simplify handling of fault/truncation races
      as described in [1].  However, a subsequent change to allocate the lock for
      all vmas
      that use the page cache was never made.  A fault/truncation race could
      leave pages in a file past i_size until the file is removed.
      
      Remove the previous restriction and allocate lock for all VM_MAYSHARE
      vmas.  Warn in the unlikely event of allocation failure.
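
      A sketch of the allocation rule after this change (body abridged; the lock
      structure's reference counting is elided and field names follow the vma
      lock series):

        static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
        {
        	struct hugetlb_vma_lock *vma_lock;

        	/* All sharable vmas now get a lock, not only those capable
        	 * of pmd sharing. */
        	if (!(vma->vm_flags & VM_MAYSHARE) || vma->vm_private_data)
        		return;

        	vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
        	if (!vma_lock) {
        		WARN_ONCE(1, "hugetlb: vma lock allocation failed\n");
        		return;
        	}
        	init_rwsem(&vma_lock->rw_sema);
        	vma_lock->vma = vma;
        	vma->vm_private_data = vma_lock;
        }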
      
      [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
      Fixes: "hugetlb: clean up code checking for fault/truncation races"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer · ecfbd733
      Mike Kravetz authored
      hugetlb file truncation/hole punch code may need to back out and reacquire
      locks in the correct order in hugetlb_unmap_file_folio().  This code could
      race with vma freeing as pointed out in [1] and result in accessing a
      stale vma pointer.  To address this, take the vma_lock when clearing the
      vma_lock->vma pointer.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
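
      A sketch of the fix: the back-pointer is now cleared only while holding
      the lock itself, so a racing hugetlb_unmap_file_folio() cannot observe a
      freed vma through it (the function name here is illustrative; field names
      follow the vma lock series, body abridged):

        static void hugetlb_vma_lock_clear_vma(struct vm_area_struct *vma)
        {
        	struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;

        	down_write(&vma_lock->rw_sema);
        	vma_lock->vma = NULL;
        	vma->vm_private_data = NULL;
        	up_write(&vma_lock->rw_sema);
        }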
      
      [mike.kravetz@oracle.com: address build issues]
        Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
      Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
      Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: fix vma lock handling during split vma and range unmapping · 131a79b4
      Mike Kravetz authored
      Patch series "hugetlb: fixes for new vma lock series".
      
      In review of the series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", Miaohe Lin pointed out two key issues:
      
      1) There is a race in the routine hugetlb_unmap_file_folio when locks
         are dropped and reacquired in the correct order [1].
      
      2) With the switch to using vma lock for fault/truncate synchronization,
         we need to make sure lock exists for all VM_MAYSHARE vmas, not just
         vmas capable of pmd sharing.
      
      These two issues are addressed here.  In addition, having a vma lock
      present in all VM_MAYSHARE vmas, uncovered some issues around vma
      splitting.  Those are also addressed.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      
      This patch (of 3):
      
      The hugetlb vma lock hangs off the vm_private_data field and is specific
      to the vma.  When vm_area_dup() is called as part of vma splitting, the
      vma lock pointer is copied to the new vma.  This will result in issues
      such as double freeing of the structure.  Update the hugetlb open vm_ops
      to allocate a new vma lock for the new vma.
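
      A sketch of the open vm_ops update (abridged; any reference counting of
      shared state is elided):

        static void hugetlb_vm_op_open(struct vm_area_struct *vma)
        {
        	/* vm_area_dup() copied the parent's lock pointer; drop it and
        	 * allocate a lock that belongs to this vma alone. */
        	vma->vm_private_data = NULL;
        	hugetlb_vma_lock_alloc(vma);
        }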
      
      The routine __unmap_hugepage_range_final unconditionally unsets VM_MAYSHARE
      to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
      anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
      only VM_MAYSHARE was set we would miss the free.  With the introduction of
      the vma lock, a vma cannot participate in pmd sharing if vm_private_data
      is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
      free the vma lock to prevent sharing.  Also, update the sharing code to
      make sure the vma lock is indeed a condition for pmd sharing.
      hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
      Fixes: "hugetlb: add vma based lock for pmd sharing"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mglru: don't sync disk for each aging cycle · 14aa8b2d
      Yu Zhao authored
      wakeup_flusher_threads() was added under the assumption that if a system
      runs out of clean cold pages, it might want to write back dirty pages more
      aggressively so that they can become clean and be dropped.
      
      However, doing so can breach the rate limit a system wants to impose on
      writeback, resulting in early SSD wearout.
      
      Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 03 Oct, 2022 15 commits
    • mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol · e55b9f96
      Johannes Weiner authored
      Since 2d1c4980 ("mm: memcontrol: make swap tracking an integral part
      of memory control"), CONFIG_MEMCG_SWAP hasn't been a user-visible config
      option anymore; it just means CONFIG_MEMCG && CONFIG_SWAP.
      
      Update the sites accordingly and drop the symbol.
      
      [ While touching the docs, remove two references to CONFIG_MEMCG_KMEM,
        which hasn't been a user-visible symbol for over half a decade. ]
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-5-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: use do_memsw_account() in a few more places · b94c4e94
      Johannes Weiner authored
      It's slightly more descriptive and consistent with other places that
      distinguish cgroup1's combined memory+swap accounting scheme from
      cgroup2's dedicated swap accounting.
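
      For reference, the helper being switched to is a one-liner (definition as
      in mm/memcontrol.c):

        static inline bool do_memsw_account(void)
        {
        	/* cgroup1 (non-default hierarchy) accounts memory+swap
        	 * together; cgroup2 accounts swap on its own. */
        	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
        }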
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: deprecate swapaccounting=0 mode · b25806dc
      Johannes Weiner authored
      The swapaccounting= commandline option already does very little today.  To
      close a trivial containment failure case, the swap ownership tracking part
      of the swap controller has recently become mandatory (see commit
      2d1c4980 ("mm: memcontrol: make swap tracking an integral part of
      memory control") for details), which makes up the majority of the work
      during swapout, swapin, and the swap slot map.
      
      The only thing left under this flag is the page_counter operations and the
      visibility of the swap control files in the first place, which are rather
      meager savings.  There also aren't many scenarios, if any, where
      controlling the memory of a cgroup while allowing it unlimited access to a
      global swap space is a workable resource isolation strategy.
      
      On the other hand, there have been several bugs and confusion around the
      many possible swap controller states (cgroup1 vs cgroup2 behavior, memory
      accounting without swap accounting, memcg runtime disabled).
      
      This puts the maintenance overhead of retaining the toggle above its
      practical benefits.  Deprecate it.
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled · c91bdc93
      Johannes Weiner authored
      Patch series "memcg swap fix & cleanups".
      
      
      This patch (of 4):
      
      Since commit 2d1c4980 ("mm: memcontrol: make swap tracking an integral
      part of memory control"), the cgroup swap arrays are used to track memory
      ownership at the time of swap readahead and swapoff, even if swap space
      *accounting* has been turned off by the user via swapaccount=0 (which sets
      cgroup_memory_noswap).
      
      However, the patch was overzealous: by simply dropping the
      cgroup_memory_noswap conditionals in the swapon, swapoff and uncharge
      path, it caused the cgroup arrays being allocated even when the memory
      controller as a whole is disabled.  This is a waste of that memory.
      
      Restore mem_cgroup_disabled() checks, implied previously by
      cgroup_memory_noswap, in the swapon, swapoff, and swap_entry_free
      callbacks.
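
      A sketch of the restored guard in the swapon path (swapoff and
      swap_entry_free get the equivalent check; the array allocation itself is
      elided):

        int swap_cgroup_swapon(int type, unsigned long max_pages)
        {
        	/* With the memory controller disabled there is no ownership
        	 * to track: skip allocating the per-entry cgroup array. */
        	if (mem_cgroup_disabled())
        		return 0;
        	/* ... allocate and map the swap_cgroup array as before ... */
        	return 0;
        }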
      
      Link: https://lkml.kernel.org/r/20220926135704.400818-1-hannes@cmpxchg.org
      Link: https://lkml.kernel.org/r/20220926135704.400818-2-hannes@cmpxchg.org
      Fixes: 2d1c4980 ("mm: memcontrol: make swap tracking an integral part of memory control")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/secretmem: remove redundant return value · f7c5b1aa
      Xiu Jianfeng authored
      The return value @ret is always 0, so remove it and return 0 directly.
      
      Link: https://lkml.kernel.org/r/20220920012205.246217-1-xiujianfeng@huawei.com
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/hugetlb: add available_huge_pages() func · 8346d69d
      Xin Hao authored
      In hugetlb.c there are several places which compare the values of
      'h->free_huge_pages' and 'h->resv_huge_pages'; it looks a bit messy, so
      add a new available_huge_pages() function to do this.
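
      The new helper wraps the repeated comparison; per the description it reads
      along these lines:

        static inline bool available_huge_pages(struct hstate *h)
        {
        	/* Free huge pages not already earmarked by reservations. */
        	return h->free_huge_pages - h->resv_huge_pages > 0;
        }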
      
      Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: remove unused inline functions from include/linux/mm_inline.h · 6b91e5df
      Gaosheng Cui authored
      Remove the following unused inline functions from mm_inline.h:
      
      1.  All uses of add_page_to_lru_list_tail() have been removed since
         commit 7a3dbfe8 ("mm/swap: convert lru_deactivate_file to a
         folio_batch"), and it can be replaced by lruvec_add_folio_tail().
      
      2.  All uses of __clear_page_lru_flags() have been removed since commit
         188e8cae ("mm/swap: convert __page_cache_release() to use a
         folio"), and it can be replaced by __folio_clear_lru_flags().
      
      They are useless, so remove them.
      
      Link: https://lkml.kernel.org/r/20220922110935.1495099-1-cuigaosheng1@huawei.com
      Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory · 0f633baa
      Zach O'Keefe authored
      Add :collapse mod to userfaultfd selftest.  Currently this mod is only
      valid for "shmem" test type, but could be used for other test types.
      
      When provided, memory allocated by ->allocate_area() will be
      hugepage-aligned and enforced to be hugepage-sized.  userfaultfd_minor_test,
      after the UFFD-registered mapping has been populated by the UFFD minor fault
      handler, attempts to MADV_COLLAPSE the UFFD-registered mapping to collapse
      the memory into a pmd-mapped THP.
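
      The collapse step itself is a single madvise() call on the populated,
      hugepage-aligned region; a sketch (the variable names and the err()
      convention follow the selftest style but are assumptions here):

        /* After UFFDIO minor faults populated the mapping, ask the kernel
         * to collapse it in place into a pmd-mapped THP. */
        if (madvise(area_dst, hpage_size, MADV_COLLAPSE))
        	err("madvise(MADV_COLLAPSE)");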
      
      This test is meant to be a functional test of what occurs during
      UFFD-driven live migration of VMs backed by huge tmpfs where, after a
      hugepage-sized region has been successfully migrated (in native page-sized
      chunks, to avoid the latency of fetching a hugepage over the network), we want
      to reclaim previous VM performance by remapping it at the PMD level.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-11-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-11-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd · 69d9428c
      Zach O'Keefe authored
      This test verifies that MADV_COLLAPSE succeeds on file/shmem memory for
      which (1) the file extent mapped by the memory is already a huge page in
      the page cache, and (2) the pmd mapping this memory in the target process
      is none.
      
      In practice, (1)+(2) is the state left over after khugepaged has
      successfully collapsed file/shmem memory for a target VMA, but the memory
      has not yet been refaulted.  So, this test in-effect tests MADV_COLLAPSE
      racing with khugepaged to collapse the memory first.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-10-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-10-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add thp collapse shmem testing · d0d35b60
      Zach O'Keefe authored
      Add memory operations for shmem (memfd) memory, and reuse existing tests
      with the new memory operations.
      
      Shmem tests can be called with the "shmem" mem_type, and shmem tests are
      run with the "all" mem_type as well.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-9-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-9-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: add thp collapse file and tmpfs testing · 1b03d0d5
      Zach O'Keefe authored
      Add memory operations for file-backed and tmpfs memory.  Call existing
      tests with these new memory operations to test collapse functionality of
      khugepaged and MADV_COLLAPSE on file-backed and tmpfs memory.  Not all
      tests are reusable; for example, collapse_swapin_single_pte(), which checks
      swap usage, is not.
      
      Refactor test arguments.  Usage is now:
      
      Usage: ./khugepaged <test type> [dir]
      
              <test type>     : <context>:<mem_type>
              <context>       : [all|khugepaged|madvise]
              <mem_type>      : [all|anon|file]
      
              "file,all" mem_type requires [dir] argument
      
              "file,all" mem_type requires kernel built with
              CONFIG_READ_ONLY_THP_FOR_FS=y
      
              if [dir] is a (sub)directory of a tmpfs mount, tmpfs must be
              mounted with huge=madvise option for khugepaged tests to work
      
      Refactor calling tests to make it clear what collapse context / memory
      operations they support, but only invoke tests requested by the user.
      Also log what test is being run, and with what context / memory, to make
      test logs more human-readable.
      
      A new test file is created and deleted for every test to ensure no pages
      remain in the page cache between tests (tests also may attempt to collapse
      different amounts of memory).
      
      For file-backed memory where the file is stored on a block device, disable
      /sys/block/<device>/queue/read_ahead_kb so that pages don't find their way
      into the page cache without the tests faulting them in.
      
      Add file and shmem wrappers to vm_util to check for file and shmem
      hugepages in smaps.
      
      [zokeefe@google.com: fix "add thp collapse file and tmpfs testing" for
        tmpfs]
        Link: https://lkml.kernel.org/r/20220913212517.3163701-1-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220907144521.3115321-8-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-8-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: modularize thp collapse memory operations · 8e638707
      Zach O'Keefe authored
      Modularize operations to setup, cleanup, fault, and check for huge pages,
      for a given memory type.  This allows reusing existing tests with
      additional memory types by defining new memory operations.  Following
      patches will add file and shmem memory types.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-7-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-7-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/vm: dedup THP helpers · c07c343c
      Zach O'Keefe authored
      These files:
      
      tools/testing/selftests/vm/vm_util.c
      tools/testing/selftests/vm/khugepaged.c
      
      Both contain logic to:
      
      1) Determine hugepage size on current system
      2) Read /proc/self/smaps to determine number of THPs at an address
      
      Refactor selftests/vm/khugepaged.c to use the vm_util common helpers and
      add it as a build dependency.
      
      Since selftests/vm/khugepaged.c is the largest user of check_huge(),
      change the signature of check_huge() to match selftests/vm/khugepaged.c's
      usage: take an expected number of hugepages, and return a bool indicating
      if the correct number of hugepages were found.  Add a wrapper,
      check_huge_anon(), in anticipation of checking smaps for file and shmem
      hugepages.
      
      Update existing callsites to use the new pattern / function.
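
      The reshaped helpers look roughly like this (the smaps reader is a
      hypothetical stand-in for the parsing done in vm_util.c):

        /* Return true iff smaps reports exactly nr_hp THPs at addr. */
        bool check_huge(void *addr, int nr_hp)
        {
        	return get_smaps_thp_count(addr) == nr_hp;	/* hypothetical */
        }

        /* Anon wrapper, anticipating file and shmem variants. */
        bool check_huge_anon(void *addr, int nr_hp)
        {
        	return check_huge(addr, nr_hp);
        }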
      
      Likewise, check_for_pattern() was duplicated, and it's a general enough
      helper to include in vm_util helpers as well.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-6-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-6-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/khugepaged: add tracepoint to hpage_collapse_scan_file() · d41fd201
      Zach O'Keefe authored
      Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
      hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
      
      While this change is targeted at debugging the MADV_COLLAPSE pathway, the
      "mm_khugepaged" prefix is retained for symmetry with
      huge_memory:trace_mm_khugepaged_scan_pmd, which keeps its legacy name
      to avoid changing the kernel ABI as much as possible.
      
      Link: https://lkml.kernel.org/r/20220907144521.3115321-5-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-5-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/madvise: add file and shmem support to MADV_COLLAPSE · 34488399
      Zach O'Keefe authored
      Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
      memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
      
      On success, the backing memory will be a hugepage.  For the memory range
      and process provided, the page tables will synchronously have a huge pmd
      installed, mapping the THP.  Other mappings of the file extent mapped by
      the memory range may be added to a set of entries that khugepaged will
      later process and attempt to update their page tables to map the THP by
      a pmd.
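      
      As a usage illustration, a minimal userspace sketch of invoking
      MADV_COLLAPSE on a file-backed mapping might look as follows (the file
      path and mapping size are assumptions, and error handling is
      abbreviated):
      
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <sys/mman.h>
      	#include <unistd.h>
      
      	#ifndef MADV_COLLAPSE
      	#define MADV_COLLAPSE 25	/* uapi/asm-generic/mman-common.h */
      	#endif
      
      	int main(void)
      	{
      		const size_t len = 2UL << 20;	/* one PMD-sized extent on x86-64 */
      		int fd = open("/usr/bin/cat", O_RDONLY);	/* illustrative file */
      		void *p;
      
      		if (fd < 0)
      			return 1;
      
      		p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
      		if (p == MAP_FAILED)
      			return 1;
      
      		/* Ask the kernel to synchronously back [p, p + len) with a THP. */
      		if (madvise(p, len, MADV_COLLAPSE))
      			perror("madvise(MADV_COLLAPSE)");
      
      		munmap(p, len);
      		close(fd);
      		return 0;
      	}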
      
      This functionality unlocks two important uses:
      
      (1)	Immediately backing executable text with THPs.  The current support
      	provided by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a
      	large system, which can keep services from serving at their full
      	rated load after (re)starting.  Tricks like mremap(2)'ing text onto
      	anonymous memory to immediately realize iTLB performance prevent
      	page sharing and demand paging, both of which increase steady-state
      	memory footprint.  Now we can have the best of both worlds: peak
      	upfront performance and lower RAM footprints.
      
      (2)	userfaultfd-based live migration of virtual machines satisfies UFFD
      	faults by fetching native-sized pages over the network (to avoid the
      	latency of transferring an entire hugepage).  However, after guest
      	memory has been fully copied to the new host, MADV_COLLAPSE can
      	be used to immediately increase guest performance.
      
      Since khugepaged is single threaded, this change now introduces the
      possibility of collapse contexts racing in the file collapse path.  There
      are a few important places to consider:
      
      (1)	hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
      	We could have the memory collapsed out from under us, but
      	the next xas_for_each() iteration will correctly pick up the
      	hugepage.  The hugepage might not be up to date (insofar as
      	copying of small page contents might not have completed - the
      	page may still be locked), but regardless of which small page
      	index we were iterating over, we'll find the hugepage and
      	identify it as a suitably aligned compound page of order
      	HPAGE_PMD_ORDER.
      
      	In the khugepaged path, we locklessly check the value of the
      	pmd, and only add it to the deferred collapse array if we find
      	a pmd mapping a pte table.  This is fine, since other values
      	that could have raced in right afterwards denote failure, or
      	that the memory was successfully collapsed, so we don't need
      	further processing.
      
      	In the madvise path, we'll take mmap_lock in write mode to
      	serialize against page table updates and will know what to do
      	based on the true value of the pmd: recheck all ptes if it
      	points to a pte table; directly install a huge pmd if the pmd
      	has been cleared but the memory has not yet been refaulted; or
      	do nothing at all if we find a huge pmd already installed (a
      	sketch of this decision appears after this list).
      
      	It's worth putting emphasis here on how we treat the none pmd.
      	If khugepaged has processed this mm's page tables already, it
      	will have left the pmd cleared (ready for refault by the
      	process).  Depending on the VMA flags and sysfs settings, the
      	amount of RAM on the machine, and the current load, this could
      	be a relatively common occurrence - and as such is one we'd
      	like to handle successfully in MADV_COLLAPSE.  When we see the
      	none pmd in collapse_pte_mapped_thp(), we've locked mmap_lock
      	in write mode and checked (a) hugepage_vma_check() to see if
      	the backing memory is still appropriate, along with VMA sizing
      	and appropriate hugepage alignment within the file, and (b)
      	we've found a hugepage head of order HPAGE_PMD_ORDER at the
      	offset in the file mapped by our hugepage-aligned virtual
      	address.  Even though the common case is likely a race with
      	khugepaged, given these checks (regardless of how we got here -
      	we could be operating on a completely different file than
      	originally checked in hpage_collapse_scan_file() for all we
      	know) it should be safe to directly make the pmd a huge pmd
      	pointing to this hugepage.
      
      (2)	collapse_file() is mostly serialized on the same file extent by
      	the lock sequence:
      
      		|	lock hugepage
      		|		lock mapping->i_pages
      		|			lock 1st page
      		|		unlock mapping->i_pages
      		|				<page checks>
      		|		lock mapping->i_pages
      		|				page_ref_freeze(3)
      		|				xas_store(hugepage)
      		|		unlock mapping->i_pages
      		|				page_ref_unfreeze(1)
      		|			unlock 1st page
      		V	unlock hugepage
      
      	Once a context (which already has its fresh hugepage locked)
      	locks mapping->i_pages exclusively, it will hold said lock
      	until it locks the first page, and it will hold that lock until
      	after the hugepage has been added to the page cache (and will
      	unlock the hugepage after the page table update, though that
      	isn't important here).
      
      	A racing context that loses the race for mapping->i_pages will
      	then lose the race to locking the first page.  Here - depending
      	on how far the other racing context has gotten - we might find
      	the new hugepage (in which case we'll exit cleanly when we
      	check PageTransCompound()), or we'll find the "old" 1st small
      	page (in which case we'll exit cleanly when we discover an
      	unexpected refcount of 2 after isolate_lru_page()).  This
      	assumes we are able to successfully lock the page we find - in
      	the shmem path, we could just fail the trylock and exit cleanly
      	anyway.
      
      	The failure path in collapse_file() is similar: once we hold
      	the lock on the 1st small page, we are serialized against other
      	collapse contexts.  Before the 1st small page is unlocked, we
      	add it back to the pagecache and unfreeze the refcount
      	appropriately.  Contexts that lost the race to the 1st small
      	page will then find the same 1st small page with the correct
      	refcount and will be able to proceed.
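      
      To make the madvise-path pmd handling in (1) concrete, here is a
      minimal kernel-style sketch of the decision over the pmd value taken
      under mmap_lock held for write (the function and enum names are
      hypothetical, not the code in collapse_pte_mapped_thp()):
      
      	/* Hedged sketch only: the three cases described in (1) for the
      	 * madvise path; not the actual kernel implementation. */
      	enum collapse_action {
      		COLLAPSE_RECHECK_PTES,	/* pmd maps a pte table */
      		COLLAPSE_INSTALL_PMD,	/* pmd cleared, not yet refaulted */
      		COLLAPSE_NONE,		/* huge pmd already installed */
      	};
      
      	static enum collapse_action madvise_pmd_action(pmd_t pmdval)
      	{
      		/* Caller holds mmap_lock in write mode, so pmdval is the
      		 * true, stable value of the pmd. */
      		if (pmd_none(pmdval))
      			return COLLAPSE_INSTALL_PMD;
      		if (pmd_trans_huge(pmdval))
      			return COLLAPSE_NONE;
      		return COLLAPSE_RECHECK_PTES;
      	}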
      
      [zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
        Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
      [shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
      	check for multi-add in khugepaged_add_pte_mapped_thp()]
        Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
      Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      34488399