1. 13 Oct, 2022 22 commits
    • mm/page_alloc: fix incorrect PGFREE and PGALLOC for high-order page · 15cd9004
      Yafang Shao authored
      PGFREE and PGALLOC represent the number of freed and allocated pages.  So
      the page order must be considered.
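
      A minimal sketch of the accounting idea (illustrative only, not
      necessarily the exact hunk): an order-N page covers 1 << N base pages,
      so the event counters must be bumped by that amount rather than by one,
      e.g.:

        /* order-aware event accounting on the per-cpu free/alloc paths */
        __count_vm_events(PGFREE, 1 << order);
        __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);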
      
      Link: https://lkml.kernel.org/r/20221006101540.40686-1-laoar.shao@gmail.com
      Fixes: 44042b44 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      15cd9004
    • mm/selftest: uffd: explain the write missing fault check · 26c92d37
      Peter Xu authored
      It's not obvious why we have a write check for each of the missing
      messages, especially when it should be a locking op.  Add a rich comment
      for that, and also try to explain its benefits and limitations, so that
      if someone hits it again, for either a bug or a different glibc
      implementation, there'll be some clue to start with.
      
      Link: https://lkml.kernel.org/r/20221004193400.110155-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      26c92d37
    • mm/hugetlb: use hugetlb_pte_stable in migration race check · f9bf6c03
      Peter Xu authored
      Now that hugetlb_pte_stable() has been introduced, we can also rewrite
      the migration race check against page allocation to use the new helper.
      
      Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f9bf6c03
    • mm/hugetlb: fix race condition of uffd missing/minor handling · 2ea7ff1e
      Peter Xu authored
      Patch series "mm/hugetlb: Fix selftest failures with write check", v3.
      
      Currently the akpm mm-unstable tree randomly fails the uffd hugetlb
      private mapping test on a write check.

      The initial bisection of that points to the recent pmd unshare series,
      but it turns out there's no direct relationship with that series; it
      only changed the timing enough for the race to start triggering.
      
      The race is fixed in patch 1.  Patch 2 is a trivial cleanup of the
      similar race with hugetlb migrations, and patch 3 comments on the write
      check so that when anyone reads it again it'll be clear why it's there.
      
      
      This patch (of 3):
      
      After the recent rework patchset of hugetlb locking on pmd sharing,
      kselftest for userfaultfd sometimes fails on hugetlb private tests with
      unexpected write fault checks.
      
      It turns out there's nothing wrong within the locking series regarding
      this matter, but it changed the timing of the threads enough to trigger
      an old bug.
      
      The real bug is that when we call hugetlb_no_page() we do not hold the
      pgtable lock, which means we read the pte values locklessly.  That is
      perfectly fine in most cases because before we do normal page
      allocations we take the lock and check pte_same() again.  However,
      before that point there are two paths in userfaultfd missing/minor
      handling that may move on with the fault process without rechecking the
      pte values.

      For these two paths we may therefore generate a uffd message based on an
      unstable pte, while an unstable pte can legally be anything as long as
      the modifier holds the pgtable lock.
      
      One example, which is also what happened in the failing kselftest and
      caused the test failure, is that wr-protection changes can happen on a
      page of a private mapping.  Since hugetlb_change_protection() generally
      requires the pte to be cleared before being changed, there can be a race
      condition like:
      
              thread 1                              thread 2
              --------                              --------
      
            UFFDIO_WRITEPROTECT                     hugetlb_fault
              hugetlb_change_protection
                pgtable_lock()
                huge_ptep_modify_prot_start
                                                    pte==NULL
                                                    hugetlb_no_page
                                                      generate uffd missing event
                                                      even if page existed!!
                huge_ptep_modify_prot_commit
                pgtable_unlock()
      
      Fix this by rechecking the pte under the pgtable lock for both the
      userfaultfd missing and minor fault paths.
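
      A simplified sketch of the recheck pattern this relies on (the helper
      name below is illustrative; see hugetlb_pte_stable() for the real
      code): the locklessly read pte is only trusted if it is still the same
      once the page table lock is held.

        static bool pte_is_stable(struct hstate *h, struct mm_struct *mm,
                                  pte_t *ptep, pte_t old_pte)
        {
                spinlock_t *ptl = huge_pte_lock(h, mm, ptep);
                bool same = pte_same(huge_ptep_get(ptep), old_pte);

                spin_unlock(ptl);
                return same;
        }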
      
      This bug has presumably been around since uffd hugetlb support was
      introduced, so attach a Fixes tag to that commit.  Also attach another
      Fixes tag to the minor fault support commit for easier tracking.
      
      Note that userfaultfd is actually fine with false positives (e.g.
      caused by a pte change), but not with wrong logical events (e.g. caused
      by reading a pte while it is being changed).  The latter can confuse
      userspace, so the strictness is very much preferred.  E.g., a MISSING
      event should never happen on a page after UFFDIO_COPY has correctly
      installed the page and returned.
      
      Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com
      Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
      Fixes: 7677f7fd ("userfaultfd: add minor fault registration mode")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2ea7ff1e
    • zram: always expose rw_page · 94541bc3
      Brian Geffon authored
      Currently zram will adjust its fops to a version which does not contain
      rw_page when a backing device has been assigned.  This is done to prevent
      upper layers from assuming a synchronous operation when a page may have
      been written back.  This forces every operation through bio which has
      overhead associated with bio_alloc/frees.
      
      The code can be simplified to always expose an rw_page method; only in
      the rare event that a page has been written back do we instead return
      -EOPNOTSUPP, forcing the upper layer to fall back to bio.
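
      A rough sketch of the resulting shape (the helper names below are made
      up, not zram internals, and the prototype is shown schematically): the
      rw_page hook stays registered and only bails out when the target page
      has actually been written back.

        static int example_rw_page(struct block_device *bdev, sector_t sector,
                                   struct page *page, unsigned int op)
        {
                struct example_dev *dev = bdev->bd_disk->private_data;
                unsigned long index = sector_to_index(sector);  /* hypothetical */

                if (slot_was_written_back(dev, index))          /* hypothetical */
                        return -EOPNOTSUPP;     /* upper layer falls back to bio */

                return example_sync_rw(dev, page, index, op);   /* hypothetical */
        }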
      
      Link: https://lkml.kernel.org/r/20221003144832.2906610-1-bgeffon@google.com
      Signed-off-by: Brian Geffon <bgeffon@google.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Rom Lemarchand <romlem@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      94541bc3
    • LoongArch: update local TLB if PTE entry exists · 14c2ac36
      Qi Zheng authored
      Currently, the implementation of update_mmu_tlb() is empty if
      __HAVE_ARCH_UPDATE_MMU_TLB is not defined.  Then if two threads
      concurrently fault at the same page, the second thread that did not win
      the race will give up and do nothing.  On the LoongArch architecture,
      this second thread will then trigger another fault and only update its
      local TLB.
      
      Instead of triggering another fault, it's better to implement
      update_mmu_tlb() to directly update the local TLB of the second thread. 
      Just do it.
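
      A sketch of what such an arch hook can look like (assuming the
      architecture's existing update_mmu_cache() path is suitable for
      refreshing the local TLB):

        /* arch pgtable.h (sketch) */
        #define __HAVE_ARCH_UPDATE_MMU_TLB
        #define update_mmu_tlb  update_mmu_cache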
      
      Link: https://lkml.kernel.org/r/20220929112318.32393-3-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Suggested-by: Bibo Mao <maobibo@loongson.cn>
      Acked-by: Huacai Chen <chenhuacai@loongson.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      14c2ac36
    • mm: use update_mmu_tlb() on the second thread · bce8cb3c
      Qi Zheng authored
      As the message of commit 7df67697 ("mm/memory.c: Update local TLB if PTE
      entry exists") says, we should update the local TLB only on the second
      thread.  So in do_anonymous_page(), use update_mmu_tlb() instead of
      update_mmu_cache() on the second thread.
      
      As David pointed out, this is a performance improvement, not a
      correctness fix.
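
      The call site then looks roughly like this (simplified sketch of the
      do_anonymous_page() pattern): the thread that lost the race only
      refreshes its local TLB instead of updating the MMU cache for a pte it
      did not install.

        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                       vmf->address, &vmf->ptl);
        if (!pte_none(*vmf->pte)) {
                /* someone else already installed the pte */
                update_mmu_tlb(vma, vmf->address, vmf->pte);
                goto release;
        }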
      
      Link: https://lkml.kernel.org/r/20220929112318.32393-2-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Bibo Mao <maobibo@loongson.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Huacai Chen <chenhuacai@loongson.cn>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bce8cb3c
    • kasan: fix array-bounds warnings in tests · d6e5040b
      Andrey Konovalov authored
      GCC's -Warray-bounds option detects out-of-bounds accesses to
      statically-sized allocations in krealloc out-of-bounds tests.
      
      Use OPTIMIZER_HIDE_VAR to suppress the warning.
      
      Also change kmalloc_memmove_invalid_size to use OPTIMIZER_HIDE_VAR
      instead of a volatile variable.
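
      Roughly (a sketch, not the exact test hunks), the pattern is to launder
      the pointer through OPTIMIZER_HIDE_VAR() so the compiler can no longer
      prove at build time that the access is out of bounds, while KASAN still
      catches it at run time:

        char *ptr = kmalloc(size, GFP_KERNEL);

        OPTIMIZER_HIDE_VAR(ptr);
        KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 'x');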
      
      Link: https://lkml.kernel.org/r/e94399242d32e00bba6fd0d9ec4c897f188128e8.1664215688.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d6e5040b
    • hmm-tests: add test for migrate_device_range() · ad4c3652
      Alistair Popple authored
      Link: https://lkml.kernel.org/r/a73cf109de0224cfd118d22be58ddebac3ae2897.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ad4c3652
    • nouveau/dmem: evict device private memory during release · 24988123
      Alistair Popple authored
      When the module is unloaded or a GPU is unbound from the module it is
      possible for device private pages to still be mapped in currently running
      processes.  This can lead to hangs and RCU stall warnings when unbinding
      the device, as memunmap_pages() will wait in an uninterruptible state
      until all device pages have been freed, which may never happen.
      
      Fix this by migrating device mappings back to normal CPU memory prior to
      freeing the GPU memory chunks and associated device private pages.
      
      Link: https://lkml.kernel.org/r/66277601fb8fda9af408b33da9887192bf895bda.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      24988123
    • nouveau/dmem: refactor nouveau_dmem_fault_copy_one() · d9b71939
      Alistair Popple authored
      nouveau_dmem_fault_copy_one() is used during handling of CPU faults via
      the migrate_to_ram() callback and is used to copy data from GPU to CPU
      memory.  It is currently specific to fault handling, however a future
      patch implementing eviction of data during teardown needs similar
      functionality.
      
      Refactor out the core functionality so that it is not specific to fault
      handling.
      
      Link: https://lkml.kernel.org/r/20573d7b4e641a78fde9935f948e64e71c9e709e.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Lyude Paul <lyude@redhat.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d9b71939
    • mm/migrate_device.c: add migrate_device_range() · e778406b
      Alistair Popple authored
      Device drivers can use the migrate_vma family of functions to migrate
      existing private anonymous mappings to device private pages.  These pages
      are backed by memory on the device with drivers being responsible for
      copying data to and from device memory.
      
      Device private pages are freed via the pgmap->page_free() callback when
      they are unmapped and their refcount drops to zero.  Alternatively they
      may be freed indirectly via migration back to CPU memory in response to a
      pgmap->migrate_to_ram() callback called whenever the CPU accesses an
      address mapped to a device private page.
      
      In other words drivers cannot control the lifetime of data allocated on
      the devices and must wait until these pages are freed from userspace. 
      This causes issues when memory needs to be reclaimed on the device, either
      because the device is going away due to a ->release() callback or because
      another user needs to use the memory.
      
      Drivers could use the existing migrate_vma functions to migrate data off
      the device.  However this would require them to track the mappings of each
      page which is both complicated and not always possible.  Instead drivers
      need to be able to migrate device pages directly so they can free up
      device memory.
      
      To allow that, this patch introduces the migrate_device family of
      functions, which are functionally similar to migrate_vma but skip the
      initial mapping-based lookup.
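
      A rough usage sketch from a driver's point of view (array setup and
      error handling omitted; src_pfns/dst_pfns are caller-allocated arrays of
      npages entries): the driver hands in a physical pfn range instead of a
      vma and then reuses the usual migrate/finalize steps.

        ret = migrate_device_range(src_pfns, start_pfn, npages);
        if (ret)
                goto out;

        /* ... allocate destination system pages and copy the data ... */

        migrate_device_pages(src_pfns, dst_pfns, npages);
        migrate_device_finalize(src_pfns, dst_pfns, npages);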
      
      Link: https://lkml.kernel.org/r/868116aab70b0c8ee467d62498bb2cf0ef907295.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e778406b
    • mm/migrate_device.c: refactor migrate_vma and migrate_device_coherent_page() · 241f6885
      Alistair Popple authored
      migrate_device_coherent_page() reuses the existing migrate_vma family of
      functions to migrate a specific page without providing a valid mapping or
      vma.  This looks a bit odd because it means we are calling migrate_vma_*()
      without setting a valid vma; however, it was considered acceptable at the
      time because the details were internal to migrate_device.c and there was
      only a single user.
      
      One of the reasons the details could be kept internal was that this was
      strictly for migrating device coherent memory.  Such memory can be copied
      directly by the CPU without intervention from a driver.  However this
      isn't true for device private memory, and a future change requires similar
      functionality for device private memory.  So refactor the code into
      something more sensible for migrating device memory without a vma.
      
      Link: https://lkml.kernel.org/r/c7b2ff84e9b33d022cf4a40f87d051f281a16d8f.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      241f6885
    • mm/memremap.c: take a pgmap reference on page allocation · 0dc45ca1
      Alistair Popple authored
      ZONE_DEVICE pages have a struct dev_pagemap which is allocated by a
      driver.  When the struct page is first allocated by the kernel in
      memremap_pages() a reference is taken on the associated pagemap to ensure
      it is not freed prior to the pages being freed.
      
      Prior to 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
      refcount") pages were considered free and returned to the driver when the
      reference count dropped to one.  However the pagemap reference was not
      dropped until the page reference count hit zero.  This would occur as part
      of the final put_page() in memunmap_pages() which would wait for all pages
      to be freed prior to returning.
      
      When the extra refcount was removed the pagemap reference was no longer
      being dropped in put_page().  Instead memunmap_pages() was changed to
      explicitly drop the pagemap references.  This means that memunmap_pages()
      can complete even though pages are still mapped by the kernel which can
      lead to kernel crashes, particularly if a driver frees the pagemap.
      
      To fix this drivers should take a pagemap reference when allocating the
      page.  This reference can then be returned when the page is freed.
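
      A simplified sketch of the pairing described above (not the exact
      memremap.c/driver hunks): the reference taken when a free device page is
      handed out is dropped when that page is finally freed, so the pgmap
      cannot disappear underneath a live page.

        /* when the driver hands out a free device private page */
        percpu_ref_get(&page->pgmap->ref);

        /* when the page's refcount drops to zero and it is freed */
        put_dev_pagemap(page->pgmap);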
      
      Link: https://lkml.kernel.org/r/12d155ec727935ebfbb4d639a03ab374917ea51b.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Fixes: 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page refcount")
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0dc45ca1
    • mm: free device private pages have zero refcount · ef233450
      Alistair Popple authored
      Since 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
      refcount") device private pages have no longer had an extra reference
      count when the page is in use.  However before handing them back to the
      owning device driver we add an extra reference count such that free pages
      have a reference count of one.
      
      This makes it difficult to tell if a page is free or not because both free
      and in use pages will have a non-zero refcount.  Instead we should return
      pages to the driver's page allocator with a zero reference count.  Kernel
      code can then safely use kernel functions such as get_page_unless_zero().
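
      For example (a sketch), lockless kernel code can tell the two states
      apart simply by trying to take a reference:

        if (get_page_unless_zero(page)) {
                /* page was in use; we now hold a reference */
                do_something_with(page);        /* hypothetical */
                put_page(page);
        } else {
                /* refcount was zero: the device private page is free */
        }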
      
      Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef233450
    • mm/memory.c: fix race when faulting a device private page · 16ce101d
      Alistair Popple authored
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.  However
      without these fixes it is possible to crash the kernel from userspace. 
      These crashes can be triggered either by unloading the kernel module or
      unbinding the device from the driver prior to a userspace task exiting. 
      In modules such as Nouveau it is also possible to trigger some of these
      issues by explicitly closing the device file-descriptor prior to the task
      exiting and then accessing device private memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code. 
      Unfortunately I lack hardware to test either of those so any help there
      would be appreciated.  The changes mimic what is done for both Nouveau
      and hmm-tests though so I doubt they will cause problems.
      
      
      This patch (of 8):
      
      When the CPU tries to access a device private page the migrate_to_ram()
      callback associated with the pgmap for the page is called.  However no
      reference is taken on the faulting page.  Therefore a concurrent migration
      of the device private page can free the page and possibly the underlying
      pgmap.  This results in a race which can crash the kernel due to the
      migrate_to_ram() function pointer becoming invalid.  It also means drivers
      can't reliably read the zone_device_data field because the page may have
      been freed with memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
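
      A simplified sketch of the fault-side change (error handling and details
      omitted): the device private page is pinned while the pte lock is still
      held, and only released after the migrate_to_ram() callback returns.

        vmf->page = pfn_swap_entry_to_page(entry);
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                       vmf->address, &vmf->ptl);
        if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
                pte_unmap_unlock(vmf->pte, vmf->ptl);   /* raced: retry fault */
        } else {
                get_page(vmf->page);
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
                put_page(vmf->page);
        }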
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      16ce101d
    • mm/damon: use damon_sz_region() in appropriate place · ab63f63f
      Xin Hao authored
      In many places we can use damon_sz_region() instead of "r->ar.end -
      r->ar.start".
      
      Link: https://lkml.kernel.org/r/20220927001946.85375-2-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Suggested-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ab63f63f
    • mm/damon: move sz_damon_region to damon_sz_region · 652e0446
      Xin Hao authored
      Rename sz_damon_region() to damon_sz_region(), and move it to
      "include/linux/damon.h", because it can be used in many places.
      
      Link: https://lkml.kernel.org/r/20220927001946.85375-1-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Suggested-by: SeongJae Park <sj@kernel.org>
      Reviewed-by: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      652e0446
    • lib/test_meminit: add checks for the allocation functions · ea091fa5
      Xiaoke Wang authored
      alloc_pages(), kmalloc() and vmalloc() are all memory allocation
      functions which can return NULL when an internal memory failure happens.
      So it is better to check their return values in order to catch such
      failures in time and make the tests more robust.
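
      The pattern is simply (a sketch, not the exact test hunks):

        buf = kmalloc(size, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;         /* report failure instead of crashing */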
      
      Link: https://lkml.kernel.org/r/tencent_D44A49FFB420EDCCBFB9221C8D14DFE12908@qq.com
      Signed-off-by: Xiaoke Wang <xkernel.wang@foxmail.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ea091fa5
    • kmsan: unpoison @tlb in arch_tlb_gather_mmu() · ac801e7e
      Alexander Potapenko authored
      This is an optimization to reduce stackdepot pressure.
      
      struct mmu_gather contains 7 1-bit fields packed into a 32-bit unsigned
      int value.  The remaining 25 bits remain uninitialized and are never used,
      but KMSAN updates the origin for them in zap_pXX_range() in mm/memory.c,
      thus creating very long origin chains.  This is technically correct, but
      consumes too much memory.
      
      Unpoisoning the whole structure will prevent creating such chains.
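
      The change amounts to one call in the mmu_gather setup path (sketch):
      mark the whole structure as initialized for KMSAN so the unused bits
      stop generating origin updates.

        kmsan_unpoison_memory(tlb, sizeof(*tlb));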
      
      Link: https://lkml.kernel.org/r/20220905122452.2258262-20-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Ilya Leoshkevich <iii@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Liu Shixin <liushixin2@huawei.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac801e7e
    • ext4,f2fs: fix readahead of verity data · 4fa0e3ff
      Matthew Wilcox (Oracle) authored
      The recent change of page_cache_ra_unbounded() arguments was buggy in the
      two callers, causing us to readahead the wrong pages.  Move the definition
      of ractl down to after the index is set correctly.  This affected
      performance on configurations that use fs-verity.
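
      A sketch of the required ordering (names schematic): the readahead
      control must be initialised with the index it will actually read from,
      otherwise the wrong range is read ahead.

        pgoff_t index = first_missing_index(inode, pos);    /* hypothetical */
        DEFINE_READAHEAD(ractl, NULL, NULL, mapping, index);

        page_cache_ra_unbounded(&ractl, nr_pages, 0);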
      
      Link: https://lkml.kernel.org/r/20221012193419.1453558-1-willy@infradead.org
      Fixes: 73bb49da ("mm/readahead: make page_cache_ra_unbounded take a readahead_control")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reported-by: Jintao Yin <nicememory@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4fa0e3ff
    • mm/mmap: undo ->mmap() when arch_validate_flags() fails · deb0f656
      Carlos Llamas authored
      Commit c462ac28 ("mm: Introduce arch_validate_flags()") added a late
      check in mmap_region() to let architectures validate vm_flags.  The check
      needs to happen after calling ->mmap() as the flags can potentially be
      modified during this callback.
      
      If the arch_validate_flags() check fails we unmap and free the vma.
      However, the error path fails to undo the ->mmap() call that previously
      succeeded and, depending on the specific ->mmap() implementation, this
      translates to reference increments, memory allocations and other
      operations that will not be cleaned up.
      
      There are several places (mainly device drivers) where this is an issue.
      However, one specific example is bpf_map_mmap() which keeps count of the
      mappings in map->writecnt.  The count is incremented on ->mmap() and then
      decremented on vm_ops->close().  When arch_validate_flags() fails this
      count is off since bpf_map_mmap_close() is never called.
      
      One can reproduce this issue on arm64 devices with MTE support.  Here
      the vm_flags are checked to only allow VM_MTE if VM_MTE_ALLOWED has been
      set previously.  From userspace it is then enough to pass the PROT_MTE
      flag to the mmap() syscall to trigger the arch_validate_flags() failure.
      
      The following program reproduces this issue:
      
        #include <stdio.h>
        #include <unistd.h>
        #include <linux/unistd.h>
        #include <linux/bpf.h>
        #include <sys/mman.h>
      
        int main(void)
        {
      	union bpf_attr attr = {
      		.map_type = BPF_MAP_TYPE_ARRAY,
      		.key_size = sizeof(int),
      		.value_size = sizeof(long long),
      		.max_entries = 256,
      		.map_flags = BPF_F_MMAPABLE,
      	};
      	int fd;
      
      	fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
      	mmap(NULL, 4096, PROT_WRITE | PROT_MTE, MAP_SHARED, fd, 0);
      
      	return 0;
        }
      
      By manually adding some log statements to the vm_ops callbacks we can
      confirm that when passing PROT_MTE to mmap() the map->writecnt is off upon
      ->release():
      
      With PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  111.263874] bpf_map_write_active_inc: map=9 writecnt=1
        [  111.288763] bpf_map_release: map=9 writecnt=1
      
      Without PROT_MTE flag:
        root@debian:~# ./bpf-test
        [  157.816912] bpf_map_write_active_inc: map=10 writecnt=1
        [  157.830442] bpf_map_write_active_dec: map=10 writecnt=0
        [  157.832396] bpf_map_release: map=10 writecnt=0
      
      This patch fixes the above issue by calling vm_ops->close() when the
      arch_validate_flags() check fails; after that we can proceed to unmap
      and free the vma on the error path.
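
      A simplified sketch of the resulting error path (not the exact
      mmap_region() hunk; the label name is schematic):

        /* arch_validate_flags() runs after ->mmap() may have updated vm_flags */
        if (!arch_validate_flags(vma->vm_flags)) {
                error = -EINVAL;
                if (vma->vm_ops && vma->vm_ops->close)
                        vma->vm_ops->close(vma);  /* undo ->mmap() side effects */
                goto unmap_and_free_vma;
        }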
      
      Link: https://lkml.kernel.org/r/20220930003844.1210987-1-cmllamas@google.com
      Fixes: c462ac28 ("mm: Introduce arch_validate_flags()")
      Signed-off-by: Carlos Llamas <cmllamas@google.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Cc: Christian Brauner (Microsoft) <brauner@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>	[5.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      deb0f656
  2. 12 Oct, 2022 6 commits
  3. 07 Oct, 2022 5 commits
    • hugetlb: allocate vma lock for all sharable vmas · bbff39cc
      Mike Kravetz authored
      The hugetlb vma lock was originally designed to synchronize pmd sharing. 
      As such, it was only necessary to allocate the lock for vmas that were
      capable of pmd sharing.  Later in the development cycle, it was discovered
      that it could also be used to simplify fault/truncation races as described
      in [1].  However, a subsequent change to allocate the lock for all vmas
      that use the page cache was never made.  A fault/truncation race could
      leave pages in a file past i_size until the file is removed.
      
      Remove the previous restriction and allocate lock for all VM_MAYSHARE
      vmas.  Warn in the unlikely event of allocation failure.
      
      [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
      Fixes: "hugetlb: clean up code checking for fault/truncation races"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bbff39cc
    • hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer · ecfbd733
      Mike Kravetz authored
      hugetlb file truncation/hole punch code may need to back out and take
      locks in order in the routine hugetlb_unmap_file_folio().  This code could
      race with vma freeing as pointed out in [1] and result in accessing a
      stale vma pointer.  To address this, take the vma_lock when clearing the
      vma_lock->vma pointer.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      [mike.kravetz@oracle.com: address build issues]
        Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
      Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
      Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ecfbd733
    • hugetlb: fix vma lock handling during split vma and range unmapping · 131a79b4
      Mike Kravetz authored
      Patch series "hugetlb: fixes for new vma lock series".
      
      In review of the series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", Miaohe Lin pointed out two key issues:
      
      1) There is a race in the routine hugetlb_unmap_file_folio when locks
         are dropped and reacquired in the correct order [1].
      
      2) With the switch to using vma lock for fault/truncate synchronization,
         we need to make sure lock exists for all VM_MAYSHARE vmas, not just
         vmas capable of pmd sharing.
      
      These two issues are addressed here.  In addition, having a vma lock
      present in all VM_MAYSHARE vmas uncovered some issues around vma
      splitting.  Those are also addressed.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      
      This patch (of 3):
      
      The hugetlb vma lock hangs off the vm_private_data field and is specific
      to the vma.  When vm_area_dup() is called as part of vma splitting, the
      vma lock pointer is copied to the new vma.  This will result in issues
      such as double freeing of the structure.  Update the hugetlb open vm_ops
      to allocate a new vma lock for the new vma.
      

      The routine __unmap_hugepage_range_final unconditionally unsets VM_MAYSHARE
      to prevent subsequent pmd sharing.  hugetlb_vma_lock_free attempted to
      anticipate this by checking both VM_MAYSHARE and VM_SHARED.  However, if
      only VM_MAYSHARE was set we would miss the free.  With the introduction of
      the vma lock, a vma can not participate in pmd sharing if vm_private_data
      is NULL.  Instead of clearing VM_MAYSHARE in __unmap_hugepage_range_final,
      free the vma lock to prevent sharing.  Also, update the sharing code to
      make sure vma lock is indeed a condition for pmd sharing. 
      hugetlb_vma_lock_free can then key off VM_MAYSHARE and not miss any vmas.
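
      A rough sketch of the resulting invariant (the helper name below is
      schematic, not the exact hugetlbfs code): a vma can only take part in
      pmd sharing when both VM_MAYSHARE is set and the vma lock was actually
      allocated.

        static bool vma_pmd_shareable(struct vm_area_struct *vma)
        {
                return (vma->vm_flags & VM_MAYSHARE) && vma->vm_private_data;
        }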
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
      Fixes: "hugetlb: add vma based lock for pmd sharing"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      131a79b4
    • mm/mglru: don't sync disk for each aging cycle · 14aa8b2d
      Yu Zhao authored
      wakeup_flusher_threads() was added under the assumption that if a system
      runs out of clean cold pages, it might want to write back dirty pages more
      aggressively so that they can become clean and be dropped.
      
      However, doing so can breach the rate limit a system wants to impose on
      writeback, resulting in early SSD wearout.
      
      Link: https://lkml.kernel.org/r/YzSiWq9UEER5LKup@google.com
      Fixes: bd74fdae ("mm: multi-gen LRU: support page table walks")
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reported-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      14aa8b2d
  4. 03 Oct, 2022 7 commits