1. 22 Feb, 2024 40 commits
    • mm: clarify the spec for set_ptes() · 6280d731
      Ryan Roberts authored
      Patch series "Transparent Contiguous PTEs for User Mappings", v6.
      
      This is a series to opportunistically and transparently use contpte
      mappings (set the contiguous bit in ptes) for user memory when those
      mappings meet the requirements.  The change benefits arm64, but there is
      some (very) minor refactoring for x86 to enable its integration with
      core-mm.
      
      It is part of a wider effort to improve performance by allocating and
      mapping variable-sized blocks of memory (folios).  One aim is for the 4K
      kernel to approach the performance of the 16K kernel, but without breaking
      compatibility and without the associated increase in memory.  Another aim
      is to benefit the 16K and 64K kernels by enabling 2M THP, since this is
      the contpte size for those kernels.  We have good performance data that
      demonstrates both aims are being met (see below).
      
      Of course this is only one half of the change.  We require the mapped
      physical memory to be the correct size and alignment for this to actually
      be useful (i.e.  64K for 4K pages, or 2M for 16K/64K pages).  Fortunately
      folios are solving this problem for us.  Filesystems that support it (XFS,
      AFS, EROFS, tmpfs, ...) will allocate large folios up to the PMD size
      today, and more filesystems are coming.  And for anonymous memory,
      "multi-size THP" is now upstream.
      
      
      Patch Layout
      ============
      
      In this version, I've split the patches to better show each optimization:
      
        - 1-2:    mm prep: misc code and docs cleanups
        - 3-6:    mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
                  generic wrapper around it
        - 7-11:   arm64 prep: Refactor ptep helpers into new layer
        - 12:     functional contpte implementation
        - 13-18:  various optimizations on top of the contpte implementation
      
      
      Testing
      =======
      
      I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
        - mm selftests (inc new tests written for multi-size THP); no regressions
        - Speedometer Java script benchmark in Chromium web browser; no issues
        - Kernel compilation; no issues
        - Various tests under high memory pressure with swap enabled; no issues
      
      
      Performance
      ===========
      
      High Level Use Cases
      ~~~~~~~~~~~~~~~~~~~~
      
      First some high level use cases (kernel compilation and speedometer JavaScript
      benchmarks). These are running on Ampere Altra (I've seen similar improvements
      on Android/Pixel 6).
      
      baseline:                  mm-unstable (mTHP switched off)
      mTHP:                      + enable 16K, 32K, 64K mTHP sizes "always"
      mTHP + contpte:            + this series
      mTHP + contpte + exefolio: + patch at [6], which this series supports
      
      Kernel Compilation with -j8 (negative is faster):
      
      | kernel                    | real-time | kern-time | user-time |
      |---------------------------|-----------|-----------|-----------|
      | baseline                  |      0.0% |      0.0% |      0.0% |
      | mTHP                      |     -5.0% |    -39.1% |     -0.7% |
      | mTHP + contpte            |     -6.0% |    -41.4% |     -1.5% |
      | mTHP + contpte + exefolio |     -7.8% |    -43.1% |     -3.4% |
      
      Kernel Compilation with -j80 (negative is faster):
      
      | kernel                    | real-time | kern-time | user-time |
      |---------------------------|-----------|-----------|-----------|
      | baseline                  |      0.0% |      0.0% |      0.0% |
      | mTHP                      |     -5.0% |    -36.6% |     -0.6% |
      | mTHP + contpte            |     -6.1% |    -38.2% |     -1.6% |
      | mTHP + contpte + exefolio |     -7.4% |    -39.2% |     -3.2% |
      
      Speedometer (positive is faster):
      
      | kernel                    | runs_per_min |
      |:--------------------------|--------------|
      | baseline                  |         0.0% |
      | mTHP                      |         1.5% |
      | mTHP + contpte            |         3.2% |
      | mTHP + contpte + exefolio |         4.5% |
      
      
      Micro Benchmarks
      ~~~~~~~~~~~~~~~~
      
      The following microbenchmarks are intended to demonstrate that the performance
      of fork() and munmap() does not regress. I'm showing results for order-0 (4K)
      mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
      benchmarks.
      
      baseline:                  mm-unstable + batch zap [7] series
      contpte-basic:             + patches 0-19; functional contpte implementation
      contpte-batch:             + patches 20-23; implement new batched APIs
      contpte-inline:            + patch 24; __always_inline to help compiler
      contpte-fold:              + patch 25; fold contpte mapping when sensible
      
      Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
      (on top of MacOS) for reference, although experience suggests this might not be
      the most reliable for performance numbers of this sort:
      
      | FORK           |         order-0        |         order-9        |
      | Ampere Altra   |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      2.7% |       0.0% |      0.2% |
      | contpte-basic  |       6.3% |      1.4% |    1948.7% |      0.2% |
      | contpte-batch  |       7.6% |      2.0% |      -1.9% |      0.4% |
      | contpte-inline |       3.6% |      1.5% |      -1.0% |      0.2% |
      | contpte-fold   |       4.6% |      2.1% |      -1.8% |      0.2% |
      
      | MUNMAP         |         order-0        |         order-9        |
      | Ampere Altra   |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      0.5% |       0.0% |      0.3% |
      | contpte-basic  |       1.8% |      0.3% |    1104.8% |      0.1% |
      | contpte-batch  |      -0.3% |      0.4% |       2.7% |      0.1% |
      | contpte-inline |      -0.1% |      0.6% |       0.9% |      0.1% |
      | contpte-fold   |       0.1% |      0.6% |       0.8% |      0.1% |
      
      | FORK           |         order-0        |         order-9        |
      | Apple M2 VM    |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      1.4% |       0.0% |      0.8% |
      | contpte-basic  |       6.8% |      1.2% |     469.4% |      1.4% |
      | contpte-batch  |      -7.7% |      2.0% |      -8.9% |      0.7% |
      | contpte-inline |      -6.0% |      2.1% |      -6.0% |      2.0% |
      | contpte-fold   |       5.9% |      1.4% |      -6.4% |      1.4% |
      
      | MUNMAP         |         order-0        |         order-9        |
      | Apple M2 VM    |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      0.6% |       0.0% |      0.4% |
      | contpte-basic  |       1.6% |      0.6% |     233.6% |      0.7% |
      | contpte-batch  |       1.9% |      0.3% |      -3.9% |      0.4% |
      | contpte-inline |       2.2% |      0.8% |      -1.6% |      0.9% |
      | contpte-fold   |       1.5% |      0.7% |      -1.7% |      0.7% |
      
      Misc
      ~~~~
      
      John Hubbard at Nvidia has reported dramatic 10x performance improvements
      for some workloads at [8], when using a 64K base-page kernel.
      
      [1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
      [2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@arm.com/
      [3] https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@arm.com/
      [4] https://lore.kernel.org/lkml/20231218105100.172635-1-ryan.roberts@arm.com/
      [5] https://lore.kernel.org/linux-mm/633af0a7-0823-424f-b6ef-374d99483f05@arm.com/
      [6] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647dc3@arm.com/
      [7] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
      [8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
      [9] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v6
      
      
      
      
      This patch (of 18):
      
      The set_ptes() spec implies that it can only be used to set a present pte,
      because it interprets the PFN field in order to increment it.  However,
      set_pte_at() has been implemented on top of set_ptes() since set_ptes()
      was introduced, and set_pte_at() allows setting a pte to a not-present
      state.  So clarify the spec to state that when nr==1, the new state of the
      pte may be present or not present.  When nr>1, the new state of all ptes
      must be present.
      
      While we are at it, tighten the spec to set requirements around the
      initial state of ptes; when nr==1 it may be either present or not-present.
      But when nr>1 all ptes must initially be not-present.  All set_ptes()
      callsites already conform to this requirement.  Stating it explicitly is
      useful because it allows for a simplification to the upcoming arm64
      contpte implementation.
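
      As background, a simplified sketch of what the generic set_ptes() fallback in
      include/linux/pgtable.h looks like (page-table-check and lazy-MMU hooks
      omitted), with the clarified contract spelled out as comments; set_pte_at()
      is just set_ptes() with nr==1:

              static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                              pte_t *ptep, pte_t pte, unsigned int nr)
              {
                      /*
                       * nr == 1: the new pte may be present or not-present.
                       * nr > 1:  all new ptes must be present and all affected ptes
                       *          must initially be not-present; the PFN advances by
                       *          one page for each successive entry.
                       */
                      for (;;) {
                              set_pte(ptep, pte);
                              if (--nr == 0)
                                      break;
                              ptep++;
                              pte = pte_next_pfn(pte);
                      }
              }

              #define set_pte_at(mm, addr, ptep, pte) set_ptes(mm, addr, ptep, pte, 1)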
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20240215103205.2607016-2-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6280d731
    • mm/memory: optimize unmap/zap with PTE-mapped THP · 10ebac4f
      David Hildenbrand authored
      Similar to how we optimized fork(), let's implement PTE batching when
      consecutive (present) PTEs map consecutive pages of the same large folio.
      
      Most infrastructure we need for batching (mmu gather, rmap) is already
      there.  We only have to add get_and_clear_full_ptes() and
      clear_full_ptes().  Similarly, extend zap_install_uffd_wp_if_needed() to
      process a PTE range.
      
      We won't bother sanity-checking the mapcount of all subpages, but only
      check the mapcount of the first subpage we process.  If there is a real
      problem hiding somewhere, we can trigger it simply by using small folios,
      or when we zap single pages of a large folio.  Ideally, we had that check
      in rmap code (including for delayed rmap), but then we cannot print the
      PTE.  Let's keep it simple for now.  If we ever have a cheap
      folio_mapcount(), we might just want to check for underflows there.
      
      To keep small folios as fast as possible force inlining of a specialized
      variant using __always_inline with nr=1.
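
      A minimal sketch of what the generic get_and_clear_full_ptes() fallback can
      look like, assuming it loops over ptep_get_and_clear_full() and folds the
      accessed/dirty bits of the whole batch into the returned pte
      (clear_full_ptes() would be the same loop without collecting the result):

              static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
                              unsigned long addr, pte_t *ptep, unsigned int nr, int full)
              {
                      pte_t pte, tmp_pte;

                      pte = ptep_get_and_clear_full(mm, addr, ptep, full);
                      while (--nr) {
                              ptep++;
                              addr += PAGE_SIZE;
                              tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
                              /* accumulate accessed/dirty into the returned pte */
                              if (pte_dirty(tmp_pte))
                                      pte = pte_mkdirty(pte);
                              if (pte_young(tmp_pte))
                                      pte = pte_mkyoung(pte);
                      }
                      return pte;
              }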
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-11-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      10ebac4f
    • mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing · e61abd44
      David Hildenbrand authored
      In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now
      up to 256 folio fragments that span more than one page, before we
      conditionally reschedule.
      
      It's a pain that we have to handle cond_resched() in
      tlb_batch_pages_flush() manually and cannot simply handle it in
      release_pages() -- release_pages() can be called from atomic context. 
      Well, in a perfect world we wouldn't have to make our code more
      complicated at all.
      
      With page poisoning and init_on_free, we might now run into soft lockups
      when we free a lot of rather large folio fragments, because page freeing
      time then depends on the actual memory size we are freeing instead of on
      the number of folios that are involved.
      
      In the absolute (unlikely) worst case, on arm64 with 64k we will be able
      to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
      GiB does sound like it might take a while.  But instead of ignoring this
      unlikely case, let's just handle it.
      
      So, let's teach tlb_batch_pages_flush() that there are some configurations
      where page freeing is horribly slow, and let's reschedule more frequently --
      similar to what we did before we had large folio fragments in there.  Avoid
      yet another loop over all encoded pages in the common case by handling that
      separately.
      
      Note that with page poisoning/zeroing, we might now end up freeing only a
      single folio fragment at a time that might exceed the old 512 pages limit:
      but if we cannot even free a single MAX_ORDER page on a system without
      running into soft lockups, something else is already completely bogus. 
      Freeing a PMD-mapped THP would similarly cause trouble.
      
      In theory, we might even free 511 order-0 pages + a single MAX_ORDER page,
      effectively having to zero out 8703 pages on arm64 with 64k, translating
      to ~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups,
      544 MiB is unlikely to result in soft lockups, so we won't care about that
      for the time being.
      
      In the future, we might want to detect if handling cond_resched() is
      required at all, and just not do any of that with full preemption enabled.
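
      A rough sketch of the reworked flush loop, with illustrative helper names
      (not necessarily those used in the patch): only when page poisoning or
      init_on_free make freeing expensive do we bound each chunk by the memory it
      covers rather than by the number of array entries:

              struct encoded_page **pages = batch->encoded_pages;

              while (batch->nr) {
                      unsigned int nr;

                      if (!page_poisoning_enabled_static() && !want_init_on_free()) {
                              /* common case: freeing cost scales with the number of folios */
                              nr = min(512U, batch->nr);
                      } else {
                              /*
                               * Freeing also poisons/zeroes the memory, so bound the chunk
                               * by the number of pages it covers (helper is illustrative).
                               */
                              nr = nr_entries_covering_limited_pages(pages, batch->nr);
                      }
                      free_encoded_page_batch(pages, nr);     /* illustrative free helper */
                      pages += nr;
                      batch->nr -= nr;
                      cond_resched();
              }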
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e61abd44
    • mm/mmu_gather: add __tlb_remove_folio_pages() · d7f861b9
      David Hildenbrand authored
      Add __tlb_remove_folio_pages(), which will remove multiple consecutive
      pages that belong to the same large folio, instead of only a single page. 
      We'll be using this function when optimizing unmapping/zapping of large
      folios that are mapped by PTEs.
      
      We're using the remaining spare bit in an encoded_page to indicate that
      the next encoded page in an array actually contains a shifted "nr_pages".
      Teach swap/freeing code about putting multiple folio references, and
      delayed rmap handling to remove page ranges of a folio.
      
      This extension allows for still gathering almost as many small folios as
      we used to (-1, because we have to prepare for a possibly bigger next
      entry), but still allows for gathering consecutive pages that belong to
      the same large folio.
      
      Note that we don't pass the folio pointer, because it is not required for
      now.  Further, we don't support page_size != PAGE_SIZE, it won't be
      required for simple PTE batching.
      
      We have to provide a separate s390 implementation, but it's fairly
      straightforward.
      
      Another, more invasive and likely more expensive, approach would be to use
      folio+range or a PFN range instead of page+nr_pages.  But, we should do
      that consistently for the whole mmu_gather.  For now, let's keep it simple
      and add "nr_pages" only.
      
      Note that it is now possible to gather significantly more pages: In the
      past, we were able to gather ~10000 pages, now we can also gather ~5000
      folio fragments that span multiple pages.  A folio fragment on x86-64 can
      span up to 512 pages (2 MiB THP) and on arm64 with 64k in theory 8192
      pages (512 MiB THP).  Gathering more memory is not considered something we
      should worry about, especially because these are already corner cases.
      
      While we can gather more total memory, we won't free more folio fragments.
      As long as page freeing time primarily only depends on the number of
      involved folios, there is no effective change for !preempt configurations.
      However, we'll adjust tlb_batch_pages_flush() separately to handle corner
      cases where page freeing time grows proportionally with the actual memory
      size.
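
      A sketch of how queuing into the batch could look, assuming flag/helper names
      along the lines of ENCODED_PAGE_BIT_NR_PAGES_NEXT and encode_nr_pages(): when
      more than one page of a folio is gathered, the following array slot carries a
      shifted count instead of a page pointer (which is also why one fewer small
      folio fits per batch -- a (page, nr) pair must always fit):

              if (likely(nr_pages == 1)) {
                      batch->encoded_pages[batch->nr++] = encode_page(page, flags);
              } else {
                      /* the next slot is not a page pointer but the page count */
                      flags |= ENCODED_PAGE_BIT_NR_PAGES_NEXT;
                      batch->encoded_pages[batch->nr++] = encode_page(page, flags);
                      batch->encoded_pages[batch->nr++] = encode_nr_pages(nr_pages);
              }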
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d7f861b9
    • mm/mmu_gather: add tlb_remove_tlb_entries() · 4d5bf0b6
      David Hildenbrand authored
      Let's add a helper that lets us batch-process multiple consecutive PTEs.
      
      Note that the loop will get optimized out on all architectures except on
      powerpc.  We have to add an early define of __tlb_remove_tlb_entry() on
      ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries()
      a macro).
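
      A minimal sketch of such a helper in asm-generic/tlb.h terms: record the
      whole PTE range once, then invoke the per-entry hook for each pte (a no-op
      on most architectures, so the loop is optimized away):

              static inline void tlb_remove_tlb_entries(struct mmu_gather *tlb,
                              pte_t *ptep, unsigned int nr, unsigned long address)
              {
                      tlb_flush_pte_range(tlb, address, PAGE_SIZE * nr);
                      for (;;) {
                              __tlb_remove_tlb_entry(tlb, ptep, address);
                              if (--nr == 0)
                                      break;
                              ptep++;
                              address += PAGE_SIZE;
                      }
              }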
      
      [arnd@kernel.org: change __tlb_remove_tlb_entry() to an inline function]
        Link: https://lkml.kernel.org/r/20240221154549.2026073-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20240214204435.167852-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d5bf0b6
    • mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP · da510964
      David Hildenbrand authored
      Nowadays, encoded pages are only used in mmu_gather handling.  Let's
      update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP.  While
      at it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.
      
      If encoded page pointers would ever be used in other context again, we'd
      likely want to change the defines to reflect their context (e.g.,
      ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP).  For now, let's keep it simple.
      
      This is a preparation for using the remaining spare bit to indicate that
      the next item in an array of encoded pages is a "nr_pages" argument and
      not an encoded page.
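
      Roughly, the encoding packs the flag bits into the low bits of the page
      pointer; a simplified sketch of the renamed defines and the existing
      accessors (details may differ slightly):

              /* mask of the low pointer bits usable as flags */
              #define ENCODED_PAGE_BITS               3ul

              /* perform rmap removal only after the TLB has been flushed */
              #define ENCODED_PAGE_BIT_DELAY_RMAP     1ul

              static inline struct encoded_page *encode_page(struct page *page,
                                                             unsigned long flags)
              {
                      return (struct encoded_page *)(flags | (unsigned long)page);
              }

              static inline unsigned long encoded_page_flags(struct encoded_page *page)
              {
                      return ENCODED_PAGE_BITS & (unsigned long)page;
              }

              static inline struct page *encoded_page_ptr(struct encoded_page *page)
              {
                      return (struct page *)(~ENCODED_PAGE_BITS & (unsigned long)page);
              }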
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      da510964
    • mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size() · c30d6bc8
      David Hildenbrand authored
      We have two bits available in the encoded page pointer to store additional
      information.  Currently, we use one bit to request delay of the rmap
      removal until after a TLB flush.
      
      We want to make use of the remaining bit internally for batching of
      multiple pages of the same folio, specifying that the next encoded page
      pointer in an array is actually "nr_pages".  So pass page + delay_rmap
      flag instead of an encoded page, to handle the encoding internally.
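
      In effect the interface changes along these lines, with the encoding done
      inside mmu_gather rather than at the call sites (a sketch, not the exact
      diff):

              /* before: callers had to encode the page themselves */
              bool __tlb_remove_page_size(struct mmu_gather *tlb,
                                          struct encoded_page *page, int page_size);

              /*
               * after: callers pass the flag and mmu_gather encodes internally,
               * e.g. encode_page(page, delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0)
               */
              bool __tlb_remove_page_size(struct mmu_gather *tlb, struct page *page,
                                          bool delay_rmap, int page_size);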
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c30d6bc8
    • mm/memory: factor out zapping folio pte into zap_present_folio_pte() · 2b42a7e5
      David Hildenbrand authored
      Let's prepare for further changes by factoring it out into a separate
      function.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2b42a7e5
    • mm/memory: further separate anon and pagecache folio handling in zap_present_pte() · d11838ed
      David Hildenbrand authored
      We don't need up-to-date accessed-dirty information for anon folios and
      can simply work with the ptent we already have.  Also, we know the RSS
      counter we want to update.
      
      We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
      zap_install_uffd_wp_if_needed() after updating the folio and RSS.
      
      While at it, only call zap_install_uffd_wp_if_needed() if there is even
      any chance that pte_install_uffd_wp_if_needed() would do *something*. 
      That is, just don't bother if uffd-wp does not apply.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d11838ed
    • mm/memory: handle !page case in zap_present_pte() separately · 0cf18e83
      David Hildenbrand authored
      We don't need uptodate accessed/dirty bits, so in theory we could replace
      ptep_get_and_clear_full() by an optimized ptep_clear_full() function. 
      Let's rely on the provided pte.
      
      Further, there is no scenario where we would have to insert uffd-wp
      markers when zapping something that is not a normal page (i.e., zeropage).
      Add a sanity check to make sure this remains true.
      
      should_zap_folio() no longer has to handle NULL pointers.  This change
      replaces 2/3 "!page/!folio" checks by a single "!page" one.
      
      Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to
      detect shadow stack entries.  But for shadow stack entries, the HW dirty
      bit (in combination with non-writable PTEs) is set by software.  So for
      the arch_check_zapped_pte() check, we don't have to sync against HW
      setting the HW dirty bit concurrently, it is always set.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0cf18e83
    • mm/memory: factor out zapping of present pte into zap_present_pte() · 789753e1
      David Hildenbrand authored
      Patch series "mm/memory: optimize unmap/zap with PTE-mapped THP", v3.
      
      This series is based on [1].  Similar to what we did with fork(), let's
      implement PTE batching during unmap/zap when processing PTE-mapped THPs.
      
      We collect consecutive PTEs that map consecutive pages of the same large
      folio, making sure that the other PTE bits are compatible, and (a) adjust
      the refcount only once per batch, (b) call rmap handling functions only
      once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
      entry removal once per batch.
      
      Ryan was previously working on this in the context of cont-pte for arm64,
      in the latest iteration [2] with a focus on arm64 with cont-pte only.  This
      series implements the optimization for all architectures, independent of
      such PTE bits, teaches MMU gather/TLB code to be fully aware of such
      large-folio-pages batches as well, and makes use of our new rmap batching
      function when removing the rmap.
      
      To achieve that, we have to enlighten MMU gather / page freeing code
      (i.e., everything that consumes encoded_page) to process unmapping of
      consecutive pages that all belong to the same large folio.  I'm being very
      careful to not degrade order-0 performance, and it looks like I managed to
      achieve that.
      
      While this series should -- similar to [1] -- be beneficial for adding
      cont-pte support on arm64[2], it's one of the requirements for maintaining
      a total mapcount[3] for large folios with minimal added overhead and
      further changes[4] that build up on top of the total mapcount.
      
      Independent of all that, this series results in a speedup during munmap()
      and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
      with PTE-mapped THP, which is the default with THPs that are smaller than
      a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
      
      On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
      PTE-mapped folios of the same size (stddev < 1%) results in the following
      runtimes for munmap() in seconds (shorter is better):
      
      Folio Size | mm-unstable |      New | Change
      ---------------------------------------------
            4KiB |    0.058110 | 0.057715 |   - 1%
           16KiB |    0.044198 | 0.035469 |   -20%
           32KiB |    0.034216 | 0.023522 |   -31%
           64KiB |    0.029207 | 0.018434 |   -37%
          128KiB |    0.026579 | 0.014026 |   -47%
          256KiB |    0.025130 | 0.011756 |   -53%
          512KiB |    0.024292 | 0.010703 |   -56%
         1024KiB |    0.023812 | 0.010294 |   -57%
         2048KiB |    0.023785 | 0.009910 |   -58%
      
      [1] https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
      [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
      [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
      [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
      
      
      This patch (of 10):
      
      Let's prepare for further changes by factoring out processing of present
      PTEs.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240214204435.167852-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      789753e1
    • selftests: add zswapin and no zswap tests · b93c28ff
      Nhat Pham authored
      Add a selftest to cover the zswapin code path, allocating more memory than
      the cgroup limit to trigger swapout/zswapout, then reading the pages back
      in memory several times.  This is inspired by a recently encountered
      kernel crash on the zswapin path in our internal kernel, which went
      undetected because of a lack of test coverage for this path.
      
      Add a selftest to verify that when memory.zswap.max = 0, no pages can go
      to the zswap pool for the cgroup.
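
      A condensed sketch of the "no zswap when memory.zswap.max is 0" check, using
      the cgroup selftest helpers (cg_name(), cg_create(), cg_write(), cg_run(),
      cg_destroy()) as I recall them; allocate_and_touch() and get_zswpout() stand
      in for the real workload and memory.stat parsing, and the actual test_zswap.c
      code may differ in detail:

              static int test_no_zswap(const char *root)
              {
                      char *cg = cg_name(root, "no_zswap_test");
                      long zswpout_before, zswpout_after;
                      int ret = KSFT_FAIL;

                      if (cg_create(cg) || cg_write(cg, "memory.zswap.max", "0"))
                              goto out;

                      zswpout_before = get_zswpout(cg);       /* illustrative */
                      /* allocate past memory.max inside the cgroup to force swapout */
                      if (cg_run(cg, allocate_and_touch, NULL))
                              goto out;
                      zswpout_after = get_zswpout(cg);        /* illustrative */

                      /* with memory.zswap.max == 0, nothing may enter the zswap pool */
                      if (zswpout_after == zswpout_before)
                              ret = KSFT_PASS;
              out:
                      cg_destroy(cg);
                      free(cg);
                      return ret;
              }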
      
      [nphamcs@gmail.com: remove redundant comment, add success checks]
        Link: https://lkml.kernel.org/r/20240222043132.616320-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240205225608.3083251-4-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Rik van Riel <riel@surriel.com>
      Suggested-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b93c28ff
    • selftests: fix the zswap invasive shrink test · 012688f6
      Nhat Pham authored
      The zswap "no invasive shrink" selftest breaks because we renamed the zswap
      writeback counter (see [1]).  Fix the test.
      
      [1]: https://patchwork.kernel.org/project/linux-kselftest/patch/20231205193307.2432803-1-nphamcs@gmail.com/
      
      Link: https://lkml.kernel.org/r/20240205225608.3083251-3-nphamcs@gmail.com
      Fixes: a697dc2b ("selftests: cgroup: update per-memcg zswap writeback selftest")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      012688f6
    • selftests: zswap: add zswap selftest file to zswap maintainer entry · 2b2178c4
      Nhat Pham authored
      Patch series "fix and extend zswap kselftests", v3.
      
      Fix a broken zswap kselftest due to cgroup zswap writeback counter
      renaming, and add 2 zswap kselftests, one to cover the (z)swapin case, and
      another to check that no zswapping happens when the cgroup limit is 0.
      
      Also, add the zswap kselftest file to zswap maintainer entry so that
      get_maintainers script can find zswap maintainers.
      
      
      This patch (of 3):
      
      Make it easier for contributors to find the zswap maintainers when they
      update the zswap tests.
      
      Link: https://lkml.kernel.org/r/20240205225608.3083251-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240205225608.3083251-2-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2b2178c4
    • mm: compaction: limit the suitable target page order to be less than cc->order · 1883e8ac
      Baolin Wang authored
      Isolating target free pages whose order exceeds cc->order cannot improve
      fragmentation, especially when cc->order is less than pageblock_order.  For
      example, suppose pageblock_order is MAX_ORDER (size 4M) and cc->order is the
      2M THP size: we should not isolate other free 2M pages to be the migration
      target, since that does not improve fragmentation.
      
      Moreover this is also applicable for large folio compaction.
      
      Link: https://lkml.kernel.org/r/afcd9377351c259df7a25a388a4a0d5862b986f4.1705928395.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1883e8ac
    • zram: do not allocate physically contiguous strm buffers · 45866e0e
      Barry Song authored
      Currently zram allocates 2 physically contiguous pages for each per-CPU
      compression stream (we may have up to 4 streams per CPU).  Since those
      buffers are per-CPU, we allocate them from the CPU hotplug path, which has
      a higher risk of failed allocations on devices with fragmented memory.
      
      Switch to virtually contiguous allocations - crypto comp does not seem to
      impose any requirement that compression working buffers be physically
      contiguous.
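
      A sketch of the kind of change described, for a per-CPU stream setup/teardown
      path in zcomp (exact call sites and the old allocator call may differ): the
      two-page working buffer only needs to be virtually contiguous, so vzalloc()
      and vfree() are sufficient:

              /* was: a physically contiguous 2-page allocation */
              zstrm->buffer = vzalloc(2 * PAGE_SIZE);
              if (!zstrm->buffer)
                      return -ENOMEM;

              /* ... and on teardown, the matching free: */
              vfree(zstrm->buffer);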
      
      Link: https://lkml.kernel.org/r/20240213065400.6561-1-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      45866e0e
    • mm/hugetlb: move page order check inside hugetlb_cma_reserve() · ce70cfb1
      Anshuman Khandual authored
      All platforms could benefit from page order check against MAX_PAGE_ORDER
      before allocating a CMA area for gigantic hugetlb pages.  Let's move this
      check from individual platforms to generic hugetlb.
      
      Link: https://lkml.kernel.org/r/20240209054221.1403364-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ce70cfb1
    • mm/mglru: improve swappiness handling · 4acef569
      Kinsey Ho authored
      The number of reclaimable anon pages used to set the initial reclaim
      priority is based only on get_swappiness().  Use can_reclaim_anon_pages()
      to also account for NUMA node demotion.
      
      Also move the swappiness handling of when !__GFP_IO in
      try_to_shrink_lruvec() into isolate_folios().
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-6-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4acef569
    • mm/mglru: improve struct lru_gen_mm_walk · cc25bbe1
      Kinsey Ho authored
      Rename max_seq to seq in struct lru_gen_mm_walk to keep consistent with
      struct lru_gen_mm_state.  Note that seq is not always up to date with
      max_seq from lru_gen_folio.
      
      No functional changes.
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-5-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cc25bbe1
    • mm/mglru: improve reset_mm_stats() · 2d823764
      Kinsey Ho authored
      struct lruvec* is already a field of struct lru_gen_mm_walk.  Remove the
      struct lruvec* parameter from functions that already have access to struct
      lru_gen_mm_walk*.
      
      Also, we do not need to reset histogram stats when !should_walk_mmu().
      Remove the call to reset_mm_stats() in iterate_mm_list_nowalk().
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-4-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2d823764
    • mm/mglru: improve should_run_aging() · 51973cc9
      Kinsey Ho authored
      scan_control *sc does not need to be passed into should_run_aging(), as it
      provides only the reclaim priority.  This can be moved to
      get_nr_to_scan().
      
      Refactor should_run_aging() and get_nr_to_scan() to improve code
      readability.  No functional changes.
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-3-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      51973cc9
    • mm/mglru: drop unused parameter · 1ce2292c
      Kinsey Ho authored
      Patch series "mm/mglru: code cleanup and refactoring"
      
      This provides MGLRU code cleanup and refactoring for better readability.
      
      
      This patch (of 5):
      
      struct scan_control *sc is currently passed into try_to_inc_max_seq() and
      run_aging().  This parameter is not used.
      
      Drop the unused parameter struct scan_control *sc. No functional change.
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-1-kinseyho@google.com
      Link: https://lkml.kernel.org/r/20240214060538.3524462-2-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1ce2292c
    • kasan/test: avoid gcc warning for intentional overflow · e10aea10
      Arnd Bergmann authored
      The out-of-bounds test allocates an object that is three bytes too short
      in order to validate the bounds checking.  Starting with gcc-14, this
      causes a compile-time warning as gcc has grown smart enough to understand
      the sizeof() logic:
      
      mm/kasan/kasan_test.c: In function 'kmalloc_oob_16':
      mm/kasan/kasan_test.c:443:14: error: allocation of insufficient size '13' for type 'struct <anonymous>' with size '16' [-Werror=alloc-size]
        443 |         ptr1 = kmalloc(sizeof(*ptr1) - 3, GFP_KERNEL);
            |              ^
      
      Hide the actual computation behind a RELOC_HIDE() that ensures
      the compiler misses the intentional bug.
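
      One plausible shape of the fix, under the assumption that hiding the computed
      size behind RELOC_HIDE() is what keeps gcc from proving the allocation too
      small (a sketch of the idea, not the exact hunk):

              /* before: gcc-14 sees sizeof(*ptr1) - 3 and emits -Walloc-size */
              ptr1 = kmalloc(sizeof(*ptr1) - 3, GFP_KERNEL);

              /* after: the optimizer can no longer see the undersize; KASAN still can */
              ptr1 = kmalloc(RELOC_HIDE(sizeof(*ptr1), -3), GFP_KERNEL);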
      
      Link: https://lkml.kernel.org/r/20240212111609.869266-1-arnd@kernel.org
      Fixes: 3f15801c ("lib: add kasan test module")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e10aea10
    • mm: document memalloc_noreclaim_save() and memalloc_pin_save() · cfb837e8
      Vlastimil Babka authored
      The memalloc_noreclaim_save() function currently has no documentation
      comment, so the implications of its usage are not obvious.  Namely that it
      not only prevents entering reclaim (as the name suggests), but also allows
      using all memory reserves and thus should be only used in contexts that
      are allocating memory to free memory.  This may lead to new improper
      usages being added.
      
      Thus add a documenting comment, based on the description of
      __GFP_MEMALLOC.  While at it, also document memalloc_pin_save() so that
      all the memalloc_ scopes are documented.  For those already documented,
      add missing Return: descriptions, and mark Context: description per
      kernel-docs style guide.
      
      In the comments describing the relevant PF_MEMALLOC flags, refer to their
      scope setting functions.
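
      For illustration, the usage pattern the new comment documents: only code that
      allocates memory in order to free memory (reclaim, writing out swap, OOM
      handling, ...) should open such a scope, and it must be closed with the
      paired restore (do_work_that_frees_memory() is a stand-in):

              unsigned int noreclaim_flag;

              /* PF_MEMALLOC scope: no reclaim recursion, memory reserves may be used */
              noreclaim_flag = memalloc_noreclaim_save();
              do_work_that_frees_memory();
              memalloc_noreclaim_restore(noreclaim_flag);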
      
      [vbabka@suse.cz: fix issues that Mike pointed out]
        Link: https://lkml.kernel.org/r/20240215095827.13756-2-vbabka@suse.cz
      Link: https://lkml.kernel.org/r/20240212182950.32730-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kent Overstreet <kent.overstreet@linux.dev>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cfb837e8
    • mm/zswap: optimize and cleanup the invalidation of duplicate entry · f576a1e8
      Chengming Zhou authored
      We may encounter a duplicate entry in zswap_store():
      
      1. A swap slot freed to the per-cpu swap cache without invalidating
         its zswap entry can get reused. This has been fixed.
      
      2. In !exclusive load mode, a swapin folio leaves its zswap entry
         on the tree and can be swapped out again. This has been removed.
      
      3. A folio can be dirtied again after zswap_store(), so it needs to be
         zswap_store()'d again. This should be handled correctly.
      
      So we must invalidate the old duplicate entry before inserting the
      new one, which actually doesn't have to be done at the beginning
      of zswap_store().
      
      The benefit is that we don't need to lock the tree twice in the normal
      store success path.  Also clean up the loop while we are here.
      
      Note that we still need to invalidate the old duplicate entry when the store
      fails or zswap is disabled, otherwise the new data in the swapfile could be
      overwritten by the old data in the zswap pool during lru writeback.
      
      Link: https://lkml.kernel.org/r/20240209044112.3883835-1-chengming.zhou@linux.dev
      Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Chris Li <chrisl@kernel.org>
      Acked-by: Nhat Pham <nphamcs@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f576a1e8
    • selftests/mm: log a consistent test name for check_compaction · f3b7568c
      Mark Brown authored
      Every test result report in the compaction test prints a distinct log
      message, and some of the reports print a name that varies at runtime.  This
      causes problems for automation, since a lot of automation software uses the
      printed string as the name of the test; if the name varies from run to run
      and from pass to fail, the automation software can't identify that a test
      changed result or that the same tests are being run.
      
      Refactor the logging to use a consistent name when printing the result of
      the test, printing the existing messages as diagnostic information instead
      so they are still available for people trying to interpret the results.
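
      With the kselftest helpers this amounts to routing the variable detail through
      ksft_print_msg() and always reporting a single fixed test name; roughly
      (variable names are illustrative):

              /* diagnostic detail, free-form */
              ksft_print_msg("compaction index: %d, expected at least: %d\n",
                             compaction_index, expected_index);

              /* result line: the name is constant across runs and across pass/fail */
              ksft_test_result(compaction_index >= expected_index, "check_compaction\n");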
      
      Link: https://lkml.kernel.org/r/20240209-kselftest-mm-cleanup-v1-2-a3c0386496b5@kernel.org
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f3b7568c
    • selftests/mm: log skipped compaction test as a skip · 9c1490d9
      Mark Brown authored
      Patch series "selftests/mm: Output cleanups for the compaction test".
      
      A couple of small updates for the check_compaction selftest which make
      it play more nicely with test automation systems.
      
      
      This patch (of 2):
      
      When the compaction test is run, it checks to make sure that the prerequisites
      the test requires are available, and skips the test if not.  When this
      happens we currently log the test as a pass rather than a skip; log it as a
      skip so that the distinction is clear and automation can see unexpected skips.
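
      With the same helpers, an unmet prerequisite would then be reported roughly as
      (the condition name is illustrative):

              if (!prerequisites_available) {
                      ksft_test_result_skip("check_compaction\n");
                      return KSFT_SKIP;
              }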
      
      Link: https://lkml.kernel.org/r/20240209-kselftest-mm-cleanup-v1-0-a3c0386496b5@kernel.org
      Link: https://lkml.kernel.org/r/20240209-kselftest-mm-cleanup-v1-1-a3c0386496b5@kernel.org
      Signed-off-by: Mark Brown <broonie@kernel.org>
      Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c1490d9
    • mm: compaction: refactor compact_node() · 3e40b3f4
      Kefeng Wang authored
      Refactor compact_node() to handle both proactive and synchronous memory
      compaction, which cleans up the code a bit.
      
      Link: https://lkml.kernel.org/r/20240208013607.1731817-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3e40b3f4
    • mm/cma: add sysfs file 'release_pages_success' · b9ad003a
      Anshuman Khandual authored
      This adds the following new sysfs file tracking the number of successfully
      released pages from a given CMA heap area.  This file will be available
      via CONFIG_CMA_SYSFS and help in determining active CMA pages available on
      the CMA heap area.  This adds a new 'nr_pages_released' (CONFIG_CMA_SYSFS)
      into 'struct cma' which gets updated during cma_release().
      
      /sys/kernel/mm/cma/<cma-heap-area>/release_pages_success
      
      After this change, a user will be able to find the number of active CMA pages
      available in a given CMA heap area via the following method.
      
      Active pages = alloc_pages_success - release_pages_success
      
      That's valuable information for both software designers and system admins,
      as it allows them to tune the number of CMA pages available in the system.
      This increases user visibility into an allocated CMA area and its
      utilization.
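
      A small userspace sketch of the computation suggested above, reading the new
      counter together with the existing alloc_pages_success one (error handling
      trimmed; <cma-heap-area> is a placeholder for the real area name):

              #include <stdio.h>

              static long read_counter(const char *path)
              {
                      long val = 0;
                      FILE *f = fopen(path, "r");

                      if (f) {
                              fscanf(f, "%ld", &val);
                              fclose(f);
                      }
                      return val;
              }

              int main(void)
              {
                      long alloced  = read_counter("/sys/kernel/mm/cma/<cma-heap-area>/alloc_pages_success");
                      long released = read_counter("/sys/kernel/mm/cma/<cma-heap-area>/release_pages_success");

                      printf("active CMA pages: %ld\n", alloced - released);
                      return 0;
              }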
      
      Link: https://lkml.kernel.org/r/20240206045731.472759-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b9ad003a
    • SeongJae Park's avatar
      selftests/damon/_chk_dependency: get debugfs mount point from /proc/mounts · 501e3dc5
      SeongJae Park authored
      The DAMON debugfs selftests dependency checker assumes debugfs is mounted
      at /sys/kernel/debug.  That is ok for many cases, but some systems might
      mount the file system somewhere else.  Parse the real mount point from
      the /proc/mounts file.
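      
      A minimal sketch of the idea (in Python for illustration only; the actual
      dependency checker is a shell script):
      
      # Hedged sketch: locate the debugfs mount point by parsing /proc/mounts
      # rather than assuming /sys/kernel/debug.
      def debugfs_mount_point(default="/sys/kernel/debug"):
          with open("/proc/mounts") as mounts:
              for line in mounts:
                  # /proc/mounts fields: device mountpoint fstype options dump pass
                  fields = line.split()
                  if len(fields) >= 3 and fields[2] == "debugfs":
                      return fields[1]
          return default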
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-9-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      501e3dc5
    • SeongJae Park's avatar
      selftests/damon: add a test for the pid leak of dbgfs_target_ids_write() · f08db42b
      SeongJae Park authored
      Commit ebb3f994 ("mm/damon/dbgfs: fix 'struct pid' leaks in
      'dbgfs_target_ids_write()'") fixes a pid leak bug in the DAMON debugfs
      interface, namely in the dbgfs_target_ids_write() function.  Add a selftest
      for the issue to prevent the problem from mistakenly recurring.
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-8-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f08db42b
    • SeongJae Park's avatar
      selftests/damon: add a test for a race between target_ids_read() and dbgfs_before_terminate() · e6255a29
      SeongJae Park authored
      Commit 34796417 ("mm/damon/dbgfs: protect targets destructions with
      kdamond_lock") fixed a race in the DAMON debugfs interface.  Specifically, the
      race was happening between target_ids_read() and dbgfs_before_terminate().
      Add a test for the issue to prevent the problem from accidentally
      recurring.
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-7-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e6255a29
    • SeongJae Park's avatar
      selftests/damon: add a test for DAMOS apply intervals · ce7a2834
      SeongJae Park authored
      Add a selftest for DAMOS apply intervals.  It runs two schemes having
      different apply intervals against an artificial memory access workload,
      and checks if the scheme with the smaller apply interval was applied more
      frequently.
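      
      A minimal sketch of the check, assuming hypothetical per-scheme nr_tried
      stat readings (the names below are placeholders, not the test's actual
      variables):
      
      # Hedged sketch: the scheme with the smaller apply interval should have
      # been tried/applied more often than the one with the larger interval.
      def faster_scheme_applied_more(nr_tried_fast, nr_tried_slow):
          return nr_tried_fast > nr_tried_slow
      
      # Example: faster_scheme_applied_more(100, 10) -> True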
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-6-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ce7a2834
    • SeongJae Park's avatar
      selftests/damon: add a test for DAMOS quota · 51f58c9d
      SeongJae Park authored
      Add a selftest for verifying the DAMOS quota feature.  The test is very
      similar to sysfs_update_schemes_tried_regions_wss_estimation.py.  It
      starts an artificial workload with a 20 MiB working set and runs DAMON to
      find the working set size, but with a 1 MiB/100 ms size quota.  Then, it
      collects the DAMON-found working set size every 100 ms and checks if the
      quota was always applied as expected.  For the confirmation, the test
      shows the stat-applied region size and the qt_exceeds stat.
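      
      A minimal sketch of the expected bound, assuming hypothetical per-interval
      stat samples (the names below are placeholders, not the test's actual
      code):
      
      # Hedged sketch: with a 1 MiB / 100 ms size quota, the applied size
      # reported for each 100 ms interval should stay at or below the quota,
      # even though the real working set is 20 MiB.
      QUOTA_BYTES = 1 * 1024 * 1024  # 1 MiB allowed per 100 ms interval
      
      def quota_respected(applied_bytes_per_interval):
          # applied_bytes_per_interval: list of hypothetical per-interval samples
          return all(applied <= QUOTA_BYTES for applied in applied_bytes_per_interval)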
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-5-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      51f58c9d
    • SeongJae Park's avatar
      selftests/damon/_damon_sysfs: support DAMOS apply interval · a8622625
      SeongJae Park authored
      Update the test-purpose DAMON sysfs control Python module to support DAMOS
      apply interval.
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-4-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a8622625
    • SeongJae Park's avatar
      selftests/damon/_damon_sysfs: support DAMOS stats · a0f87454
      SeongJae Park authored
      Update the test-purpose DAMON sysfs control Python module to support DAMOS
      stats.
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-3-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a0f87454
    • SeongJae Park's avatar
      selftests/damon/_damon_sysfs: support DAMOS quota · faf4977e
      SeongJae Park authored
      Patch series "selftests/damon: add more tests for core functionalities and
      corner cases".
      
      Continue the DAMON selftests' test coverage improvement work with a
      trivial improvement of the test code itself.  The sequence of the patches
      in the patchset is as follows.
      
      The first five patches add two DAMON core functionality tests.  Those
      begin with three patches (patches 1-3) that update the test-purpose DAMON
      sysfs interface wrapper to support the DAMOS quota, stats, and apply
      interval features, respectively.  The fourth patch implements and adds a
      selftest for the DAMOS quota feature, using the DAMON sysfs interface
      wrapper's newly added support for the quota and stats features.  The
      fifth patch further implements and adds a selftest for the DAMOS apply
      interval, using the DAMON sysfs interface wrapper's newly added support
      for the apply interval and stats features.
      
      Two patches (patches 6 and 7) that implement and add two corner-case
      handling selftests follow.  Those try to prevent two previously fixed
      bugs from recurring.
      
      Finally, a patch follows that makes the DAMON debugfs selftests dependency
      checker use /proc/mounts instead of the hard-coded mount point assumption.
      
      
      This patch (of 8):
      
      Update the test-purpose DAMON sysfs control Python module to support DAMOS
      quota.
      
      Link: https://lkml.kernel.org/r/20240207203134.69976-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240207203134.69976-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      faf4977e
    • John Groves's avatar
      memremap.h: correct an error in a comment · 0c32c9f7
      John Groves authored
      The comment tried to send me off to memory_hotplug.h for an enum that is
      actually just a few lines above it...
      
      Link: https://lkml.kernel.org/r/dba0f5f01162d6fa16e4da2a9fede7f97080e92d.1707179960.git.john@groves.net
      Signed-off-by: John Groves <john@groves.net>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0c32c9f7
    • Mark-PK Tsai's avatar
      zram: use copy_page for full page copy · 80ba4caf
      Mark-PK Tsai authored
      Some architectures, such as arm, have implemented an optimized copy_page
      for full-page copying.
      
      Replace the full-page memcpy with copy_page to take advantage of the
      optimization.
      
      Link: https://lkml.kernel.org/r/20231007070554.8657-1-mark-pk.tsai@mediatek.com
      Signed-off-by: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: YJ Chiang <yj.chiang@mediatek.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      80ba4caf
    • Li Zhijian's avatar
      mm/demotion: print demotion targets · 601e793a
      Li Zhijian authored
      Currently, when a demotion occurs, it will prioritize selecting a node
      from the preferred targets as the destination node for the demotion.  If
      the preferred node does not meet the requirements, it will try all the
      lower memory tier nodes until it finds a suitable demotion destination
      node or ultimately fails.
      
      However, the demotion target information isn't exposed to users,
      especially the preferred target information, which depends on more factors.
      This makes it hard for users to understand the exact demotion behavior.
      
      Rather than adding a new sysfs interface to expose this information,
      print it directly to the kernel log, just like the current page
      allocation fallback order is printed.
      
      A dmesg example with this patch is as follows:
      [    0.704860] Demotion targets for Node 0: null
      [    0.705456] Demotion targets for Node 1: null
      // node 2 is onlined
      [   32.259775] Demotion targets for Node 0: preferred: 2, fallback: 2
      [   32.261290] Demotion targets for Node 1: preferred: 2, fallback: 2
      [   32.262726] Demotion targets for Node 2: null
      // node 3 is onlined
      [   42.448809] Demotion targets for Node 0: preferred: 2, fallback: 2-3
      [   42.450704] Demotion targets for Node 1: preferred: 2, fallback: 2-3
      [   42.452556] Demotion targets for Node 2: preferred: 3, fallback: 3
      [   42.454136] Demotion targets for Node 3: null
      // node 4 is onlined
      [   52.676833] Demotion targets for Node 0: preferred: 2, fallback: 2-4
      [   52.678735] Demotion targets for Node 1: preferred: 2, fallback: 2-4
      [   52.680493] Demotion targets for Node 2: preferred: 4, fallback: 3-4
      [   52.682154] Demotion targets for Node 3: null
      [   52.683405] Demotion targets for Node 4: null
      // node 5 is onlined
      [   62.931902] Demotion targets for Node 0: preferred: 2, fallback: 2-5
      [   62.938266] Demotion targets for Node 1: preferred: 5, fallback: 2-5
      [   62.943515] Demotion targets for Node 2: preferred: 4, fallback: 3-4
      [   62.947471] Demotion targets for Node 3: null
      [   62.949908] Demotion targets for Node 4: null
      [   62.952137] Demotion targets for Node 5: preferred: 3, fallback: 3-4
      
      This requirement was previously discussed [1].  The initial
      proposal involved introducing a new sysfs interface.  However, due to
      concerns about potential changes and compatibility issues with the
      interface in the future, a consensus was not reached with the community. 
      Therefore, this time, we are directly printing out the information.
      
      [1] https://lore.kernel.org/all/d1d5add8-8f4a-4578-8bf0-2cbe79b09989@fujitsu.com/
      
      Link: https://lkml.kernel.org/r/20240206020151.605516-1-lizhijian@fujitsu.com
      Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      601e793a