1. 22 Feb, 2024 40 commits
    • dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n · 6d439c18
      Mathieu Desnoyers authored
      Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
      CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
      never returns NULL.
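
      The error-pointer convention referred to above can be illustrated with a
      small userspace sketch. ERR_PTR(), PTR_ERR() and IS_ERR() below are
      simplified stand-ins for the kernel helpers, and alloc_dax_stub() and
      probe_dax() are hypothetical names used only for illustration; this is
      not the code from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace stand-ins for the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errors. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct dax_device { int id; };

/* Hypothetical stub modelling the CONFIG_DAX=n case: never returns NULL. */
static inline struct dax_device *alloc_dax_stub(void)
{
	return ERR_PTR(-EOPNOTSUPP);
}

/* A caller that only tests IS_ERR() handles both configurations. */
static inline int probe_dax(void)
{
	struct dax_device *dev = alloc_dax_stub();

	if (IS_ERR(dev))
		return (int)PTR_ERR(dev);
	return 0;
}
```

      With a NULL return, a caller testing only IS_ERR() would treat the
      failure as success, which is the inconsistency the change removes.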
      
      This is done in preparation for using cpu_dcache_is_aliasing() in a
      following change which will properly support architectures which detect
      data cache aliasing at runtime.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-3-mathieu.desnoyers@efficios.com
      Fixes: 4e4ced93 ("dax: Move mandatory ->zero_page_range() check in alloc_dax()")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dax: add empty static inline for CONFIG_DAX=n · 2807c54b
      Mathieu Desnoyers authored
      Patch series "Introduce cpu_dcache_is_aliasing() to fix DAX regression",
      v6.
      
      The following commit, introduced in v4.0, prevents building FS_DAX on
      32-bit ARM, even on ARMv7 which does not have virtually aliased data
      caches, even though it used to work fine before:

      commit d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      
      The root of the issue here is the fact that DAX was never designed to
      handle virtually aliasing data caches (VIVT and VIPT with aliasing data
      cache). It touches the pages through their linear mapping, which is not
      consistent with the userspace mappings with virtually aliasing data
      caches.
      
      This patch series introduces cpu_dcache_is_aliasing() with the new
      Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
      architectures. cpu_dcache_is_aliasing() either evaluates to a constant
      at compile time or performs a runtime check, which is what is needed
      on ARM.
      
      With this we can basically narrow down the list of architectures which
      are unsupported by DAX to those which are really affected.
      
      
      This patch (of 9):
      
      When building a kernel with CONFIG_DAX=n, all uses of set_dax_nocache()
      and set_dax_nomc() need to be either within regions of code or compile
      units which are explicitly not compiled, or they need to rely on compiler
      optimizations to eliminate calls to those undefined symbols.
      
      It appears that at least the openrisc and loongarch architectures don't
      end up eliminating those undefined symbols even if they are provably
      within code which is eliminated due to conditional branches depending on
      constants.
      
      Implement empty static inline functions for set_dax_nocache() and
      set_dax_nomc() in CONFIG_DAX=n to ensure those undefined references are
      removed.
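
      The stub pattern described above can be sketched as follows. This is a
      minimal illustration, not the actual header: CONFIG_DAX_SKETCH is a
      hypothetical stand-in for CONFIG_DAX, and struct dax_device and
      setup_target() are simplified placeholders.

```c
#include <stdbool.h>

struct dax_device { bool nocache; bool nomc; };

/* CONFIG_DAX_SKETCH is a hypothetical stand-in for CONFIG_DAX. */
#ifdef CONFIG_DAX_SKETCH
/* Real implementations would live in another compile unit. */
void set_dax_nocache(struct dax_device *dax_dev);
void set_dax_nomc(struct dax_device *dax_dev);
#else
/* Empty stubs: callers always link, whatever the optimizer does. */
static inline void set_dax_nocache(struct dax_device *dax_dev) { (void)dax_dev; }
static inline void set_dax_nomc(struct dax_device *dax_dev) { (void)dax_dev; }
#endif

/* Hypothetical caller: no reliance on dead-code elimination is needed. */
static inline int setup_target(struct dax_device *dax_dev, bool dax_enabled)
{
	if (dax_enabled) {
		set_dax_nocache(dax_dev);
		set_dax_nomc(dax_dev);
		return 1;
	}
	return 0;
}
```

      Without the stubs, the calls inside the branch would reference undefined
      symbols whenever the compiler fails to eliminate the branch.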
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-1-mathieu.desnoyers@efficios.com
      Link: https://lkml.kernel.org/r/20240215144633.96437-2-mathieu.desnoyers@efficios.com
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202402140037.wGfA1kqX-lkp@intel.com/
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202402131351.a0FZOgEG-lkp@intel.com/
      Fixes: 7ac5360c ("dax: remove the copy_from_iter and copy_to_iter methods")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nvdimm/pmem: fix leak on dax_add_host() failure · f6932a27
      Mathieu Desnoyers authored
      Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
      before setting pmem->dax_dev, which therefore issues the two following
      calls on NULL pointers:
      
      out_cleanup_dax:
              kill_dax(pmem->dax_dev);
              put_dax(pmem->dax_dev);
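
      The failure mode can be reproduced in a userspace sketch. All fake_*
      helpers, struct pmem_device and the live_dax_devices counter are
      hypothetical; the sketch assumes, as in the message above, that the
      cleanup calls do nothing when handed a NULL pointer. Assigning the
      pointer before the failing step is shown as one possible ordering that
      avoids the leak, not necessarily the exact change made by the patch.

```c
#include <stddef.h>
#include <stdlib.h>

struct dax_device { int alive; };
struct pmem_device { struct dax_device *dax_dev; };

static int live_dax_devices; /* allocated but not yet freed */

static struct dax_device *fake_alloc_dax(void)
{
	struct dax_device *d = calloc(1, sizeof(*d));

	if (d) {
		d->alive = 1;
		live_dax_devices++;
	}
	return d;
}

/* Both cleanup helpers are assumed to be no-ops on NULL. */
static void fake_kill_dax(struct dax_device *d) { if (d) d->alive = 0; }
static void fake_put_dax(struct dax_device *d)
{
	if (!d)
		return;
	free(d);
	live_dax_devices--;
}

static int fake_attach_disk(struct pmem_device *pmem, int add_host_fails,
			    int assign_before_add)
{
	struct dax_device *dax_dev = fake_alloc_dax();

	if (!dax_dev)
		return -1;
	if (assign_before_add)
		pmem->dax_dev = dax_dev;	/* cleanup can now see it */
	if (add_host_fails)
		goto out_cleanup_dax;
	pmem->dax_dev = dax_dev;
	return 0;

out_cleanup_dax:
	fake_kill_dax(pmem->dax_dev);	/* NULL if assigned too late */
	fake_put_dax(pmem->dax_dev);
	pmem->dax_dev = NULL;
	return -1;
}
```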
      
      Link: https://lkml.kernel.org/r/20240208184913.484340-1-mathieu.desnoyers@efficios.com
      Link: https://lkml.kernel.org/r/20240208184913.484340-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Fan Ni <fan.ni@samsung.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: automatically fold contpte mappings · f0c22649
      Ryan Roberts authored
      There are situations where a change to a single PTE could cause the
      contpte block in which it resides to become foldable (i.e. it could be
      repainted with the contiguous bit).  Such situations arise, for example,
      when user space temporarily changes protections, via mprotect, for
      individual pages, as can be the case for certain garbage collectors.
      
      We would like to detect when such a PTE change occurs.  However this can
      be expensive due to the amount of checking required.  Therefore only
      perform the checks when an individual PTE is modified via mprotect
      (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
      when we are setting the final PTE in a contpte-aligned block.
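
      The cheap pre-check described above can be sketched as below. The
      constants assume 4K pages with 16 PTEs per contpte block, and
      is_last_pte_in_block() and should_try_fold() are hypothetical names; the
      full eligibility checks that follow this filter are not shown.

```c
#include <stdbool.h>

#define PAGE_SHIFT 12	/* assumption: 4K pages */
#define CONT_PTES  16	/* assumption: 16 PTEs per contpte block */

/* True when addr maps through the final PTE of a contpte-aligned block. */
static inline bool is_last_pte_in_block(unsigned long addr)
{
	const unsigned long contmask = CONT_PTES - 1;

	return ((addr >> PAGE_SHIFT) & contmask) == contmask;
}

/* Only single-PTE updates that hit the final slot pay for further checks. */
static inline bool should_try_fold(unsigned long addr, unsigned int nr)
{
	return nr == 1 && is_last_pte_in_block(addr);
}
```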
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-19-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: __always_inline to improve fork() perf · b972fc6a
      Ryan Roberts authored
      As set_ptes() and wrprotect_ptes() become a bit more complex, the compiler
      may choose not to inline them.  But this is critical for fork()
      performance.  So mark the functions, along with contpte_try_unfold() which
      is called by them, as __always_inline.  This is worth ~1% on the fork()
      microbenchmark with order-0 folios (the common case).
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-18-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: implement pte_batch_hint() · fb5451e5
      Ryan Roberts authored
      When core code iterates over a range of ptes and calls ptep_get() for each
      of them, if the range happens to cover contpte mappings, the number of pte
      reads becomes amplified by a factor of the number of PTEs in a contpte
      block.  This is because for each call to ptep_get(), the implementation
      must read all of the ptes in the contpte block to which it belongs to
      gather the access and dirty bits.
      
      This causes a hotspot for fork(), as well as operations that unmap memory
      such as munmap(), exit and madvise(MADV_DONTNEED).  Fortunately we can fix
      this by implementing pte_batch_hint() which allows their iterators to skip
      getting the contpte tail ptes when gathering the batch of ptes to operate
      on.  This results in the number of PTE reads returning to 1 per pte.
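
      A hint of this kind can be sketched as follows. This is a simplified
      userspace model rather than the arm64 implementation: the PTE is a plain
      64-bit value, the bit positions and the 16-entry block size are
      assumptions for illustration, and batch_hint() takes the PTE's index in
      its page table instead of a pointer.

```c
#include <stdint.h>

typedef uint64_t pte_t;

#define CONT_PTES 16		/* assumption: 16 PTEs per contpte block */
#define PTE_VALID (1ULL << 0)	/* illustrative bit positions */
#define PTE_CONT  (1ULL << 52)

/*
 * Number of PTEs, starting at table index idx, that the caller may treat
 * as one batch without reading each of them individually.
 */
static inline unsigned int batch_hint(unsigned long idx, pte_t pte)
{
	const pte_t valid_cont = PTE_VALID | PTE_CONT;

	if ((pte & valid_cont) != valid_cont)
		return 1;	/* not part of a contpte block */

	/* Remaining entries up to the end of the current block. */
	return CONT_PTES - (unsigned int)(idx & (CONT_PTES - 1));
}
```

      An iterator that advances by the returned count reads the head PTE of a
      block once and skips the tail PTEs.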
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-17-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add pte_batch_hint() to reduce scanning in folio_pte_batch() · c6ec76a2
      Ryan Roberts authored
      Some architectures (e.g. arm64) can tell from looking at a pte if some
      follow-on ptes also map contiguous physical memory with the same pgprot
      (for arm64, these are contpte mappings).
      
      Take advantage of this knowledge to optimize folio_pte_batch() so that it
      can skip these ptes when scanning to create a batch.  By default, if an
      arch does not opt in, pte_batch_hint() returns a compile-time 1, so the
      changes are optimized out and the behaviour is as before.
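
      The scanning loop with a default hint can be sketched as below. This is
      a deliberately reduced model: a PTE is represented only by its page
      frame number, count_batch() is a hypothetical stand-in for the batching
      logic, and flag handling is omitted.

```c
#include <stdint.h>

typedef uint64_t pte_t;	/* simplified: the value is just the pfn */

/* Default for architectures that do not opt in: a compile-time 1. */
#ifndef pte_batch_hint
static inline unsigned int pte_batch_hint(const pte_t *ptep, pte_t pte)
{
	(void)ptep;
	(void)pte;
	return 1;
}
#endif

/*
 * Count how many entries starting at ptes[0] map consecutive pfns,
 * capped at max_nr. Entries covered by a hint are skipped, not read.
 */
static inline unsigned int count_batch(const pte_t *ptes, unsigned int max_nr)
{
	unsigned int nr = pte_batch_hint(ptes, ptes[0]);

	while (nr < max_nr) {
		pte_t pte = ptes[nr];

		if (pte != ptes[0] + nr)
			break;
		nr += pte_batch_hint(&ptes[nr], pte);
	}
	return nr < max_nr ? nr : max_nr;
}
```

      With the default hint the loop steps one entry at a time, so the
      behaviour matches a plain per-PTE scan.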
      
      arm64 will opt-in to providing this hint in the next patch, which will
      greatly reduce the cost of ptep_get() when scanning a range of contptes.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-16-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: implement new [get_and_]clear_full_ptes() batch APIs · 6b1e4efb
      Ryan Roberts authored
      Optimize the contpte implementation to fix some of the
      exit/munmap/dontneed performance regression introduced by the initial
      contpte commit.  Subsequent patches will solve it entirely.
      
      During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
      cleared.  Previously this was done 1 PTE at a time.  But the core-mm
      supports batched clear via the new [get_and_]clear_full_ptes() APIs.  So
      let's implement those APIs and for fully covered contpte mappings, we no
      longer need to unfold the contpte.  This significantly reduces unfolding
      operations, reducing the number of tlbis that must be issued.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-15-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: implement new wrprotect_ptes() batch API · 311a6cf2
      Ryan Roberts authored
      Optimize the contpte implementation to fix some of the fork performance
      regression introduced by the initial contpte commit.  Subsequent patches
      will solve it entirely.
      
      During fork(), any private memory in the parent must be write-protected. 
      Previously this was done 1 PTE at a time.  But the core-mm supports
      batched wrprotect via the new wrprotect_ptes() API.  So let's implement
      that API and for fully covered contpte mappings, we no longer need to
      unfold the contpte.  This has 2 benefits:
      
        - Reduced unfolding, which reduces the number of tlbis that must be issued.
        - The memory remains contpte-mapped ("folded") in the parent, so it
          continues to benefit from the more efficient use of the TLB after
          the fork.
      
      The optimization to wrprotect a whole contpte block without unfolding is
      possible thanks to the tightening of the Arm ARM with respect to the
      definition and behaviour when 'Misprogramming the Contiguous bit'.  See
      section D21194 at https://developer.arm.com/documentation/102105/ja-07/
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-14-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: wire up PTE_CONT for user mappings · 4602e575
      Ryan Roberts authored
      With the ptep API sufficiently refactored, we can now introduce a new
      "contpte" API layer, which transparently manages the PTE_CONT bit for user
      mappings.
      
      In this initial implementation, only suitable batches of PTEs, set via
      set_ptes(), are mapped with the PTE_CONT bit.  Any subsequent modification
      of individual PTEs will cause an "unfold" operation to repaint the contpte
      block as individual PTEs before performing the requested operation. 
      While a modification of a single PTE could cause the block of PTEs to
      which it belongs to become eligible for "folding" into a contpte entry,
      "folding" is not performed in this initial implementation due to the cost
      of checking that the requirements are met.  Due to this, contpte mappings will
      degrade back to normal pte mappings over time if/when protections are
      changed.  This will be solved in a future patch.
      
      Since a contpte block only has a single access and dirty bit, the semantic
      here changes slightly; when getting a pte (e.g.  ptep_get()) that is part
      of a contpte mapping, the access and dirty information are pulled from the
      block (so all ptes in the block return the same access/dirty info).  When
      changing the access/dirty info on a pte (e.g.  ptep_set_access_flags())
      that is part of a contpte mapping, this change will affect the whole
      contpte block.  This works fine in practice since we guarantee that
      only a single folio is mapped by a contpte block, and the core-mm tracks
      access/dirty information per folio.
      
      In order for the public functions, which used to be pure inline, to
      continue to be callable by modules, export all the contpte_* symbols that
      are now called by those public inline functions.
      
      The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
      at build time.  It defaults to enabled as long as its dependency,
      TRANSPARENT_HUGEPAGE is also enabled.  The core-mm depends upon
      TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
      enabled, then there is no chance of meeting the physical contiguity
      requirement for contpte mappings.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-13-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: split __flush_tlb_range() to elide trailing DSB · d9d8dc2b
      Ryan Roberts authored
      Split __flush_tlb_range() into __flush_tlb_range_nosync() +
      __flush_tlb_range(), in the same way as the existing flush_tlb_page()
      arrangement.  This allows calling __flush_tlb_range_nosync() to elide the
      trailing DSB.  Forthcoming "contpte" code will take advantage of this when
      clearing the young bit from a contiguous range of ptes.
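
      The split can be modelled in userspace as below. The counters and the
      flush_range*() names are hypothetical stand-ins that only record how
      many invalidations and barriers would be issued; no real TLB maintenance
      is performed, and a 4K page size is assumed.

```c
#define PAGE_SIZE 4096UL	/* assumption: 4K pages */

static int tlbi_issued;	/* per-page invalidations requested */
static int dsb_issued;		/* trailing barriers issued */

/* Issue the invalidations only; the caller decides when to synchronise. */
static void flush_range_nosync(unsigned long start, unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PAGE_SIZE)
		tlbi_issued++;
}

/* Existing behaviour: invalidate, then wait for completion. */
static void flush_range(unsigned long start, unsigned long end)
{
	flush_range_nosync(start, end);
	dsb_issued++;
}
```

      A caller that batches several ranges can use the nosync variant for each
      of them and pay for a single barrier at the end.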
      
      Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs() has
      changed, but now aligns with the ordering of __flush_tlb_page().  It has
      been discussed that __flush_tlb_page() may be wrong though.  Regardless,
      both will be resolved separately if needed.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-12-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: new ptep layer to manage contig bit · 5a00bfd6
      Ryan Roberts authored
      Create a new layer for the in-table PTE manipulation APIs.  For now, the
      existing API is prefixed with double underscore to become the arch-private
      API and the public API is just a simple wrapper that calls the private
      API.
      
      The public API implementation will subsequently be used to transparently
      manipulate the contiguous bit where appropriate.  But since there are
      already some contig-aware users (e.g.  hugetlb, kernel mapper), we must
      first ensure those users use the private API directly so that the future
      contig-bit manipulations in the public API do not interfere with those
      existing uses.
      
      The following APIs are treated this way:
      
       - ptep_get
       - set_pte
       - set_ptes
       - pte_clear
       - ptep_get_and_clear
       - ptep_test_and_clear_young
       - ptep_clear_flush_young
       - ptep_set_wrprotect
       - ptep_set_access_flags
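
      The layering can be sketched for two of the listed APIs as below. The
      signatures are simplified assumptions (the real helpers take additional
      arguments), and the PTE is modelled as a plain 64-bit value. The
      double-underscore names mirror the naming described above; such
      identifiers are reserved in ordinary userspace C and are used here only
      for illustration.

```c
#include <stdint.h>

typedef uint64_t pte_t;

/* Arch-private layer: operates on exactly the entry it is given. */
static inline pte_t __ptep_get(const pte_t *ptep)
{
	return *(const volatile pte_t *)ptep;
}

static inline pte_t __ptep_get_and_clear(pte_t *ptep)
{
	pte_t old = *ptep;

	*ptep = 0;
	return old;
}

/*
 * Public layer: for now a plain pass-through. Contig-bit handling could
 * later be added here without touching callers of the private layer.
 */
static inline pte_t ptep_get(const pte_t *ptep)
{
	return __ptep_get(ptep);
}

static inline pte_t ptep_get_and_clear(pte_t *ptep)
{
	return __ptep_get_and_clear(ptep);
}
```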
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-11-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: convert ptep_clear() to ptep_get_and_clear() · cbb0294f
      Ryan Roberts authored
      ptep_clear() is a generic wrapper around the arch-implemented
      ptep_get_and_clear().  We are about to convert ptep_get_and_clear() into a
      public version and private version (__ptep_get_and_clear()) to support the
      transparent contpte work.  We won't have a private version of ptep_clear()
      so let's convert it to directly call ptep_get_and_clear().
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-10-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: convert set_pte_at() to set_ptes(..., 1) · 659e1930
      Ryan Roberts authored
      Since set_ptes() was introduced, set_pte_at() has been implemented as a
      generic macro around set_ptes(..., 1).  So this change should continue to
      generate the same code.  However, making this change prepares us for the
      transparent contpte support.  It means we can reroute set_ptes() to
      __set_ptes().  Since set_pte_at() is a generic macro, there will be no
      equivalent __set_pte_at() to reroute to.
      
      Note that a couple of calls to set_pte_at() remain in the arch code.  This
      is intentional, since those call sites are acting on behalf of core-mm and
      should continue to call into the public set_ptes() rather than the
      arch-private __set_ptes().
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-9-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep) · 53273655
      Ryan Roberts authored
      There are a number of places in the arch code that read a pte by using the
      READ_ONCE() macro.  Refactor these call sites to instead use the
      ptep_get() helper, which itself is a READ_ONCE().  Generated code should
      be the same.
      
      This will benefit us when we shortly introduce the transparent contpte
      support.  In this case, ptep_get() will become more complex so we now have
      all the code abstracted through it.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-8-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53273655
    • Ryan Roberts's avatar
      mm: tidy up pte_next_pfn() definition · fb23bf6b
      Ryan Roberts authored
      Now that all the architecture overrides of pte_next_pfn() have been
      replaced with pte_advance_pfn(), we can simplify the definition of the
      generic pte_next_pfn() macro so that it is unconditionally defined.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-7-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb23bf6b
    • Ryan Roberts's avatar
      x86/mm: convert pte_next_pfn() to pte_advance_pfn() · 506b5867
      Ryan Roberts authored
      Core-mm needs to be able to advance the pfn by an arbitrary amount, so
      override the new pte_advance_pfn() API to do so.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-6-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      506b5867
    • Ryan Roberts's avatar
      arm64/mm: convert pte_next_pfn() to pte_advance_pfn() · c1bd2b40
      Ryan Roberts authored
      Core-mm needs to be able to advance the pfn by an arbitrary amount, so
      override the new pte_advance_pfn() API to do so.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-5-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c1bd2b40
    • Ryan Roberts's avatar
      mm: introduce pte_advance_pfn() and use for pte_next_pfn() · 583ceaaa
      Ryan Roberts authored
      The goal is to be able to advance a PTE by an arbitrary number of PFNs. 
      So introduce a new API that takes a nr param.  Define the default
      implementation here and allow for architectures to override. 
      pte_next_pfn() becomes a wrapper around pte_advance_pfn().
      
      Follow-up commits will convert each overriding architecture's
      pte_next_pfn() to pte_advance_pfn().
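The relationship between the two helpers can be sketched in user space as follows. This is only an illustration under stated assumptions: the toy `pte_t` layout, `TOY_PFN_SHIFT`, `toy_mk_pte()` and `toy_pte_pfn()` are hypothetical and not taken from the patch; real PTE layouts are architecture-specific.

```c
#include <assert.h>
#include <stdint.h>

/* Toy PTE: low 12 bits hold attributes, the rest holds the PFN (assumed layout). */
typedef struct { uint64_t val; } pte_t;

#define TOY_PFN_SHIFT 12
#define TOY_ATTR_MASK ((1ULL << TOY_PFN_SHIFT) - 1)

/* Hypothetical helper: build a toy PTE from a PFN and attribute bits. */
static inline pte_t toy_mk_pte(uint64_t pfn, uint64_t attrs)
{
	return (pte_t){ (pfn << TOY_PFN_SHIFT) | (attrs & TOY_ATTR_MASK) };
}

/* Hypothetical helper: extract the PFN from a toy PTE. */
static inline uint64_t toy_pte_pfn(pte_t pte)
{
	return pte.val >> TOY_PFN_SHIFT;
}

/* Default implementation: advance the PFN by nr, leaving attribute bits alone. */
static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
	return (pte_t){ pte.val + ((uint64_t)nr << TOY_PFN_SHIFT) };
}

/* pte_next_pfn() is now just the nr == 1 case of pte_advance_pfn(). */
#define pte_next_pfn(pte) pte_advance_pfn(pte, 1)
```

An architecture whose PFN field is not a simple shifted integer would supply its own pte_advance_pfn(), and the generic pte_next_pfn() wrapper would pick it up unchanged.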
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-4-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      583ceaaa
    • Ryan Roberts's avatar
      mm: thp: batch-collapse PMD with set_ptes() · 2bdba986
      Ryan Roberts authored
      Refactor __split_huge_pmd_locked() so that a present PMD can be collapsed
      to PTEs in a single batch using set_ptes().
      
      This should improve performance a little bit, but the real motivation is
      to remove the need for the arm64 backend to have to fold the contpte
      entries.  Instead, since the ptes are set as a batch, the contpte blocks
      can be initially set up pre-folded (once the arm64 contpte support is
      added in the next few patches).  This leads to noticeable performance
      improvement during split.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-3-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2bdba986
    • Ryan Roberts's avatar
      mm: clarify the spec for set_ptes() · 6280d731
      Ryan Roberts authored
      Patch series "Transparent Contiguous PTEs for User Mappings", v6.
      
      This is a series to opportunistically and transparently use contpte
      mappings (set the contiguous bit in ptes) for user memory when those
      mappings meet the requirements.  The change benefits arm64, but there is
      some (very) minor refactoring for x86 to enable its integration with
      core-mm.
      
      It is part of a wider effort to improve performance by allocating and
      mapping variable-sized blocks of memory (folios).  One aim is for the 4K
      kernel to approach the performance of the 16K kernel, but without breaking
      compatibility and without the associated increase in memory.  Another aim
      is to benefit the 16K and 64K kernels by enabling 2M THP, since this is
      the contpte size for those kernels.  We have good performance data that
      demonstrates both aims are being met (see below).
      
      Of course this is only one half of the change.  We require the mapped
      physical memory to be the correct size and alignment for this to actually
      be useful (i.e.  64K for 4K pages, or 2M for 16K/64K pages).  Fortunately
      folios are solving this problem for us.  Filesystems that support it (XFS,
      AFS, EROFS, tmpfs, ...) will allocate large folios up to the PMD size
      today, and more filesystems are coming.  And for anonymous memory,
      "multi-size THP" is now upstream.
      
      
      Patch Layout
      ============
      
      In this version, I've split the patches to better show each optimization:
      
        - 1-2:    mm prep: misc code and docs cleanups
        - 3-6:    mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
                  generic wrapper around it
        - 7-11:   arm64 prep: Refactor ptep helpers into new layer
        - 12:     functional contpte implementation
        - 13-18:  various optimizations on top of the contpte implementation
      
      
      Testing
      =======
      
      I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
        - mm selftests (inc new tests written for multi-size THP); no regressions
        - Speedometer JavaScript benchmark in Chromium web browser; no issues
        - Kernel compilation; no issues
        - Various tests under high memory pressure with swap enabled; no issues
      
      
      Performance
      ===========
      
      High Level Use Cases
      ~~~~~~~~~~~~~~~~~~~~
      
      First some high level use cases (kernel compilation and speedometer JavaScript
      benchmarks). These are running on Ampere Altra (I've seen similar improvements
      on Android/Pixel 6).
      
      baseline:                  mm-unstable (mTHP switched off)
      mTHP:                      + enable 16K, 32K, 64K mTHP sizes "always"
      mTHP + contpte:            + this series
      mTHP + contpte + exefolio: + patch at [6], which this series supports
      
      Kernel Compilation with -j8 (negative is faster):
      
      | kernel                    | real-time | kern-time | user-time |
      |---------------------------|-----------|-----------|-----------|
      | baseline                  |      0.0% |      0.0% |      0.0% |
      | mTHP                      |     -5.0% |    -39.1% |     -0.7% |
      | mTHP + contpte            |     -6.0% |    -41.4% |     -1.5% |
      | mTHP + contpte + exefolio |     -7.8% |    -43.1% |     -3.4% |
      
      Kernel Compilation with -j80 (negative is faster):
      
      | kernel                    | real-time | kern-time | user-time |
      |---------------------------|-----------|-----------|-----------|
      | baseline                  |      0.0% |      0.0% |      0.0% |
      | mTHP                      |     -5.0% |    -36.6% |     -0.6% |
      | mTHP + contpte            |     -6.1% |    -38.2% |     -1.6% |
      | mTHP + contpte + exefolio |     -7.4% |    -39.2% |     -3.2% |
      
      Speedometer (positive is faster):
      
      | kernel                    | runs_per_min |
      |:--------------------------|--------------|
      | baseline                  |         0.0% |
      | mTHP                      |         1.5% |
      | mTHP + contpte            |         3.2% |
      | mTHP + contpte + exefolio |         4.5% |
      
      
      Micro Benchmarks
      ~~~~~~~~~~~~~~~~
      
      The following microbenchmarks are intended to demonstrate that the performance
      of fork() and munmap() does not regress. I'm showing results for order-0 (4K)
      mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David for sharing his
      benchmarks.
      
      baseline:                  mm-unstable + batch zap [7] series
      contpte-basic:             + patches 0-19; functional contpte implementation
      contpte-batch:             + patches 20-23; implement new batched APIs
      contpte-inline:            + patch 24; __always_inline to help compiler
      contpte-fold:              + patch 25; fold contpte mapping when sensible
      
      Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
      (on top of macOS) for reference, although experience suggests this might not be
      the most reliable for performance numbers of this sort:
      
      | FORK           |         order-0        |         order-9        |
      | Ampere Altra   |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      2.7% |       0.0% |      0.2% |
      | contpte-basic  |       6.3% |      1.4% |    1948.7% |      0.2% |
      | contpte-batch  |       7.6% |      2.0% |      -1.9% |      0.4% |
      | contpte-inline |       3.6% |      1.5% |      -1.0% |      0.2% |
      | contpte-fold   |       4.6% |      2.1% |      -1.8% |      0.2% |
      
      | MUNMAP         |         order-0        |         order-9        |
      | Ampere Altra   |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      0.5% |       0.0% |      0.3% |
      | contpte-basic  |       1.8% |      0.3% |    1104.8% |      0.1% |
      | contpte-batch  |      -0.3% |      0.4% |       2.7% |      0.1% |
      | contpte-inline |      -0.1% |      0.6% |       0.9% |      0.1% |
      | contpte-fold   |       0.1% |      0.6% |       0.8% |      0.1% |
      
      | FORK           |         order-0        |         order-9        |
      | Apple M2 VM    |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      1.4% |       0.0% |      0.8% |
      | contpte-basic  |       6.8% |      1.2% |     469.4% |      1.4% |
      | contpte-batch  |      -7.7% |      2.0% |      -8.9% |      0.7% |
      | contpte-inline |      -6.0% |      2.1% |      -6.0% |      2.0% |
      | contpte-fold   |       5.9% |      1.4% |      -6.4% |      1.4% |
      
      | MUNMAP         |         order-0        |         order-9        |
      | Apple M2 VM    |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      0.6% |       0.0% |      0.4% |
      | contpte-basic  |       1.6% |      0.6% |     233.6% |      0.7% |
      | contpte-batch  |       1.9% |      0.3% |      -3.9% |      0.4% |
      | contpte-inline |       2.2% |      0.8% |      -1.6% |      0.9% |
      | contpte-fold   |       1.5% |      0.7% |      -1.7% |      0.7% |
      
      Misc
      ~~~~
      
      John Hubbard at Nvidia has indicated dramatic 10x performance improvements
      for some workloads at [8], when using 64K base page kernel.
      
      [1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
      [2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@arm.com/
      [3] https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@arm.com/
      [4] https://lore.kernel.org/lkml/20231218105100.172635-1-ryan.roberts@arm.com/
      [5] https://lore.kernel.org/linux-mm/633af0a7-0823-424f-b6ef-374d99483f05@arm.com/
      [6] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647dc3@arm.com/
      [7] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
      [8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
      [9] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v6
      
      
      
      
      This patch (of 18):
      
      set_ptes() spec implies that it can only be used to set a present pte
      because it interprets the PFN field to increment it.  However,
      set_pte_at() has been implemented on top of set_ptes() since set_ptes()
      was introduced, and set_pte_at() allows setting a pte to a not-present
      state.  So clarify the spec to state that when nr==1, new state of pte may
      be present or not present.  When nr>1, new state of all ptes must be
      present.
      
      While we are at it, tighten the spec to set requirements around the
      initial state of ptes; when nr==1 it may be either present or not-present.
      But when nr>1 all ptes must initially be not-present.  All set_ptes()
      callsites already conform to this requirement.  Stating it explicitly is
      useful because it allows for a simplification to the upcoming arm64
      contpte implementation.
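The clarified rules can be modelled with a small user-space sketch. Everything here is an assumption made for illustration: the toy `pte_t`, `TOY_PTE_PRESENT`, `TOY_PFN_SHIFT` and `toy_set_ptes()` are hypothetical names, and returning `false` on a spec violation is a device of this sketch only (the real set_ptes() returns nothing and simply expects callers to conform).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE: bit 0 is "present", PFN starts at bit 12 (assumed layout). */
typedef struct { uint64_t val; } pte_t;

#define TOY_PTE_PRESENT 0x1ULL
#define TOY_PFN_SHIFT   12

static inline bool toy_pte_present(pte_t pte)
{
	return (pte.val & TOY_PTE_PRESENT) != 0;
}

/*
 * Model of the clarified spec:
 *  - nr == 1: old and new state may each be present or not-present.
 *  - nr  > 1: all old entries must be not-present, and the new entries
 *             must be present (the PFN is incremented per entry).
 * Returns false instead of writing anything if the rules are violated.
 */
static bool toy_set_ptes(pte_t *ptep, pte_t pte, unsigned int nr)
{
	unsigned int i;

	if (nr == 0)
		return false;

	if (nr > 1) {
		if (!toy_pte_present(pte))
			return false;
		for (i = 0; i < nr; i++)
			if (toy_pte_present(ptep[i]))
				return false;
	}

	for (i = 0; i < nr; i++) {
		ptep[i] = pte;
		pte.val += 1ULL << TOY_PFN_SHIFT;	/* next entry maps the next PFN */
	}
	return true;
}
```

In this model a single-entry call behaves like set_pte_at(), including writing a not-present value, while a batched call is only accepted when it installs present entries over an entirely not-present range.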
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20240215103205.2607016-2-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6280d731
    • David Hildenbrand's avatar
      mm/memory: optimize unmap/zap with PTE-mapped THP · 10ebac4f
      David Hildenbrand authored
      Similar to how we optimized fork(), let's implement PTE batching when
      consecutive (present) PTEs map consecutive pages of the same large folio.
      
      Most infrastructure we need for batching (mmu gather, rmap) is already
      there.  We only have to add get_and_clear_full_ptes() and
      clear_full_ptes().  Similarly, extend zap_install_uffd_wp_if_needed() to
      process a PTE range.
      
      We won't bother sanity-checking the mapcount of all subpages, but only
      check the mapcount of the first subpage we process.  If there is a real
      problem hiding somewhere, we can trigger it simply by using small folios,
      or when we zap single pages of a large folio.  Ideally, we would have that check
      in rmap code (including for delayed rmap), but then we cannot print the
      PTE.  Let's keep it simple for now.  If we ever have a cheap
      folio_mapcount(), we might just want to check for underflows there.
      
      To keep small folios as fast as possible, force inlining of a specialized
      variant using __always_inline with nr=1.
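The batching condition ("consecutive present PTEs map consecutive pages of the same large folio") can be sketched as a user-space scan. This is a simplified model, not the kernel's implementation: the toy `pte_t` layout and the name `toy_pte_batch()` are hypothetical, and the folio is represented only by an assumed starting PFN and page count.

```c
#include <assert.h>
#include <stdint.h>

/* Toy PTE: bit 0 is "present", PFN starts at bit 12 (assumed layout). */
typedef struct { uint64_t val; } pte_t;

#define TOY_PTE_PRESENT 0x1ULL
#define TOY_PFN_SHIFT   12

static inline int toy_pte_present(pte_t pte)
{
	return (pte.val & TOY_PTE_PRESENT) != 0;
}

static inline uint64_t toy_pte_pfn(pte_t pte)
{
	return pte.val >> TOY_PFN_SHIFT;
}

/*
 * Count how many PTEs starting at ptep[0] can be zapped as one batch:
 * each must be present, map the PFN directly after its predecessor,
 * and stay inside the folio [folio_pfn, folio_pfn + folio_nr_pages).
 */
static unsigned int toy_pte_batch(const pte_t *ptep, unsigned int max_nr,
				  uint64_t folio_pfn, unsigned int folio_nr_pages)
{
	uint64_t first_pfn, folio_end;
	unsigned int nr;

	if (max_nr == 0 || !toy_pte_present(ptep[0]))
		return 0;

	first_pfn = toy_pte_pfn(ptep[0]);
	folio_end = folio_pfn + folio_nr_pages;
	if (first_pfn < folio_pfn || first_pfn >= folio_end)
		return 0;

	for (nr = 1; nr < max_nr; nr++) {
		if (!toy_pte_present(ptep[nr]))
			break;
		if (toy_pte_pfn(ptep[nr]) != first_pfn + nr)
			break;
		if (first_pfn + nr >= folio_end)
			break;
	}
	return nr;
}
```

A small folio always yields a batch of one, which is the case the commit keeps fast by force-inlining an nr=1 variant.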
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-11-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      10ebac4f
    • David Hildenbrand's avatar
      mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing · e61abd44
      David Hildenbrand authored
      In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now
      up to 256 folio fragments that span more than one page, before we
      conditionally reschedule.
      
      It's a pain that we have to handle cond_resched() in
      tlb_batch_pages_flush() manually and cannot simply handle it in
      release_pages() -- release_pages() can be called from atomic context. 
      Well, in a perfect world we wouldn't have to make our code more
      complicated at all.
      
      With page poisoning and init_on_free, we might now run into soft lockups
      when we free a lot of rather large folio fragments, because page freeing
      time then depends on the actual memory size we are freeing instead of on
      the number of folios that are involved.
      
      In the absolute (unlikely) worst case, on arm64 with 64k we will be able
      to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
      GiB does sound like it might take a while.  But instead of ignoring this
      unlikely case, let's just handle it.
      
      So, let's teach tlb_batch_pages_flush() that there are some configurations
      where page freeing is horribly slow, and let's reschedule more frequently
      -- similar to what we did before we had large folio fragments in
      there.  Avoid yet another loop over all encoded pages in the common case
      by handling that separately.
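One way to picture "reschedule more frequently" is to split a gathered batch into chunks by page count rather than by entry count. The sketch below is a user-space model under assumptions: `toy_count_chunks()` and its `budget` parameter are hypothetical, each chunk stands in for one "free these pages, then cond_resched()" step, and the real code operates on encoded page arrays rather than a plain array of sizes.

```c
#include <assert.h>

/*
 * frag_pages[i] is the number of pages spanned by gathered entry i
 * (1 for an order-0 page, more for a large folio fragment).
 * Entries are consumed until the running page count reaches the budget,
 * so a chunk always contains at least one entry, even when that single
 * entry is larger than the budget.
 * Returns the number of chunks, i.e. the number of reschedule points.
 */
static unsigned int toy_count_chunks(const unsigned long *frag_pages,
				     unsigned int nr_frags,
				     unsigned long budget)
{
	unsigned int chunks = 0;
	unsigned int i = 0;

	while (i < nr_frags) {
		unsigned long pages = 0;

		do {
			pages += frag_pages[i++];
		} while (i < nr_frags && pages < budget);

		chunks++;	/* stands in for: free the chunk, cond_resched() */
	}
	return chunks;
}
```

With a budget of 512 pages, a batch of order-0 pages is chunked as before, while a batch of very large fragments degenerates to one fragment per chunk.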
      
      Note that with page poisoning/zeroing, we might now end up freeing only a
      single folio fragment at a time that might exceed the old 512 pages limit:
      but if we cannot even free a single MAX_ORDER page on a system without
      running into soft lockups, something else is already completely bogus. 
      Freeing a PMD-mapped THP would similarly cause trouble.
      
      In theory, we might even free 511 order-0 pages + a single MAX_ORDER page,
      effectively having to zero out 8703 pages on arm64 with 64k, translating
      to ~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups,
      544 MiB is unlikely to result in soft lockups, so we won't care about that
      for the time being.
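The figures quoted above can be re-derived with a few lines of arithmetic. The function and macro names below are made up for this check; the inputs (64 KiB base pages, 512 MiB maximum fragment, 256 fragments, 511 order-0 pages) are the ones stated in the text.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)
#define GIB (1024ULL * MIB)

#define TOY_PAGE_SIZE      (64 * KIB)			/* arm64 with 64k pages */
#define TOY_FRAGMENT_BYTES (512 * MIB)			/* largest folio fragment */
#define TOY_FRAGMENT_PAGES (TOY_FRAGMENT_BYTES / TOY_PAGE_SIZE)

/* 256 fragments of 512 MiB each. */
static uint64_t worst_case_batch_bytes(void)
{
	return 256 * TOY_FRAGMENT_BYTES;
}

/* 511 order-0 pages plus one maximum-size fragment. */
static uint64_t mixed_case_pages(void)
{
	return 511 + TOY_FRAGMENT_PAGES;
}

static uint64_t mixed_case_bytes(void)
{
	return mixed_case_pages() * TOY_PAGE_SIZE;
}
```

The worst case comes out at exactly 128 GiB, and the mixed case at 8703 pages, which is just under 544 MiB.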
      
      In the future, we might want to detect if handling cond_resched() is
      required at all, and just not do any of that with full preemption enabled.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e61abd44
    • David Hildenbrand's avatar
      mm/mmu_gather: add __tlb_remove_folio_pages() · d7f861b9
      David Hildenbrand authored
      Add __tlb_remove_folio_pages(), which will remove multiple consecutive
      pages that belong to the same large folio, instead of only a single page. 
      We'll be using this function when optimizing unmapping/zapping of large
      folios that are mapped by PTEs.
      
      We're using the remaining spare bit in an encoded_page to indicate that
      the next encoded page in an array actually contains the shifted "nr_pages". 
      Teach swap/freeing code about putting multiple folio references, and
      delayed rmap handling to remove page ranges of a folio.
      
      This extension allows for still gathering almost as many small folios as
      we used to (-1, because we have to prepare for a possibly bigger next
      entry), but still allows for gathering consecutive pages that belong to
      the same large folio.
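The encoding idea can be sketched in user space with tagged pointers. This is a toy model under assumptions: the `TOY_*` names, `toy_gather()` and `toy_total_pages()` are hypothetical, the page is represented by any pointer aligned to at least 4 bytes, and the capacity check is how this sketch models the "-1" remark above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two low bits of a (4-byte aligned) page pointer carry flags. */
#define TOY_BIT_DELAY_RMAP    ((uintptr_t)1)
#define TOY_BIT_NR_PAGES_NEXT ((uintptr_t)2)
#define TOY_FLAG_MASK         ((uintptr_t)3)
#define TOY_FLAG_BITS         2

typedef uintptr_t toy_encoded_t;

/*
 * Append one page (nr_pages == 1) or one folio fragment (nr_pages > 1).
 * A fragment takes two slots: the tagged page pointer, then nr_pages
 * shifted past the flag bits. Returns the new fill level, or the old
 * one unchanged when there is not enough room.
 */
static size_t toy_gather(toy_encoded_t *arr, size_t used, size_t cap,
			 void *page, int delay_rmap, unsigned long nr_pages)
{
	uintptr_t flags = delay_rmap ? TOY_BIT_DELAY_RMAP : 0;
	size_t need = nr_pages > 1 ? 2 : 1;

	if (used + need > cap)
		return used;

	if (nr_pages > 1) {
		arr[used++] = (uintptr_t)page | flags | TOY_BIT_NR_PAGES_NEXT;
		arr[used++] = (uintptr_t)nr_pages << TOY_FLAG_BITS;
	} else {
		arr[used++] = (uintptr_t)page | flags;
	}
	return used;
}

/* Walk the array and count how many pages it describes in total. */
static unsigned long toy_total_pages(const toy_encoded_t *arr, size_t used)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < used; i++) {
		if (arr[i] & TOY_BIT_NR_PAGES_NEXT)
			total += arr[++i] >> TOY_FLAG_BITS;
		else
			total += 1;
	}
	return total;
}
```

Because a fragment may need two slots, a gatherer built this way has to leave room for a possible second entry, which is why slightly fewer small folios fit than before.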
      
      Note that we don't pass the folio pointer, because it is not required for
      now.  Further, we don't support page_size != PAGE_SIZE, it won't be
      required for simple PTE batching.
      
      We have to provide a separate s390 implementation, but it's fairly
      straightforward.
      
      Another, more invasive and likely more expensive, approach would be to use
      folio+range or a PFN range instead of page+nr_pages.  But, we should do
      that consistently for the whole mmu_gather.  For now, let's keep it simple
      and add "nr_pages" only.
      
      Note that it is now possible to gather significantly more pages: In the
      past, we were able to gather ~10000 pages, now we can also gather ~5000
      folio fragments that span multiple pages.  A folio fragment on x86-64 can
      span up to 512 pages (2 MiB THP) and on arm64 with 64k in theory 8192
      pages (512 MiB THP).  Gathering more memory is not considered something we
      should worry about, especially because these are already corner cases.
      
      While we can gather more total memory, we won't free more folio fragments.
      As long as page freeing time primarily only depends on the number of
      involved folios, there is no effective change for !preempt configurations.
      However, we'll adjust tlb_batch_pages_flush() separately to handle corner
      cases where page freeing time grows proportionally with the actual memory
      size.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d7f861b9
    • David Hildenbrand's avatar
      mm/mmu_gather: add tlb_remove_tlb_entries() · 4d5bf0b6
      David Hildenbrand authored
      Let's add a helper that lets us batch-process multiple consecutive PTEs.
      
      Note that the loop will get optimized out on all architectures except on
      powerpc.  We have to add an early define of __tlb_remove_tlb_entry() on
      ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries()
      a macro).
      
      [arnd@kernel.org: change __tlb_remove_tlb_entry() to an inline function]
        Link: https://lkml.kernel.org/r/20240221154549.2026073-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20240214204435.167852-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d5bf0b6
    • David Hildenbrand's avatar
      mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP · da510964
      David Hildenbrand authored
      Nowadays, encoded pages are only used in mmu_gather handling.  Let's
      update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP.  While
      at it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.
      
      If encoded page pointers are ever used in another context again, we'd
      likely want to change the defines to reflect their context (e.g.,
      ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP).  For now, let's keep it simple.
      
      This is a preparation for using the remaining spare bit to indicate that
      the next item in an array of encoded pages is a "nr_pages" argument and
      not an encoded page.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      da510964
    • David Hildenbrand's avatar
      mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size() · c30d6bc8
      David Hildenbrand authored
      We have two bits available in the encoded page pointer to store additional
      information.  Currently, we use one bit to request delay of the rmap
      removal until after a TLB flush.
      
      We want to make use of the remaining bit internally for batching of
      multiple pages of the same folio, specifying that the next encoded page
      pointer in an array is actually "nr_pages".  So pass page + delay_rmap
      flag instead of an encoded page, to handle the encoding internally.
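
The pointer-tagging idea described above can be sketched in userspace C. This is only an illustrative model, not the kernel implementation: `struct page` here is a stand-in, the `sketch_*` helper names are hypothetical, and the numeric values of the two macros are assumptions. It shows a caller passing a page plus a `delay_rmap` boolean while the encoding into the low pointer bits stays internal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel's struct page. Its alignment (that of
 * unsigned long, at least 4 on common ABIs) keeps the low two pointer
 * bits zero, which is what makes them usable as flags. */
struct page {
	unsigned long flags;
};

/* Assumed values: two low bits in total, one of them for "delay rmap". */
#define ENCODED_PAGE_BITS		3ul
#define ENCODED_PAGE_BIT_DELAY_RMAP	1ul

struct encoded_page;	/* opaque: must be decoded before use */

/* Hypothetical helper: pack flag bits into the low bits of the pointer. */
static inline struct encoded_page *sketch_encode_page(struct page *page,
							unsigned long flags)
{
	return (struct encoded_page *)((uintptr_t)page |
				       (flags & ENCODED_PAGE_BITS));
}

/* Hypothetical helper: extract only the flag bits. */
static inline unsigned long sketch_encoded_page_flags(struct encoded_page *enc)
{
	return (uintptr_t)enc & ENCODED_PAGE_BITS;
}

/* Hypothetical helper: strip the flag bits to recover the page pointer. */
static inline struct page *sketch_encoded_page_ptr(struct encoded_page *enc)
{
	return (struct page *)((uintptr_t)enc & ~(uintptr_t)ENCODED_PAGE_BITS);
}

/* Caller-facing shape after the change: page + bool in, encoding hidden. */
static inline struct encoded_page *sketch_queue_page(struct page *page,
						     bool delay_rmap)
{
	return sketch_encode_page(page,
				  delay_rmap ? ENCODED_PAGE_BIT_DELAY_RMAP : 0);
}
```

Keeping the encoding behind a single helper means the second spare bit can later be given a meaning without touching any caller.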
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c30d6bc8
    • David Hildenbrand's avatar
      mm/memory: factor out zapping folio pte into zap_present_folio_pte() · 2b42a7e5
      David Hildenbrand authored
      Let's prepare for further changes by factoring it out into a separate
      function.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2b42a7e5
    • David Hildenbrand's avatar
      mm/memory: further separate anon and pagecache folio handling in zap_present_pte() · d11838ed
      David Hildenbrand authored
      We don't need up-to-date accessed-dirty information for anon folios and
      can simply work with the ptent we already have.  Also, we know the RSS
      counter we want to update.
      
      We can safely move arch_check_zapped_pte() + tlb_remove_tlb_entry() +
      zap_install_uffd_wp_if_needed() after updating the folio and RSS.
      
      While at it, only call zap_install_uffd_wp_if_needed() if there is even
      any chance that pte_install_uffd_wp_if_needed() would do *something*. 
      That is, just don't bother if uffd-wp does not apply.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d11838ed
    • David Hildenbrand's avatar
      mm/memory: handle !page case in zap_present_pte() separately · 0cf18e83
      David Hildenbrand authored
      We don't need up-to-date accessed/dirty bits, so in theory we could replace
      ptep_get_and_clear_full() by an optimized ptep_clear_full() function. 
      Let's rely on the provided pte.
      
      Further, there is no scenario where we would have to insert uffd-wp
      markers when zapping something that is not a normal page (i.e., zeropage).
      Add a sanity check to make sure this remains true.
      
      should_zap_folio() no longer has to handle NULL pointers.  This change
      replaces 2/3 "!page/!folio" checks by a single "!page" one.
      
      Note that arch_check_zapped_pte() on x86-64 checks the HW-dirty bit to
      detect shadow stack entries.  But for shadow stack entries, the HW dirty
      bit (in combination with non-writable PTEs) is set by software.  So for
      the arch_check_zapped_pte() check, we don't have to sync against HW
      setting the HW dirty bit concurrently, it is always set.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0cf18e83
    • David Hildenbrand's avatar
      mm/memory: factor out zapping of present pte into zap_present_pte() · 789753e1
      David Hildenbrand authored
      Patch series "mm/memory: optimize unmap/zap with PTE-mapped THP", v3.
      
      This series is based on [1].  Similar to what we did with fork(), let's
      implement PTE batching during unmap/zap when processing PTE-mapped THPs.
      
      We collect consecutive PTEs that map consecutive pages of the same large
      folio, making sure that the other PTE bits are compatible, and (a) adjust
      the refcount only once per batch, (b) call rmap handling functions only
      once per batch, (c) perform batch PTE setting/updates and (d) perform TLB
      entry removal once per batch.
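
The batch collection described above can be modelled roughly in userspace C. This is a simplified sketch under stated assumptions, not the kernel code: `struct sketch_pte`, its fields, and `sketch_pte_batch()` are hypothetical names, and a real PTE is an architecture-specific value rather than a struct. The function counts how many entries, starting at the first, map consecutive page frames of one folio with identical remaining bits.

```c
#include <stddef.h>

/* Hypothetical, simplified PTE: a page frame number, the remaining
 * permission/status bits, and a present marker. */
struct sketch_pte {
	unsigned long pfn;
	unsigned long bits;
	int present;
};

/*
 * Return the number of entries, starting at ptes[0], that can be handled
 * as one batch: each must be present, map the next consecutive PFN, stay
 * below folio_end_pfn (exclusive end of the folio), and carry the same
 * other bits as the first entry. Returns 0 if there is nothing to batch.
 */
static size_t sketch_pte_batch(const struct sketch_pte *ptes, size_t max_nr,
			       unsigned long folio_end_pfn)
{
	size_t nr;

	if (max_nr == 0 || !ptes[0].present)
		return 0;

	for (nr = 1; nr < max_nr; nr++) {
		const struct sketch_pte *p = &ptes[nr];

		if (!p->present)
			break;
		if (p->pfn != ptes[0].pfn + nr || p->pfn >= folio_end_pfn)
			break;
		if (p->bits != ptes[0].bits)
			break;
	}
	return nr;
}
```

With a batch length in hand, a caller could adjust the refcount, the rmap, and the TLB bookkeeping once for `nr` pages instead of `nr` times for one page each.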
      
      Ryan was previously working on this in the context of cont-pte for arm64,
      in the latest iteration [2] with a focus on arm64 with cont-pte only.  This
      series implements the optimization for all architectures, independent of
      such PTE bits, teaches MMU gather/TLB code to be fully aware of such
      large-folio-pages batches as well, and makes use of our new rmap batching
      function when removing the rmap.
      
      To achieve that, we have to enlighten MMU gather / page freeing code
      (i.e., everything that consumes encoded_page) to process unmapping of
      consecutive pages that all belong to the same large folio.  I'm being very
      careful to not degrade order-0 performance, and it looks like I managed to
      achieve that.
      
      While this series should -- similar to [1] -- be beneficial for adding
      cont-pte support on arm64[2], it's one of the requirements for maintaining
      a total mapcount[3] for large folios with minimal added overhead and
      further changes[4] that build up on top of the total mapcount.
      
      Independent of all that, this series results in a speedup during munmap()
      and similar unmapping (process teardown, MADV_DONTNEED on larger ranges)
      with PTE-mapped THP, which is the default with THPs that are smaller than
      a PMD (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
      
      On an Intel Xeon Silver 4210R CPU, munmap'ing a 1GiB VMA backed by
      PTE-mapped folios of the same size (stddev < 1%) results in the following
      runtimes for munmap() in seconds (shorter is better):
      
      Folio Size | mm-unstable |      New | Change
      ---------------------------------------------
            4KiB |    0.058110 | 0.057715 |   - 1%
           16KiB |    0.044198 | 0.035469 |   -20%
           32KiB |    0.034216 | 0.023522 |   -31%
           64KiB |    0.029207 | 0.018434 |   -37%
          128KiB |    0.026579 | 0.014026 |   -47%
          256KiB |    0.025130 | 0.011756 |   -53%
          512KiB |    0.024292 | 0.010703 |   -56%
         1024KiB |    0.023812 | 0.010294 |   -57%
         2048KiB |    0.023785 | 0.009910 |   -58%
      
      [1] https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
      [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
      [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
      [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
      
      
      This patch (of 10):
      
      Let's prepare for further changes by factoring out processing of present
      PTEs.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240214204435.167852-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      789753e1
    • Nhat Pham's avatar
      selftests: add zswapin and no zswap tests · b93c28ff
      Nhat Pham authored
      Add a selftest to cover the zswapin code path, allocating more memory than
      the cgroup limit to trigger swapout/zswapout, then reading the pages back
      in memory several times.  This is inspired by a recently encountered
      kernel crash on the zswapin path in our internal kernel, which went
      undetected because of a lack of test coverage for this path.
      
      Add a selftest to verify that when memory.zswap.max = 0, no pages can go
      to the zswap pool for the cgroup.
      
      [nphamcs@gmail.com: remove redundant comment, add success checks]
        Link: https://lkml.kernel.org/r/20240222043132.616320-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240205225608.3083251-4-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Suggested-by: Rik van Riel <riel@surriel.com>
      Suggested-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b93c28ff
    • Nhat Pham's avatar
      selftests: fix the zswap invasive shrink test · 012688f6
      Nhat Pham authored
      The zswap no invasive shrink selftest breaks because we rename the zswap
      writeback counter (see [1]).  Fix the test.
      
      [1]: https://patchwork.kernel.org/project/linux-kselftest/patch/20231205193307.2432803-1-nphamcs@gmail.com/
      
      Link: https://lkml.kernel.org/r/20240205225608.3083251-3-nphamcs@gmail.com
      Fixes: a697dc2b ("selftests: cgroup: update per-memcg zswap writeback selftest")
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      012688f6
    • Nhat Pham's avatar
      selftests: zswap: add zswap selftest file to zswap maintainer entry · 2b2178c4
      Nhat Pham authored
      Patch series "fix and extend zswap kselftests", v3.
      
      Fix a broken zswap kselftest due to cgroup zswap writeback counter
      renaming, and add 2 zswap kselftests, one to cover the (z)swapin case, and
      another to check that no zswapping happens when the cgroup limit is 0.
      
      Also, add the zswap kselftest file to zswap maintainer entry so that
      get_maintainers script can find zswap maintainers.
      
      
      This patch (of 3):
      
      Make it easier for contributors to find the zswap maintainers when they
      update the zswap tests.
      
      Link: https://lkml.kernel.org/r/20240205225608.3083251-1-nphamcs@gmail.com
      Link: https://lkml.kernel.org/r/20240205225608.3083251-2-nphamcs@gmail.com
      Signed-off-by: Nhat Pham <nphamcs@gmail.com>
      Acked-by: Yosry Ahmed <yosryahmed@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2b2178c4
    • Baolin Wang's avatar
      mm: compaction: limit the suitable target page order to be less than cc->order · 1883e8ac
      Baolin Wang authored
      Isolating target free pages whose order reaches or exceeds cc->order
      cannot reduce fragmentation, especially when cc->order is less than
      pageblock_order.  For example, suppose pageblock_order is MAX_ORDER
      (size is 4M) and cc->order is the 2M THP size: we should not isolate
      other 2M free pages as the migration target, because doing so cannot
      reduce fragmentation.
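
The rule can be illustrated with a small userspace C sketch. It is a model of the idea only, not the kernel's code: `sketch_suitable_target()` and its parameters are hypothetical names, and treating a non-positive `cc_order` as "no specific target order" (falling back to `pageblock_order`) is an assumption made for this example.

```c
#include <stdbool.h>

/*
 * Decide whether a free page of order free_order is worth isolating as a
 * migration target when compacting for an allocation of order cc_order.
 * Free pages already at or above the limit are left alone, because
 * breaking them up would not help satisfy the request.
 */
static bool sketch_suitable_target(int free_order, int cc_order,
				   int pageblock_order)
{
	/* Assumption: cc_order <= 0 means no specific order is requested. */
	int limit = cc_order > 0 ? cc_order : pageblock_order;

	return free_order < limit;
}
```

For the example in the text, with orders counted in 4KiB pages, a 2M free page has order 9 and a 4M pageblock has order 10, so `sketch_suitable_target(9, 9, 10)` rejects the 2M free page while a smaller free page is still accepted.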
      
      Moreover this is also applicable for large folio compaction.
      
      Link: https://lkml.kernel.org/r/afcd9377351c259df7a25a388a4a0d5862b986f4.1705928395.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1883e8ac
    • Barry Song's avatar
      zram: do not allocate physically contiguous strm buffers · 45866e0e
      Barry Song authored
      Currently zram allocates 2 physically contiguous pages per-CPU's
      compression stream (we may have up to 4 streams per-CPU).  Since those
      buffers are per-CPU we allocate them from CPU hotplug path, which may have
      higher risks of failed allocations on devices with fragmented memory.
      
      Switch to virtually contiguous allocations - crypto comp does not seem to
      impose requirements on compression working buffers to be physically
      contiguous.
      
      Link: https://lkml.kernel.org/r/20240213065400.6561-1-21cnbao@gmail.com
      Signed-off-by: Barry Song <v-songbaohua@oppo.com>
      Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      45866e0e
    • Anshuman Khandual's avatar
      mm/hugetlb: move page order check inside hugetlb_cma_reserve() · ce70cfb1
      Anshuman Khandual authored
      All platforms could benefit from page order check against MAX_PAGE_ORDER
      before allocating a CMA area for gigantic hugetlb pages.  Let's move this
      check from individual platforms to generic hugetlb.
      
      Link: https://lkml.kernel.org/r/20240209054221.1403364-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Jane Chu <jane.chu@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ce70cfb1
    • Kinsey Ho's avatar
      mm/mglru: improve swappiness handling · 4acef569
      Kinsey Ho authored
      The reclaimable number of anon pages used to set initial reclaim priority
      is only based on get_swappiness().  Use can_reclaim_anon_pages() to
      include NUMA node demotion.
      
      Also move the swappiness handling of when !__GFP_IO in
      try_to_shrink_lruvec() into isolate_folios().
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-6-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4acef569
    • Kinsey Ho's avatar
      mm/mglru: improve struct lru_gen_mm_walk · cc25bbe1
      Kinsey Ho authored
      Rename max_seq to seq in struct lru_gen_mm_walk to keep consistent with
      struct lru_gen_mm_state.  Note that seq is not always up to date with
      max_seq from lru_gen_folio.
      
      No functional changes.
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-5-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cc25bbe1
    • Kinsey Ho's avatar
      mm/mglru: improve reset_mm_stats() · 2d823764
      Kinsey Ho authored
      struct lruvec* is already a field of struct lru_gen_mm_walk.  Remove the
      parameter struct lruvec* into functions that already have access to struct
      lru_gen_mm_walk*.
      
      Also, we do not need to handle reset histogram stats when
      !should_walk_mmu().  Remove the call to reset_mm_stats() in
      iterate_mm_list_nowalk().
      
      Link: https://lkml.kernel.org/r/20240214060538.3524462-4-kinseyho@google.com
      Signed-off-by: Kinsey Ho <kinseyho@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Donet Tom <donettom@linux.vnet.ibm.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2d823764