22 Feb, 2024 (40 commits)
    • Docs/mm/damon/maintainer-profile: fix reference links for mm-[un]stable tree · 0a1ebc17
      SeongJae Park authored
      Patch series "Docs/mm/damon: misc readability improvements".
      
      Fix trivial mistakes and improve layout of information on different
      documents for DAMON.
      
      
      This patch (of 5):
      
      A couple of sentences in maintainer-profile.rst have reference links
      for the mm-unstable and mm-stable trees with wrong rst markup.  Fix those.
      
      Link: https://lkml.kernel.org/r/20240217005842.87348-1-sj@kernel.org
      Link: https://lkml.kernel.org/r/20240217005842.87348-2-sj@kernel.org
      Signed-off-by: SeongJae Park <sj@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: use per-vma locks in userfaultfd operations · 867a43a3
      Lokesh Gidra authored
      All userfaultfd operations, except write-protect, opportunistically use
      per-vma locks to lock vmas.  On failure, attempt again inside mmap_lock
      critical section.
      
      Write-protect operation requires mmap_lock as it iterates over multiple
      vmas.
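      The locking pattern described above can be sketched as a userspace
      analogy, with pthread mutexes standing in for the kernel's per-VMA and
      mmap locks.  All names here are hypothetical, and the real code retries
      the whole operation under mmap_lock rather than merely switching locks:

```c
#include <assert.h>
#include <pthread.h>

/* Userspace analogy of the opportunistic pattern; not kernel code.
 * Each "VMA" carries its own lock; an operation first tries that lock
 * and only falls back to the coarse, mmap_lock-style lock on failure. */
struct fake_vma {
	pthread_mutex_t vma_lock;
};

static pthread_mutex_t fake_mmap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 1 if the per-VMA fast path was taken, 0 for the fallback. */
int lock_vma_opportunistic(struct fake_vma *vma)
{
	if (pthread_mutex_trylock(&vma->vma_lock) == 0)
		return 1;			/* fast path: per-VMA lock */
	pthread_mutex_lock(&fake_mmap_lock);	/* slow path: coarse lock */
	return 0;
}

void unlock_vma_opportunistic(struct fake_vma *vma, int per_vma)
{
	if (per_vma)
		pthread_mutex_unlock(&vma->vma_lock);
	else
		pthread_mutex_unlock(&fake_mmap_lock);
}
```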
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-5-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: add vma_assert_locked() for !CONFIG_PER_VMA_LOCK · 32af81af
      Lokesh Gidra authored
      vma_assert_locked() is needed to replace mmap_assert_locked() once we
      start using per-vma locks in userfaultfd operations.
      
      In the !CONFIG_PER_VMA_LOCK case, when the mm is locked it implies that
      the given VMA is locked.
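      As a rough userspace model of that implication (all structs and names
      here are mocks, not the kernel's):

```c
#include <assert.h>

/* Mock model of the !CONFIG_PER_VMA_LOCK rule above: holding the
 * whole-mm lock counts as holding every VMA's lock, so the per-VMA
 * assert can simply defer to the mm-wide state.  Not kernel code. */
struct mock_mm  { int mmap_locked; };
struct mock_vma { struct mock_mm *vm_mm; };

/* Returns 1 when the assert would pass, 0 when it would fire. */
int mock_vma_assert_locked(const struct mock_vma *vma)
{
	return vma->vm_mm->mmap_locked; /* mm locked => vma locked */
}
```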
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-4-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: protect mmap_changing with rw_sem in userfaulfd_ctx · 5e4c24a5
      Lokesh Gidra authored
      Increments and loads to mmap_changing are always in mmap_lock critical
      section.  This ensures that if userspace requests event notification for
      non-cooperative operations (e.g.  mremap), userfaultfd operations don't
      occur concurrently.
      
      This can be achieved by using a separate read-write semaphore in
      userfaultfd_ctx such that increments are done in write-mode and loads in
      read-mode, thereby eliminating the dependency on mmap_lock for this
      purpose.
      
      This is a preparatory step before we replace mmap_lock usage with per-vma
      locks in fill/move ioctls.
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-3-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • userfaultfd: move userfaultfd_ctx struct to header file · f91e6b41
      Lokesh Gidra authored
      Patch series "per-vma locks in userfaultfd", v7.
      
      Performing userfaultfd operations (like copy/move etc.) in critical
      section of mmap_lock (read-mode) causes significant contention on the lock
      when operations requiring the lock in write-mode are taking place
      concurrently.  We can use per-vma locks instead to significantly reduce
      the contention issue.
      
      Android runtime's Garbage Collector uses userfaultfd for concurrent
      compaction.  mmap-lock contention during compaction potentially causes a
      jittery experience for the user.  During one such reproducible scenario,
      we observed the following improvements with this patch-set:
      
      - Wall clock time of compaction phase came down from ~3s to <500ms
      - Uninterruptible sleep time (across all threads in the process) was
        ~10ms (none in mmap_lock) during compaction, instead of >20s
      
      
      This patch (of 4):
      
      Move the struct to userfaultfd_k.h to be accessible from mm/userfaultfd.c.
      There are no other changes in the struct.
      
      This is required to prepare for using per-vma locks in userfaultfd
      operations.
      
      Link: https://lkml.kernel.org/r/20240215182756.3448972-1-lokeshgidra@google.com
      Link: https://lkml.kernel.org/r/20240215182756.3448972-2-lokeshgidra@google.com
      Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • kasan: increase the number of bits to shift when recording extra timestamps · 952237b5
      Juntong Deng authored
      In 5d4c6ac9 ("kasan: record and report more information") I thought
      that printk only displays a maximum of 99999 seconds, but actually printk
      can display a larger number of seconds.
      
      So increase the number of bits to shift when recording the extra
      timestamp (44 bits).  Shift it right by 9 bits, discarding only the bits
      below the microsecond part (nanoseconds will not be shown), so the
      displayed precision is unaffected.
      
      Currently the maximum time that can be displayed is 9007199.254740s,
      because
      
      11111111111111111111111111111111111111111111 (44 bits) << 9
      = 11111111111111111111111111111111111111111111000000000 ns
      = 9007199254740480 ns
      ~= 9007199.254740 s
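      That arithmetic can be checked with a standalone C snippet:

```c
#include <stdint.h>

/* Reproduce the commit's arithmetic: the stored stamp is 44 bits of a
 * nanosecond timestamp shifted right by 9, so the largest displayable
 * time is ((2^44 - 1) << 9) ns, i.e. about 9007199.254740 s. */
uint64_t kasan_max_stamp_ns(void)
{
	uint64_t max44 = (1ULL << 44) - 1;	/* 44 one-bits */

	return max44 << 9;			/* restore the dropped bits */
}
```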
      
      Link: https://lkml.kernel.org/r/AM6PR03MB58481629F2F28CE007412139994D2@AM6PR03MB5848.eurprd03.prod.outlook.com
      Fixes: 5d4c6ac9 ("kasan: record and report more information")
      Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • rmap: replace two calls to compound_order with folio_order · 059ab7be
      Matthew Wilcox (Oracle) authored
      Removes two unnecessary conversions from folio to page.  Should be no
      difference in behaviour.
      
      Link: https://lkml.kernel.org/r/20240215205307.674707-1-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dax: fix incorrect list of data cache aliasing architectures · 902ccb86
      Mathieu Desnoyers authored
      commit d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      prevents DAX from building on architectures with virtually aliased
      dcache with:
      
        depends on !(ARM || MIPS || SPARC)
      
      This check is too broad (e.g. recent ARMv7 CPUs don't have virtually
      aliased dcaches), and it also misses many other architectures with
      virtually aliased data caches.
      
      This is a regression introduced in the v4.0 Linux kernel: the dax mount
      option was removed for 32-bit ARMv7 boards which have no data cache
      aliasing and therefore should work fine with FS_DAX.
      
      This was turned into the following check in alloc_dax() by a preparatory
      change:
      
              if (ops && (IS_ENABLED(CONFIG_ARM) ||
                  IS_ENABLED(CONFIG_MIPS) ||
                  IS_ENABLED(CONFIG_SPARC)))
                      return NULL;
      
      Use cpu_dcache_is_aliasing() instead to figure out whether the environment
      has aliasing data caches.
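      A minimal stand-in for the shape of the replacement check (the aliasing
      query is stubbed here; in the kernel it is the new
      cpu_dcache_is_aliasing()):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stubbed stand-in for the runtime query; real kernel code would call
 * cpu_dcache_is_aliasing().  Hard-wired to "no aliasing" for the demo,
 * as on e.g. x86. */
static int dcache_is_aliasing_stub(void)
{
	return 0;
}

/* Shape of the replacement check in alloc_dax()-style code: reject DAX
 * only when ops is set *and* the running CPU actually aliases. */
int dax_alias_gate(const void *ops)
{
	if (ops && dcache_is_aliasing_stub())
		return -EOPNOTSUPP;
	return 0;	/* DAX may proceed */
}
```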
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-10-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Introduce cpu_dcache_is_aliasing() across all architectures · 8690bbcf
      Mathieu Desnoyers authored
      Introduce a generic way to query whether the data cache is virtually
      aliased on all architectures. Its purpose is to ensure that subsystems
      which are incompatible with virtually aliased data caches (e.g. FS_DAX)
      can reliably query this.
      
      For data cache aliasing, there are three scenarios depending on the
      architecture.  Here is a breakdown based on my understanding:
      
      A) The data cache is always aliasing:
      
      * arc
      * csky
      * m68k (note: shared memory mappings are incoherent ? SHMLBA is missing there.)
      * sh
      * parisc
      
      B) The data cache aliasing is statically known or depends on querying CPU
         state at runtime:
      
      * arm (cache_is_vivt() || cache_is_vipt_aliasing())
      * mips (cpu_has_dc_aliases)
      * nios2 (NIOS2_DCACHE_SIZE > PAGE_SIZE)
      * sparc32 (vac_cache_size > PAGE_SIZE)
      * sparc64 (L1DCACHE_SIZE > PAGE_SIZE)
      * xtensa (DCACHE_WAY_SIZE > PAGE_SIZE)
      
      C) The data cache is never aliasing:
      
      * alpha
      * arm64 (aarch64)
      * hexagon
      * loongarch (but with incoherent write buffers, which are disabled since
                   commit d23b7795 ("LoongArch: Change SHMLBA from SZ_64K to PAGE_SIZE"))
      * microblaze
      * openrisc
      * powerpc
      * riscv
      * s390
      * um
      * x86
      
      Require architectures in A) and B) to select ARCH_HAS_CPU_CACHE_ALIASING and
      implement "cpu_dcache_is_aliasing()".
      
      Architectures in C) don't select ARCH_HAS_CPU_CACHE_ALIASING, and thus
      cpu_dcache_is_aliasing() simply evaluates to "false".
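      Several of the category B) checks above boil down to comparing a cache
      way size against the page size; a toy version of that pattern (the
      sizes here are made-up demo inputs, not real hardware values):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy version of the category B) pattern above (e.g. xtensa's
 * DCACHE_WAY_SIZE > PAGE_SIZE): virtual aliasing becomes possible once
 * a cache way spans more than one page, because two virtual addresses
 * mapping the same physical page can then land in different cache
 * lines.  DEMO_PAGE_SIZE is a stand-in, not the kernel's PAGE_SIZE. */
#define DEMO_PAGE_SIZE 4096UL

bool demo_dcache_is_aliasing(unsigned long dcache_way_size)
{
	return dcache_way_size > DEMO_PAGE_SIZE;
}
```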
      
      Note that this leaves "cpu_icache_is_aliasing()" to be implemented as future
      work. This would be useful to gate features like XIP on architectures
      which have aliasing CPU dcache-icache but not CPU dcache-dcache.
      
      Use "cpu_dcache" and "cpu_cache" rather than just "dcache" and "cache"
      to clarify that we really mean "CPU data cache" and "CPU cache" to
      eliminate any possible confusion with VFS "dentry cache" and "page
      cache".
      
      Link: https://lore.kernel.org/lkml/20030910210416.GA24258@mail.jlokier.co.uk/
      Link: https://lkml.kernel.org/r/20240215144633.96437-9-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dax: check for data cache aliasing at runtime · 1df4ca01
      Mathieu Desnoyers authored
      Replace the following fs/Kconfig:FS_DAX dependency:
      
        depends on !(ARM || MIPS || SPARC)
      
      with a runtime check within alloc_dax().  This runtime check returns
      ERR_PTR(-EOPNOTSUPP) if the @ops parameter is non-NULL (which means
      the kernel is using an aliased mapping) on an architecture which
      has data cache aliasing.
      
      Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
      CONFIG_DAX=n for consistency.
      
      This is done in preparation for using cpu_dcache_is_aliasing() in a
      following change which will properly support architectures which detect
      data cache aliasing at runtime.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-8-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • virtio: treat alloc_dax() -EOPNOTSUPP failure as non-fatal · 562ce828
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of virtio
      virtio_fs_setup_dax() to treat alloc_dax() -EOPNOTSUPP failure as
      non-fatal.
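      The non-fatal handling described here (the dm and nvdimm/pmem patches
      below follow the same shape) can be sketched as plain C; the function
      and parameter names are hypothetical, and only the errno values are the
      standard ones:

```c
#include <assert.h>
#include <errno.h>

/* Sketch of the error-handling shape described above: -EOPNOTSUPP from
 * the DAX allocator means "this platform can't do DAX", so the device
 * continues without DAX; any other error remains fatal. */
int setup_dax_demo(long alloc_dax_result, int *dax_enabled)
{
	if (alloc_dax_result < 0) {
		*dax_enabled = 0;
		if (alloc_dax_result == -EOPNOTSUPP)
			return 0;		/* non-fatal: run without DAX */
		return (int)alloc_dax_result;	/* other errors stay fatal */
	}
	*dax_enabled = 1;
	return 0;
}
```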
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-7-mathieu.desnoyers@efficios.com
      Co-developed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dcssblk: handle alloc_dax() -EOPNOTSUPP failure · cf7fe690
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of dcssblk
      dcssblk_add_store() to handle alloc_dax() -EOPNOTSUPP failures.
      
      Considering that s390 is not a data cache aliasing architecture,
      and considering that DCSSBLK selects DAX, a return value of -EOPNOTSUPP
      from alloc_dax() should make dcssblk_add_store() fail.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-6-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dm: treat alloc_dax() -EOPNOTSUPP failure as non-fatal · c2929072
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of dm alloc_dev()
      to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-5-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Suggested-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal · f4d373dd
      Mathieu Desnoyers authored
      In preparation for checking whether the architecture has data cache
      aliasing within alloc_dax(), modify the error handling of nvdimm/pmem
      pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.
      
      [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-4-mathieu.desnoyers@efficios.com
      Fixes: d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dax: alloc_dax() return ERR_PTR(-EOPNOTSUPP) for CONFIG_DAX=n · 6d439c18
      Mathieu Desnoyers authored
      Change the return value from NULL to ERR_PTR(-EOPNOTSUPP) for
      CONFIG_DAX=n to be consistent with the fact that CONFIG_DAX=y
      never returns NULL.
      
      This is done in preparation for using cpu_dcache_is_aliasing() in a
      following change which will properly support architectures which detect
      data cache aliasing at runtime.
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-3-mathieu.desnoyers@efficios.com
      Fixes: 4e4ced93 ("dax: Move mandatory ->zero_page_range() check in alloc_dax()")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • dax: add empty static inline for CONFIG_DAX=n · 2807c54b
      Mathieu Desnoyers authored
      Patch series "Introduce cpu_dcache_is_aliasing() to fix DAX regression",
      v6.
      
      This commit, introduced in v4.0, prevents building FS_DAX on 32-bit ARM,
      even on ARMv7, which does not have virtually aliased data caches:
      
      commit d92576f1 ("dax: does not work correctly with virtual aliasing caches")
      
      even though it used to work fine before.
      
      The root of the issue here is the fact that DAX was never designed to
      handle virtually aliasing data caches (VIVT and VIPT with aliasing data
      cache). It touches the pages through their linear mapping, which is not
      consistent with the userspace mappings with virtually aliasing data
      caches.
      
      This patch series introduces cpu_dcache_is_aliasing() with the new
      Kconfig option ARCH_HAS_CPU_CACHE_ALIASING and implements it for all
      architectures. The implementation of cpu_dcache_is_aliasing() is either
      evaluated to a constant at compile-time or a runtime check, which is
      what is needed on ARM.
      
      With this we can basically narrow down the list of architectures which
      are unsupported by DAX to those which are really affected.
      
      
      This patch (of 9):
      
      When building a kernel with CONFIG_DAX=n, all uses of set_dax_nocache()
      and set_dax_nomc() need to be either within regions of code or compile
      units which are explicitly not compiled, or they need to rely on compiler
      optimizations to eliminate calls to those undefined symbols.
      
      It appears that at least the openrisc and loongarch architectures don't
      end up eliminating those undefined symbols even if they are provably
      within code which is eliminated due to conditional branches depending on
      constants.
      
      Implement empty static inline functions for set_dax_nocache() and
      set_dax_nomc() in CONFIG_DAX=n to ensure those undefined references are
      removed.
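      The shape of that fix can be sketched as follows; DEMO_CONFIG_DAX
      stands in for CONFIG_DAX, and void * stands in for the real
      struct dax_device * parameter:

```c
#include <assert.h>

/* Stub pattern described above: when the feature is compiled out,
 * provide empty static inlines so call sites need no #ifdefs and the
 * linker never sees an undefined reference. */
#ifdef DEMO_CONFIG_DAX
void set_dax_nocache(void *dax_dev);	/* real implementations elsewhere */
void set_dax_nomc(void *dax_dev);
#else
static inline void set_dax_nocache(void *dax_dev) { (void)dax_dev; }
static inline void set_dax_nomc(void *dax_dev) { (void)dax_dev; }
#endif

/* A caller can now use both helpers unconditionally. */
int probe_demo(void *dax_dev)
{
	set_dax_nocache(dax_dev);
	set_dax_nomc(dax_dev);
	return 0;
}
```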
      
      Link: https://lkml.kernel.org/r/20240215144633.96437-1-mathieu.desnoyers@efficios.com
      Link: https://lkml.kernel.org/r/20240215144633.96437-2-mathieu.desnoyers@efficios.com
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202402140037.wGfA1kqX-lkp@intel.com/
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202402131351.a0FZOgEG-lkp@intel.com/
      Fixes: 7ac5360c ("dax: remove the copy_from_iter and copy_to_iter methods")
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Michael Sclafani <dm-devel@lists.linux.dev>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nvdimm/pmem: fix leak on dax_add_host() failure · f6932a27
      Mathieu Desnoyers authored
      Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done
      before setting pmem->dax_dev, which therefore issues the two following
      calls on NULL pointers:
      
      out_cleanup_dax:
              kill_dax(pmem->dax_dev);
              put_dax(pmem->dax_dev);
      
      Link: https://lkml.kernel.org/r/20240208184913.484340-1-mathieu.desnoyers@efficios.com
      Link: https://lkml.kernel.org/r/20240208184913.484340-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Fan Ni <fan.ni@samsung.com>
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Cc: Mike Snitzer <snitzer@kernel.org>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: automatically fold contpte mappings · f0c22649
      Ryan Roberts authored
      There are situations where a change to a single PTE could cause the
      contpte block in which it resides to become foldable (i.e. it could be
      repainted with the contiguous bit).  Such situations arise, for example,
      when user space temporarily changes protections, via mprotect, for
      individual pages, as can be the case for certain garbage collectors.
      
      We would like to detect when such a PTE change occurs.  However this can
      be expensive due to the amount of checking required.  Therefore only
      perform the checks when an individual PTE is modified via mprotect
      (ptep_modify_prot_commit() -> set_pte_at() -> set_ptes(nr=1)) and only
      when we are setting the final PTE in a contpte-aligned block.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-19-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: __always_inline to improve fork() perf · b972fc6a
      Ryan Roberts authored
      As set_ptes() and wrprotect_ptes() become a bit more complex, the compiler
      may choose not to inline them.  But this is critical for fork()
      performance.  So mark the functions, along with contpte_try_unfold() which
      is called by them, as __always_inline.  This is worth ~1% on the fork()
      microbenchmark with order-0 folios (the common case).
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-18-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b972fc6a
    • arm64/mm: implement pte_batch_hint() · fb5451e5
      Ryan Roberts authored
      When core code iterates over a range of ptes and calls ptep_get() for each
      of them, if the range happens to cover contpte mappings, the number of pte
      reads becomes amplified by a factor of the number of PTEs in a contpte
      block.  This is because for each call to ptep_get(), the implementation
      must read all of the ptes in the contpte block to which it belongs in
      order to gather the access and dirty bits.
      
      This causes a hotspot for fork(), as well as for operations that unmap
      memory such as munmap(), exit() and madvise(MADV_DONTNEED).  Fortunately we can fix
      this by implementing pte_batch_hint() which allows their iterators to skip
      getting the contpte tail ptes when gathering the batch of ptes to operate
      on.  This results in the number of PTE reads returning to 1 per pte.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-17-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb5451e5
    • mm: add pte_batch_hint() to reduce scanning in folio_pte_batch() · c6ec76a2
      Ryan Roberts authored
      Some architectures (e.g.  arm64) can tell from looking at a pte whether
      some follow-on ptes also map contiguous physical memory with the same
      pgprot (for arm64, these are contpte mappings).
      
      Take advantage of this knowledge to optimize folio_pte_batch() so that it
      can skip these ptes when scanning to create a batch.  By default, if an
      arch does not opt-in, folio_pte_batch() returns a compile-time 1, so the
      changes are optimized out and the behaviour is as before.
      
      arm64 will opt-in to providing this hint in the next patch, which will
      greatly reduce the cost of ptep_get() when scanning a range of contptes.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-16-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6ec76a2
    • arm64/mm: implement new [get_and_]clear_full_ptes() batch APIs · 6b1e4efb
      Ryan Roberts authored
      Optimize the contpte implementation to fix some of the
      exit/munmap/dontneed performance regression introduced by the initial
      contpte commit.  Subsequent patches will solve it entirely.
      
      During exit(), munmap() or madvise(MADV_DONTNEED), mappings must be
      cleared.  Previously this was done 1 PTE at a time.  But the core-mm
      supports batched clear via the new [get_and_]clear_full_ptes() APIs.  So
      let's implement those APIs and for fully covered contpte mappings, we no
      longer need to unfold the contpte.  This significantly reduces unfolding
      operations, reducing the number of tlbis that must be issued.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-15-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6b1e4efb
    • arm64/mm: implement new wrprotect_ptes() batch API · 311a6cf2
      Ryan Roberts authored
      Optimize the contpte implementation to fix some of the fork performance
      regression introduced by the initial contpte commit.  Subsequent patches
      will solve it entirely.
      
      During fork(), any private memory in the parent must be write-protected. 
      Previously this was done 1 PTE at a time.  But the core-mm supports
      batched wrprotect via the new wrprotect_ptes() API.  So let's implement
      that API and for fully covered contpte mappings, we no longer need to
      unfold the contpte.  This has 2 benefits:
      
        - reduced unfolding, reduces the number of tlbis that must be issued.
        - The memory remains contpte-mapped ("folded") in the parent, so it
          continues to benefit from the more efficient use of the TLB after
          the fork.
      
      The optimization to wrprotect a whole contpte block without unfolding is
      possible thanks to the tightening of the Arm ARM in respect to the
      definition and behaviour when 'Misprogramming the Contiguous bit'.  See
      section D21194 at https://developer.arm.com/documentation/102105/ja-07/
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-14-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      311a6cf2
    • arm64/mm: wire up PTE_CONT for user mappings · 4602e575
      Ryan Roberts authored
      With the ptep API sufficiently refactored, we can now introduce a new
      "contpte" API layer, which transparently manages the PTE_CONT bit for user
      mappings.
      
      In this initial implementation, only suitable batches of PTEs, set via
      set_ptes(), are mapped with the PTE_CONT bit.  Any subsequent modification
      of individual PTEs will cause an "unfold" operation to repaint the contpte
      block as individual PTEs before performing the requested operation. 
      While a modification of a single PTE could cause the block of PTEs to
      which it belongs to become eligible for "folding" into a contpte entry,
      "folding" is not performed in this initial implementation due to the cost
      of checking that the requirements are met.  Due to this, contpte mappings will
      degrade back to normal pte mappings over time if/when protections are
      changed.  This will be solved in a future patch.
      
      Since a contpte block only has a single access and dirty bit, the semantic
      here changes slightly; when getting a pte (e.g.  ptep_get()) that is part
      of a contpte mapping, the access and dirty information are pulled from the
      block (so all ptes in the block return the same access/dirty info).  When
      changing the access/dirty info on a pte (e.g.  ptep_set_access_flags())
      that is part of a contpte mapping, this change will affect the whole
      contpte block.  This works fine in practice since we guarantee that
      only a single folio is mapped by a contpte block, and the core-mm tracks
      access/dirty information per folio.
      
      In order for the public functions, which used to be pure inline, to
      continue to be callable by modules, export all the contpte_* symbols that
      are now called by those public inline functions.
      
      The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
      at build time.  It defaults to enabled as long as its dependency,
      TRANSPARENT_HUGEPAGE, is also enabled.  The core-mm depends upon
      TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not
      enabled, then there is no chance of meeting the physical contiguity
      requirement for contpte mappings.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-13-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: Ard Biesheuvel <ardb@kernel.org>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4602e575
    • arm64/mm: split __flush_tlb_range() to elide trailing DSB · d9d8dc2b
      Ryan Roberts authored
      Split __flush_tlb_range() into __flush_tlb_range_nosync() +
      __flush_tlb_range(), in the same way as the existing flush_tlb_page()
      arrangement.  This allows calling __flush_tlb_range_nosync() to elide the
      trailing DSB.  Forthcoming "contpte" code will take advantage of this when
      clearing the young bit from a contiguous range of ptes.
      
      Ordering between dsb and mmu_notifier_arch_invalidate_secondary_tlbs() has
      changed, but now aligns with the ordering of __flush_tlb_page().  It has
      been discussed that __flush_tlb_page() may be wrong though.  Regardless,
      both will be resolved separately if needed.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-12-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d9d8dc2b
    • arm64/mm: new ptep layer to manage contig bit · 5a00bfd6
      Ryan Roberts authored
      Create a new layer for the in-table PTE manipulation APIs.  For now, the
      existing API is prefixed with double underscore to become the arch-private
      API and the public API is just a simple wrapper that calls the private
      API.
      
      The public API implementation will subsequently be used to transparently
      manipulate the contiguous bit where appropriate.  But since there are
      already some contig-aware users (e.g.  hugetlb, kernel mapper), we must
      first ensure those users use the private API directly so that the future
      contig-bit manipulations in the public API do not interfere with those
      existing uses.
      
      The following APIs are treated this way:
      
       - ptep_get
       - set_pte
       - set_ptes
       - pte_clear
       - ptep_get_and_clear
       - ptep_test_and_clear_young
       - ptep_clear_flush_young
       - ptep_set_wrprotect
       - ptep_set_access_flags
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-11-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5a00bfd6
    • arm64/mm: convert ptep_clear() to ptep_get_and_clear() · cbb0294f
      Ryan Roberts authored
      ptep_clear() is a generic wrapper around the arch-implemented
      ptep_get_and_clear().  We are about to convert ptep_get_and_clear() into a
      public version and private version (__ptep_get_and_clear()) to support the
      transparent contpte work.  We won't have a private version of ptep_clear()
      so let's convert it to directly call ptep_get_and_clear().
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-10-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cbb0294f
    • arm64/mm: convert set_pte_at() to set_ptes(..., 1) · 659e1930
      Ryan Roberts authored
      Since set_ptes() was introduced, set_pte_at() has been implemented as a
      generic macro around set_ptes(..., 1).  So this change should continue to
      generate the same code.  However, making this change prepares us for the
      transparent contpte support.  It means we can reroute set_ptes() to
      __set_ptes().  Since set_pte_at() is a generic macro, there will be no
      equivalent __set_pte_at() to reroute to.
      
      Note that a couple of calls to set_pte_at() remain in the arch code.  This
      is intentional, since those call sites are acting on behalf of core-mm and
      should continue to call into the public set_ptes() rather than the
      arch-private __set_ptes().
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-9-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      659e1930
    • arm64/mm: convert READ_ONCE(*ptep) to ptep_get(ptep) · 53273655
      Ryan Roberts authored
      There are a number of places in the arch code that read a pte by using the
      READ_ONCE() macro.  Refactor these call sites to instead use the
      ptep_get() helper, which itself is a READ_ONCE().  Generated code should
      be the same.
      
      This will benefit us when we shortly introduce the transparent contpte
      support.  In this case, ptep_get() will become more complex so we now have
      all the code abstracted through it.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-8-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Tested-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      53273655
    • mm: tidy up pte_next_pfn() definition · fb23bf6b
      Ryan Roberts authored
      Now that all the architecture overrides of pte_next_pfn() have been
      replaced with pte_advance_pfn(), we can simplify the definition of the
      generic pte_next_pfn() macro so that it is unconditionally defined.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-7-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb23bf6b
    • x86/mm: convert pte_next_pfn() to pte_advance_pfn() · 506b5867
      Ryan Roberts authored
      Core-mm needs to be able to advance the pfn by an arbitrary amount, so
      override the new pte_advance_pfn() API to do so.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-6-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      506b5867
    • arm64/mm: convert pte_next_pfn() to pte_advance_pfn() · c1bd2b40
      Ryan Roberts authored
      Core-mm needs to be able to advance the pfn by an arbitrary amount, so
      override the new pte_advance_pfn() API to do so.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-5-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c1bd2b40
    • mm: introduce pte_advance_pfn() and use for pte_next_pfn() · 583ceaaa
      Ryan Roberts authored
      The goal is to be able to advance a PTE by an arbitrary number of PFNs. 
      So introduce a new API that takes a nr param.  Define the default
      implementation here and allow for architectures to override. 
      pte_next_pfn() becomes a wrapper around pte_advance_pfn().
      
      Follow up commits will convert each overriding architecture's
      pte_next_pfn() to pte_advance_pfn().
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-4-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      583ceaaa
    • mm: thp: batch-collapse PMD with set_ptes() · 2bdba986
      Ryan Roberts authored
      Refactor __split_huge_pmd_locked() so that a present PMD can be collapsed
      to PTEs in a single batch using set_ptes().
      
      This should improve performance a little bit, but the real motivation is
      to remove the need for the arm64 backend to have to fold the contpte
      entries.  Instead, since the ptes are set as a batch, the contpte blocks
      can be initially set up pre-folded (once the arm64 contpte support is
      added in the next few patches).  This leads to noticeable performance
      improvement during split.
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-3-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2bdba986
    • mm: clarify the spec for set_ptes() · 6280d731
      Ryan Roberts authored
      Patch series "Transparent Contiguous PTEs for User Mappings", v6.
      
      This is a series to opportunistically and transparently use contpte
      mappings (set the contiguous bit in ptes) for user memory when those
      mappings meet the requirements.  The change benefits arm64, but there is
      some (very) minor refactoring for x86 to enable its integration with
      core-mm.
      
      It is part of a wider effort to improve performance by allocating and
      mapping variable-sized blocks of memory (folios).  One aim is for the 4K
      kernel to approach the performance of the 16K kernel, but without breaking
      compatibility and without the associated increase in memory.  Another aim
      is to benefit the 16K and 64K kernels by enabling 2M THP, since this is
      the contpte size for those kernels.  We have good performance data that
      demonstrates both aims are being met (see below).
      
      Of course this is only one half of the change.  We require the mapped
      physical memory to be the correct size and alignment for this to actually
      be useful (i.e.  64K for 4K pages, or 2M for 16K/64K pages).  Fortunately
      folios are solving this problem for us.  Filesystems that support it (XFS,
      AFS, EROFS, tmpfs, ...) will allocate large folios up to the PMD size
      today, and more filesystems are coming.  And for anonymous memory,
      "multi-size THP" is now upstream.
      
      
      Patch Layout
      ============
      
      In this version, I've split the patches to better show each optimization:
      
        - 1-2:    mm prep: misc code and docs cleanups
        - 3-6:    mm,arm64,x86 prep: Add pte_advance_pfn() and make pte_next_pfn() a
                  generic wrapper around it
        - 7-11:   arm64 prep: Refactor ptep helpers into new layer
        - 12:     functional contpte implementation
        - 13-18:  various optimizations on top of the contpte implementation
      
      
      Testing
      =======
      
      I've tested this series on both Ampere Altra (bare metal) and Apple M2 (VM):
        - mm selftests (inc new tests written for multi-size THP); no regressions
        - Speedometer JavaScript benchmark in the Chromium web browser; no issues
        - Kernel compilation; no issues
        - Various tests under high memory pressure with swap enabled; no issues
      
      
      Performance
      ===========
      
      High Level Use Cases
      ~~~~~~~~~~~~~~~~~~~~
      
      First some high level use cases (kernel compilation and Speedometer JavaScript
      benchmarks). These are running on Ampere Altra (I've seen similar improvements
      on Android/Pixel 6).
      
      baseline:                  mm-unstable (mTHP switched off)
      mTHP:                      + enable 16K, 32K, 64K mTHP sizes "always"
      mTHP + contpte:            + this series
      mTHP + contpte + exefolio: + patch at [6], which this series supports
      
      Kernel Compilation with -j8 (negative is faster):
      
      | kernel                    | real-time | kern-time | user-time |
      |---------------------------|-----------|-----------|-----------|
      | baseline                  |      0.0% |      0.0% |      0.0% |
      | mTHP                      |     -5.0% |    -39.1% |     -0.7% |
      | mTHP + contpte            |     -6.0% |    -41.4% |     -1.5% |
      | mTHP + contpte + exefolio |     -7.8% |    -43.1% |     -3.4% |
      
      Kernel Compilation with -j80 (negative is faster):
      
      | kernel                    | real-time | kern-time | user-time |
      |---------------------------|-----------|-----------|-----------|
      | baseline                  |      0.0% |      0.0% |      0.0% |
      | mTHP                      |     -5.0% |    -36.6% |     -0.6% |
      | mTHP + contpte            |     -6.1% |    -38.2% |     -1.6% |
      | mTHP + contpte + exefolio |     -7.4% |    -39.2% |     -3.2% |
      
      Speedometer (positive is faster):
      
      | kernel                    | runs_per_min |
      |:--------------------------|--------------|
      | baseline                  |         0.0% |
      | mTHP                      |         1.5% |
      | mTHP + contpte            |         3.2% |
      | mTHP + contpte + exefolio |         4.5% |
      
      
      Micro Benchmarks
      ~~~~~~~~~~~~~~~~
      
      The following microbenchmarks are intended to demonstrate that the
      performance of fork() and munmap() does not regress. I'm showing results for
      order-0 (4K) mappings, and for order-9 (2M) PTE-mapped THP. Thanks to David
      for sharing his benchmarks.
      
      baseline:                  mm-unstable + batch zap [7] series
      contpte-basic:             + patches 0-19; functional contpte implementation
      contpte-batch:             + patches 20-23; implement new batched APIs
      contpte-inline:            + patch 24; __always_inline to help compiler
      contpte-fold:              + patch 25; fold contpte mapping when sensible
      
      Primary platform is Ampere Altra bare metal. I'm also showing results for M2 VM
      (on top of MacOS) for reference, although experience suggests this might not be
      the most reliable for performance numbers of this sort:
      
      | FORK           |         order-0        |         order-9        |
      | Ampere Altra   |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      2.7% |       0.0% |      0.2% |
      | contpte-basic  |       6.3% |      1.4% |    1948.7% |      0.2% |
      | contpte-batch  |       7.6% |      2.0% |      -1.9% |      0.4% |
      | contpte-inline |       3.6% |      1.5% |      -1.0% |      0.2% |
      | contpte-fold   |       4.6% |      2.1% |      -1.8% |      0.2% |
      
      | MUNMAP         |         order-0        |         order-9        |
      | Ampere Altra   |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      0.5% |       0.0% |      0.3% |
      | contpte-basic  |       1.8% |      0.3% |    1104.8% |      0.1% |
      | contpte-batch  |      -0.3% |      0.4% |       2.7% |      0.1% |
      | contpte-inline |      -0.1% |      0.6% |       0.9% |      0.1% |
      | contpte-fold   |       0.1% |      0.6% |       0.8% |      0.1% |
      
      | FORK           |         order-0        |         order-9        |
      | Apple M2 VM    |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      1.4% |       0.0% |      0.8% |
      | contpte-basic  |       6.8% |      1.2% |     469.4% |      1.4% |
      | contpte-batch  |      -7.7% |      2.0% |      -8.9% |      0.7% |
      | contpte-inline |      -6.0% |      2.1% |      -6.0% |      2.0% |
      | contpte-fold   |       5.9% |      1.4% |      -6.4% |      1.4% |
      
      | MUNMAP         |         order-0        |         order-9        |
      | Apple M2 VM    |------------------------|------------------------|
      | (pte-map)      |       mean |     stdev |       mean |     stdev |
      |----------------|------------|-----------|------------|-----------|
      | baseline       |       0.0% |      0.6% |       0.0% |      0.4% |
      | contpte-basic  |       1.6% |      0.6% |     233.6% |      0.7% |
      | contpte-batch  |       1.9% |      0.3% |      -3.9% |      0.4% |
      | contpte-inline |       2.2% |      0.8% |      -1.6% |      0.9% |
      | contpte-fold   |       1.5% |      0.7% |      -1.7% |      0.7% |
      
      Misc
      ~~~~
      
      John Hubbard at Nvidia has indicated dramatic 10x performance improvements
      for some workloads at [8], when using a 64K base-page kernel.
      
      [1] https://lore.kernel.org/linux-arm-kernel/20230622144210.2623299-1-ryan.roberts@arm.com/
      [2] https://lore.kernel.org/linux-arm-kernel/20231115163018.1303287-1-ryan.roberts@arm.com/
      [3] https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-1-ryan.roberts@arm.com/
      [4] https://lore.kernel.org/lkml/20231218105100.172635-1-ryan.roberts@arm.com/
      [5] https://lore.kernel.org/linux-mm/633af0a7-0823-424f-b6ef-374d99483f05@arm.com/
      [6] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647dc3@arm.com/
      [7] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
      [8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520be85@nvidia.com/
      [9] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/contpte-lkml_v6
      
      
      This patch (of 18):
      
      set_ptes() spec implies that it can only be used to set a present pte
      because it interprets the PFN field to increment it.  However,
      set_pte_at() has been implemented on top of set_ptes() since set_ptes()
      was introduced, and set_pte_at() allows setting a pte to a not-present
      state.  So clarify the spec to state that when nr==1, the new state of
      the pte may be present or not present.  When nr>1, the new state of all
      ptes must be present.
      
      While we are at it, tighten the spec to set requirements around the
      initial state of ptes; when nr==1 it may be either present or not-present.
      But when nr>1 all ptes must initially be not-present.  All set_ptes()
      callsites already conform to this requirement.  Stating it explicitly is
      useful because it allows for a simplification to the upcoming arm64
      contpte implementation.
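      The clarified contract can be sketched in a toy user-space model. This is
      not the kernel's set_ptes(): the pte here is a single word with a fake
      present bit, and the asserts merely encode the spec stated above (nr==1
      may set any state; nr>1 requires present new entries over initially
      not-present slots):

```c
#include <assert.h>

typedef unsigned long pte_t;   /* toy pte: bit 0 = present flag, rest = pfn */
#define PTE_PRESENT 1UL

static int pte_present(pte_t pte) { return pte & PTE_PRESENT; }

/* Advance the pfn field, leaving the flag bit alone. */
static pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
{
        return pte + (nr << 1);
}

static void set_ptes(pte_t *ptep, pte_t pte, unsigned int nr)
{
        if (nr > 1) {
                /* nr>1: the new state of all ptes must be present ... */
                assert(pte_present(pte));
                /* ... and the old state of all ptes must be not-present. */
                for (unsigned int i = 0; i < nr; i++)
                        assert(!pte_present(ptep[i]));
        }
        for (unsigned int i = 0; i < nr; i++)
                ptep[i] = pte_advance_pfn(pte, i);
}
```

      With nr==1 the helper degenerates to set_pte_at() and may install a
      not-present entry; with nr>1 the pfn is incremented per slot, which is
      why present entries are required.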
      
      Link: https://lkml.kernel.org/r/20240215103205.2607016-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20240215103205.2607016-2-ryan.roberts@arm.com
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Ard Biesheuvel <ardb@kernel.org>
      Cc: Barry Song <21cnbao@gmail.com>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6280d731
    • David Hildenbrand's avatar
      mm/memory: optimize unmap/zap with PTE-mapped THP · 10ebac4f
      David Hildenbrand authored
      Similar to how we optimized fork(), let's implement PTE batching when
      consecutive (present) PTEs map consecutive pages of the same large folio.
      
      Most infrastructure we need for batching (mmu gather, rmap) is already
      there.  We only have to add get_and_clear_full_ptes() and
      clear_full_ptes().  Similarly, extend zap_install_uffd_wp_if_needed() to
      process a PTE range.
      
      We won't bother sanity-checking the mapcount of all subpages, but only
      check the mapcount of the first subpage we process.  If there is a real
      problem hiding somewhere, we can trigger it simply by using small folios,
      or when we zap single pages of a large folio.  Ideally, we had that check
      in rmap code (including for delayed rmap), but then we cannot print the
      PTE.  Let's keep it simple for now.  If we ever have a cheap
      folio_mapcount(), we might just want to check for underflows there.
      
      To keep small folios as fast as possible, force inlining of a specialized
      variant using __always_inline with nr=1.
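      The forced-inline specialization pattern can be sketched outside the
      kernel. These simplified names (get_and_clear_ptes, get_and_clear_pte)
      are illustrative, not the commit's get_and_clear_full_ptes() API; the
      point is that a constant nr==1 lets the compiler eliminate the loop:

```c
#include <assert.h>

typedef unsigned long pte_t;   /* toy pte */

/* Generic range variant: the loop handles any nr. */
static inline pte_t get_and_clear_ptes(pte_t *ptep, unsigned int nr)
{
        pte_t first = ptep[0];

        for (unsigned int i = 0; i < nr; i++)
                ptep[i] = 0;
        return first;
}

/* Forced-inline specialization with constant nr==1: after inlining, the
 * compiler sees a loop with one trip and drops it entirely, keeping the
 * small-folio (single-pte) path as fast as the non-batched code. */
static inline __attribute__((always_inline))
pte_t get_and_clear_pte(pte_t *ptep)
{
        return get_and_clear_ptes(ptep, 1);
}
```

      The same trick is usable wherever a batched helper must not penalize the
      nr==1 fast path.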
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-11-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      10ebac4f
    • David Hildenbrand's avatar
      mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing · e61abd44
      David Hildenbrand authored
      In tlb_batch_pages_flush(), we can end up freeing up to 512 pages or now
      up to 256 folio fragments that span more than one page, before we
      conditionally reschedule.
      
      It's a pain that we have to handle cond_resched() in
      tlb_batch_pages_flush() manually and cannot simply handle it in
      release_pages() -- release_pages() can be called from atomic context. 
      Well, in a perfect world we wouldn't have to make our code more
      complicated at all.
      
      With page poisoning and init_on_free, we might now run into soft lockups
      when we free a lot of rather large folio fragments, because page freeing
      time then depends on the actual memory size we are freeing instead of on
      the number of folios that are involved.
      
      In the absolute (unlikely) worst case, on arm64 with 64k we will be able
      to free up to 256 folio fragments that each span 512 MiB: zeroing out 128
      GiB does sound like it might take a while.  But instead of ignoring this
      unlikely case, let's just handle it.
      
      So, let's teach tlb_batch_pages_flush() that there are some configurations
      where page freeing is horribly slow, and let's reschedule more frequently
      -- similar to what we did before we had large folio fragments in there.
      Avoid yet another loop over all encoded pages in the common case by
      handling that separately.
      
      Note that with page poisoning/zeroing, we might now end up freeing only a
      single folio fragment at a time that might exceed the old 512 pages limit:
      but if we cannot even free a single MAX_ORDER page on a system without
      running into soft lockups, something else is already completely bogus. 
      Freeing a PMD-mapped THP would similarly cause trouble.
      
      In theory, we might even free 511 order-0 pages + a single MAX_ORDER page,
      effectively having to zero out 8703 pages on arm64 with 64k, translating
      to ~544 MiB of memory: however, if 512 MiB doesn't result in soft lockups,
      544 MiB is unlikely to result in soft lockups, so we won't care about that
      for the time being.
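      The rescheduling policy can be sketched as a user-space toy. This is not
      the kernel code: cond_resched() is stubbed to a counter, the threshold
      name MAX_NR_FOLIOS_PER_FREE and the function names are illustrative; the
      point is that when freeing is expensive, the budget counts pages rather
      than folio fragments:

```c
#include <assert.h>

#define MAX_NR_FOLIOS_PER_FREE 512   /* illustrative threshold */

static int resched_calls;
static void cond_resched(void) { resched_calls++; }

/* Sketch: when page freeing is expensive (page poisoning / init_on_free),
 * charge each fragment for every page it spans, so the time between
 * cond_resched() calls is bounded by memory size, not fragment count. */
static void free_pages_and_resched(const unsigned int *frag_pages,
                                   int nr_frags, int freeing_is_expensive)
{
        unsigned int budget = 0;

        for (int i = 0; i < nr_frags; i++) {
                budget += freeing_is_expensive ? frag_pages[i] : 1;
                if (budget >= MAX_NR_FOLIOS_PER_FREE) {
                        cond_resched();
                        budget = 0;
                }
        }
}
```

      With cheap freeing, a handful of huge fragments never trips the budget;
      with poisoning enabled, the same fragments force a reschedule per chunk
      of pages freed.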
      
      In the future, we might want to detect if handling cond_resched() is
      required at all, and just not do any of that with full preemption enabled.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e61abd44
    • David Hildenbrand's avatar
      mm/mmu_gather: add __tlb_remove_folio_pages() · d7f861b9
      David Hildenbrand authored
      Add __tlb_remove_folio_pages(), which will remove multiple consecutive
      pages that belong to the same large folio, instead of only a single page. 
      We'll be using this function when optimizing unmapping/zapping of large
      folios that are mapped by PTEs.
      
      We're using the remaining spare bit in an encoded_page to indicate that
      the next encoded page in the array actually contains a shifted "nr_pages".
      Teach swap/freeing code about putting multiple folio references, and
      delayed rmap handling to remove page ranges of a folio.
      
      This extension allows for still gathering almost as many small folios as
      we used to (-1, because we have to prepare for a possibly bigger next
      entry), but still allows for gathering consecutive pages that belong to
      the same large folio.
      
      Note that we don't pass the folio pointer, because it is not required for
      now.  Further, we don't support page_size != PAGE_SIZE, it won't be
      required for simple PTE batching.
      
      We have to provide a separate s390 implementation, but it's fairly
      straightforward.
      
      Another, more invasive and likely more expensive, approach would be to use
      folio+range or a PFN range instead of page+nr_pages.  But, we should do
      that consistently for the whole mmu_gather.  For now, let's keep it simple
      and add "nr_pages" only.
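      The spare-bit scheme can be sketched in user space. This toy uses raw
      uintptr_t values rather than the kernel's struct encoded_page, and the
      names ENC_NR_PAGES_NEXT, encode() and count_pages() are illustrative;
      only the encoding idea (one low bit flags "the next array slot is a
      shifted nr_pages, not a pointer") reflects the commit:

```c
#include <assert.h>
#include <stdint.h>

/* Pointers to page structures are aligned, so low bits are spare.
 * Bit 1 here flags that the following slot holds a shifted count. */
#define ENC_NR_PAGES_NEXT 2UL

static uintptr_t encode(void *page, int nr_follows)
{
        return (uintptr_t)page | (nr_follows ? ENC_NR_PAGES_NEXT : 0);
}

/* Walk an array of encoded entries, summing how many pages it covers:
 * a flagged entry consumes the next slot as its page count. */
static unsigned long count_pages(const uintptr_t *enc, int n)
{
        unsigned long total = 0;

        for (int i = 0; i < n; i++) {
                if (enc[i] & ENC_NR_PAGES_NEXT)
                        total += enc[++i] >> 2;   /* nr_pages, shifted past flag bits */
                else
                        total += 1;               /* plain single page */
        }
        return total;
}
```

      A single-page folio costs one slot as before; a folio fragment costs two
      (pointer + count), which is why we can gather "almost as many" small
      folios as previously.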
      
      Note that it is now possible to gather significantly more pages: In the
      past, we were able to gather ~10000 pages, now we can also gather ~5000
      folio fragments that span multiple pages.  A folio fragment on x86-64 can
      span up to 512 pages (2 MiB THP) and on arm64 with 64k in theory 8192
      pages (512 MiB THP).  Gathering more memory is not considered something we
      should worry about, especially because these are already corner cases.
      
      While we can gather more total memory, we won't free more folio fragments.
      As long as page freeing time primarily only depends on the number of
      involved folios, there is no effective change for !preempt configurations.
      However, we'll adjust tlb_batch_pages_flush() separately to handle corner
      cases where page freeing time grows proportionally with the actual memory
      size.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d7f861b9
    • David Hildenbrand's avatar
      mm/mmu_gather: add tlb_remove_tlb_entries() · 4d5bf0b6
      David Hildenbrand authored
      Let's add a helper that lets us batch-process multiple consecutive PTEs.
      
      Note that the loop will get optimized out on all architectures except on
      powerpc.  We have to add an early define of __tlb_remove_tlb_entry() on
      ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries()
      a macro).
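      The shape of such a helper can be sketched in user space. This toy
      mmu_gather only tracks a flush range and is not the kernel structure;
      PAGE_SIZE and the function bodies are illustrative. The per-entry arch
      hook would sit in a loop over the nr ptes and compiles away where it is
      a no-op:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL

/* Toy gather state: just the range that needs flushing. */
struct mmu_gather {
        unsigned long start, end;
};

static void tlb_flush_range(struct mmu_gather *tlb,
                            unsigned long addr, unsigned long size)
{
        if (addr < tlb->start)
                tlb->start = addr;
        if (addr + size > tlb->end)
                tlb->end = addr + size;
}

/* Sketch: one call covers nr consecutive ptes instead of nr separate
 * tlb_remove_tlb_entry() calls; on most architectures the per-entry
 * work is a no-op and only this range bookkeeping remains. */
static void tlb_remove_tlb_entries(struct mmu_gather *tlb,
                                   unsigned int nr, unsigned long address)
{
        tlb_flush_range(tlb, address, nr * PAGE_SIZE);
}
```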
      
      [arnd@kernel.org: change __tlb_remove_tlb_entry() to an inline function]
        Link: https://lkml.kernel.org/r/20240221154549.2026073-1-arnd@kernel.org
      Link: https://lkml.kernel.org/r/20240214204435.167852-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d5bf0b6
    • David Hildenbrand's avatar
      mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP · da510964
      David Hildenbrand authored
      Nowadays, encoded pages are only used in mmu_gather handling.  Let's
      update the documentation, and define ENCODED_PAGE_BIT_DELAY_RMAP.  While
      at it, rename ENCODE_PAGE_BITS to ENCODED_PAGE_BITS.
      
      If encoded page pointers would ever be used in other context again, we'd
      likely want to change the defines to reflect their context (e.g.,
      ENCODED_PAGE_FLAG_MMU_GATHER_DELAY_RMAP).  For now, let's keep it simple.
      
      This is a preparation for using the remaining spare bit to indicate that
      the next item in an array of encoded pages is a "nr_pages" argument and
      not an encoded page.
      
      Link: https://lkml.kernel.org/r/20240214204435.167852-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yin Fengwei <fengwei.yin@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      da510964