1. 22 Feb, 2024 40 commits
    • mm/memory: optimize fork() with PTE-mapped THP · f8d93776
      David Hildenbrand authored
      Let's implement PTE batching when consecutive (present) PTEs map
      consecutive pages of the same large folio, and all other PTE bits besides
      the PFNs are equal.
      
      We will optimize folio_pte_batch() separately, to ignore selected PTE
      bits.  This patch is based on work by Ryan Roberts.
      
      Use __always_inline for __copy_present_ptes() and keep the handling for
      single PTEs completely separate from the multi-PTE case: we really want
      the compiler to optimize for the single-PTE case with small folios, to not
      degrade performance.
      
      Note that PTE batching will never exceed a single page table and will
      always stay within VMA boundaries.
      
      Further, processing PTE-mapped THP that may be pinned and have
      PageAnonExclusive set on at least one subpage should work as expected, but
      there is room for improvement: We will repeatedly (1) detect a PTE batch
      (2) detect that we have to copy a page (3) fall back and allocate a single
      page to copy a single page.  For now we won't care as pinned pages are a
      corner case, and we should rather look into maintaining only a single
      PageAnonExclusive bit for large folios.
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-14-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
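The batching condition described in this commit (consecutive present PTEs that map consecutive pages of one large folio, with all non-PFN bits equal) can be sketched in user space as follows. This is only an illustration, not the kernel's folio_pte_batch(): the `sketch_*` names, the 64-bit PTE layout with the PFN starting at bit 12, and the explicit folio bounds passed as arguments are all assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: PFN stored from bit 12 upward, flag bits below it. */
#define SKETCH_PFN_SHIFT 12

typedef uint64_t sketch_pte_t;

/* Build a PTE value from a PFN and the low flag bits (present, writable, ...). */
static sketch_pte_t sketch_mk_pte(uint64_t pfn, uint64_t flags)
{
	return (pfn << SKETCH_PFN_SHIFT) | flags;
}

static uint64_t sketch_pte_pfn(sketch_pte_t pte)
{
	return pte >> SKETCH_PFN_SHIFT;
}

/*
 * Count how many entries, starting at ptes[0], form one batch:
 * each following entry must have the PFN of its predecessor plus one and
 * identical flag bits, and must still belong to the same folio.
 * The caller is assumed to have checked that ptes[0] is present and to cap
 * max_nr so that the scan stays within one page table and one VMA.
 */
static size_t sketch_pte_batch(const sketch_pte_t *ptes, size_t max_nr,
			       uint64_t folio_first_pfn, uint64_t folio_nr_pages)
{
	uint64_t folio_end_pfn = folio_first_pfn + folio_nr_pages;
	sketch_pte_t expected;
	size_t nr;

	if (max_nr == 0)
		return 0;

	expected = ptes[0];
	for (nr = 1; nr < max_nr; nr++) {
		/* Same flags, PFN advanced by exactly one page. */
		expected += 1ULL << SKETCH_PFN_SHIFT;
		if (sketch_pte_pfn(expected) >= folio_end_pfn)
			break;
		if (ptes[nr] != expected)
			break;
	}
	return nr;
}
```

A caller would then adjust the refcount and rmap once for the returned count instead of once per PTE; a return value of 1 corresponds to the single-PTE path that the commit keeps separate.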
    • mm/memory: pass PTE to copy_present_pte() · 53723298
      David Hildenbrand authored
      We already read it, let's just forward it.
      
      This patch is based on work by Ryan Roberts.
      
      [david@redhat.com: fix the hmm "exclusive_cow" selftest]
        Link: https://lkml.kernel.org/r/13f296b8-e882-47fd-b939-c2141dc28717@redhat.com
      Link: https://lkml.kernel.org/r/20240129124649.189745-13-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory: factor out copying the actual PTE in copy_present_pte() · 23ed1908
      David Hildenbrand authored
      Let's prepare for further changes.
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-12-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/mm: use pte_next_pfn() in set_ptes() · 802cc2ab
      David Hildenbrand authored
      Let's use our handy new helper. Note that the implementation is slightly
      different, but shouldn't really make a difference in practice.
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-11-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm/mm: use pte_next_pfn() in set_ptes() · e5ea320a
      David Hildenbrand authored
      Let's use our handy helper now that it's available on all archs.
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/pgtable: make pte_next_pfn() independent of set_ptes() · 6cdfa1d5
      David Hildenbrand authored
      Let's provide pte_next_pfn(), independently of set_ptes().  This allows
      for using the generic pte_next_pfn() version in some arch-specific
      set_ptes() implementations, and prepares for reusing pte_next_pfn() in
      other contexts.
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
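As a rough user-space sketch of the idea behind a generic pte_next_pfn() built on PFN_PTE_SHIFT: when the PFN occupies one contiguous bit field that starts at a known shift, moving to the next PFN is a single addition that leaves the bits below the field untouched. The `sketch2_*` names and the shift value of 10 are assumptions chosen for the example, not the kernel's definitions.

```c
#include <stdint.h>

/* Hypothetical: bit position at which the PFN field starts inside a PTE. */
#define SKETCH2_PFN_PTE_SHIFT 10

/* Extract the PFN from a PTE value under the assumed layout. */
static uint64_t sketch2_pte_pfn(uint64_t pte)
{
	return pte >> SKETCH2_PFN_PTE_SHIFT;
}

/*
 * Return the PTE that maps the next PFN with the same low (non-PFN) bits.
 * This only works when the PFN field is contiguous and sits above all
 * other bits that must be preserved.
 */
static uint64_t sketch2_pte_next_pfn(uint64_t pte)
{
	return pte + (1ULL << SKETCH2_PFN_PTE_SHIFT);
}
```

An architecture whose PFN bits are split across the PTE cannot use this addition as-is and would need its own override.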
    • sparc/pgtable: define PFN_PTE_SHIFT · ce7a9de3
      David Hildenbrand authored
      We want to make use of pte_next_pfn() outside of set_ptes().  Let's simply
      define PFN_PTE_SHIFT, required by pte_next_pfn().
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • s390/pgtable: define PFN_PTE_SHIFT · 4555ac8b
      David Hildenbrand authored
      We want to make use of pte_next_pfn() outside of set_ptes().  Let's simply
      define PFN_PTE_SHIFT, required by pte_next_pfn().
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • riscv/pgtable: define PFN_PTE_SHIFT · 57c254b2
      David Hildenbrand authored
      We want to make use of pte_next_pfn() outside of set_ptes().  Let's simply
      define PFN_PTE_SHIFT, required by pte_next_pfn().
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • powerpc/pgtable: define PFN_PTE_SHIFT · f7dc4d68
      David Hildenbrand authored
      We want to make use of pte_next_pfn() outside of set_ptes().  Let's simply
      define PFN_PTE_SHIFT, required by pte_next_pfn().
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • nios2/pgtable: define PFN_PTE_SHIFT · 3a6a6c3f
      David Hildenbrand authored
      We want to make use of pte_next_pfn() outside of set_ptes().  Let's simply
      define PFN_PTE_SHIFT, required by pte_next_pfn().
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm/pgtable: define PFN_PTE_SHIFT · 12b884f2
      David Hildenbrand authored
      We want to make use of pte_next_pfn() outside of set_ptes().  Let's simply
      define PFN_PTE_SHIFT, required by pte_next_pfn().
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • arm64/mm: make set_ptes() robust when OAs cross 48-bit boundary · 6e8f5887
      Ryan Roberts authored
      Patch series "mm/memory: optimize fork() with PTE-mapped THP", v3.
      
      Now that the rmap overhaul[1] is upstream that provides a clean interface
      for rmap batching, let's implement PTE batching during fork when
      processing PTE-mapped THPs.
      
      This series is partially based on Ryan's previous work[2] to implement
      cont-pte support on arm64, but it's a complete rewrite based on [1] to
      optimize all architectures independent of any such PTE bits, and to use
      the new rmap batching functions that simplify the code and prepare for
      further rmap accounting changes.
      
      We collect consecutive PTEs that map consecutive pages of the same large
      folio, making sure that the other PTE bits are compatible, and (a) adjust
      the refcount only once per batch, (b) call rmap handling functions only
      once per batch and (c) perform batch PTE setting/updates.
      
      While this series should be beneficial for adding cont-pte support on
      ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
      for large folios with minimal added overhead and further changes[4] that
      build up on top of the total mapcount.
      
      Independent of all that, this series results in a speedup during fork with
      PTE-mapped THP, which is the default with THPs that are smaller than a PMD
      (for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).
      
      On an Intel Xeon Silver 4210R CPU, fork'ing with 1GiB of PTE-mapped folios
      of the same size (stddev < 1%) results in the following runtimes for
      fork() (shorter is better):
      
      Folio Size | v6.8-rc1 |      New | Change
      ------------------------------------------
            4KiB | 0.014328 | 0.014035 |   - 2%
           16KiB | 0.014263 | 0.01196  |   -16%
           32KiB | 0.014334 | 0.01094  |   -24%
           64KiB | 0.014046 | 0.010444 |   -26%
          128KiB | 0.014011 | 0.010063 |   -28%
          256KiB | 0.013993 | 0.009938 |   -29%
          512KiB | 0.013983 | 0.00985  |   -30%
         1024KiB | 0.013986 | 0.00982  |   -30%
         2048KiB | 0.014305 | 0.010076 |   -30%
      
      Note that these numbers are even better than the ones from v1 (verified
      over multiple reboots), even though there were only minimal code changes. 
      Well, I removed a pte_mkclean() call for anon folios, maybe that also
      plays a role.
      
      But my experience is that fork() is extremely sensitive to code size,
      inlining, ...  so I suspect that other architectures will see a change
      closer to -20% than -30%, and it will be easy to "lose" some of that
      speedup in the future through subtle code changes.
      
      Next up is PTE batching when unmapping.  Only tested on x86-64. 
      Compile-tested on most other architectures.
      
      [1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
      [3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
      [4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
      [5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com
      
      
      This patch (of 15):
      
      Since the high bits [51:48] of an OA are not stored contiguously in the
      PTE, there is a theoretical bug in set_ptes(), which just adds PAGE_SIZE
      to the pte to get the pte with the next pfn.  This works until the pfn
      crosses the 48-bit boundary, at which point we overflow into the upper
      attributes.
      
      Of course one could argue (and Matthew Wilcox has :) that we will never
      see a folio cross this boundary because we only allow naturally aligned
      power-of-2 allocation, so this would require a half-petabyte folio.  So
      it's only a theoretical bug.  But it's better that the code is robust
      regardless.
      
      I've implemented pte_next_pfn() as part of the fix, which is an opt-in
      core-mm interface.  So that is now available to the core-mm, which will be
      needed shortly to support forthcoming fork()-batching optimizations.
      
      Link: https://lkml.kernel.org/r/20240129124649.189745-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20240125173534.1659317-1-ryan.roberts@arm.com
      Link: https://lkml.kernel.org/r/20240129124649.189745-2-david@redhat.com
      Fixes: 4a169d61 ("arm64: implement the new page table range API")
      Closes: https://lore.kernel.org/linux-mm/fdaeb9a5-d890-499a-92c8-d171df43ad01@arm.com/
      Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Tested-by: Ryan Roberts <ryan.roberts@arm.com>
      Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Russell King (Oracle) <linux@armlinux.org.uk>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
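The overflow described in this commit can be reproduced in user space with a simplified model. The sketch below assumes a hypothetical 64KiB-page layout in which OA[47:16] sits in PTE bits [47:16] and OA[51:48] sits in PTE bits [15:12]; the `sk_*` names and masks are illustrative and are not the arm64 kernel definitions. It contrasts naively adding the page size to the raw PTE with decoding the address, advancing it, and re-encoding it.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT 16
#define SK_PAGE_SIZE  (1ULL << SK_PAGE_SHIFT)
#define SK_LOW_MASK   0x0000FFFFFFFF0000ULL /* PTE bits [47:16] hold OA[47:16] */
#define SK_HIGH_MASK  0x000000000000F000ULL /* PTE bits [15:12] hold OA[51:48] */
#define SK_ADDR_MASK  (SK_LOW_MASK | SK_HIGH_MASK)

/* Decode the output address from the two non-contiguous PTE fields. */
static uint64_t sk_pte_to_phys(uint64_t pte)
{
	return (pte & SK_LOW_MASK) | ((pte & SK_HIGH_MASK) << 36);
}

/* Encode an output address into the two PTE fields. */
static uint64_t sk_phys_to_pte(uint64_t phys)
{
	return (phys & SK_LOW_MASK) | ((phys >> 36) & SK_HIGH_MASK);
}

/* Naive approach: add the page size to the raw PTE value. */
static uint64_t sk_next_naive(uint64_t pte)
{
	return pte + SK_PAGE_SIZE;
}

/* Robust approach: advance the decoded address, keep the attributes. */
static uint64_t sk_next_robust(uint64_t pte)
{
	uint64_t attrs = pte & ~SK_ADDR_MASK;
	uint64_t phys = sk_pte_to_phys(pte) + SK_PAGE_SIZE;

	return sk_phys_to_pte(phys) | attrs;
}
```

For the last page below the 48-bit boundary, the naive variant carries out of bit 47 into bit 48, which in this model belongs to the attribute bits, while the robust variant moves the carry into the separate high field.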
    • mm/vmscan: change the type of file from int to bool · e321d7c9
      Hao Ge authored
      Change the type of file from int to bool because is_file_lru() returns bool.
      
      Link: https://lkml.kernel.org/r/20240131103802.122920-1-gehao@kylinos.cn
      Signed-off-by: Hao Ge <gehao@kylinos.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: compaction: update the cc->nr_migratepages when allocating or freeing the freepages · ab755bf4
      Baolin Wang authored
      Currently we use the 'cc->nr_freepages >= cc->nr_migratepages' comparison
      to ensure that enough freepages are isolated in isolate_freepages().
      However, compaction_alloc() only decreases cc->nr_freepages without
      updating cc->nr_migratepages, which wastes CPU cycles and causes too
      many freepages to be isolated.
      
      So we should also update the cc->nr_migratepages when allocating or
      freeing the freepages to avoid isolating excess freepages.  And I can see
      fewer free pages are scanned and isolated when running thpcompact on my
      Arm64 server:
      
                                             k6.7         k6.7_patched
      Ops Compaction pages isolated      120692036.00   118160797.00
      Ops Compaction migrate scanned     131210329.00   154093268.00
      Ops Compaction free scanned       1090587971.00  1080632536.00
      Ops Compact scan efficiency               12.03          14.26
      
      Moreover, I did not see any obvious latency improvement; this is likely
      because isolating freepages is not the bottleneck in the thpcompact test
      case.
      
                                    k6.7                  k6.7_patched
      Amean     fault-both-1      1089.76 (   0.00%)     1080.16 *   0.88%*
      Amean     fault-both-3      1616.48 (   0.00%)     1636.65 *  -1.25%*
      Amean     fault-both-5      2266.66 (   0.00%)     2219.20 *   2.09%*
      Amean     fault-both-7      2909.84 (   0.00%)     2801.90 *   3.71%*
      Amean     fault-both-12     4861.26 (   0.00%)     4733.25 *   2.63%*
      Amean     fault-both-18     7351.11 (   0.00%)     6950.51 *   5.45%*
      Amean     fault-both-24     9059.30 (   0.00%)     9159.99 *  -1.11%*
      Amean     fault-both-30    10685.68 (   0.00%)    11399.02 *  -6.68%*
      
      Link: https://lkml.kernel.org/r/6440493f18da82298152b6305d6b41c2962a3ce6.1708409245.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: virtual_address_range: conform to TAP format output · d1d86ce2
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-13-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
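For readers unfamiliar with TAP, the layout these selftest commits conform to is a version line, a plan line ("1..N") and one "ok" or "not ok" line per test. The sketch below produces that layout with plain stdio; it deliberately does not use the kernel's kselftest helpers, and the `sketch_tap_*` names are hypothetical.

```c
#include <stdio.h>

/* Format the plan line, for example "1..3". */
static int sketch_tap_plan(char *buf, size_t len, unsigned int nr_tests)
{
	return snprintf(buf, len, "1..%u\n", nr_tests);
}

/* Format one result line: "ok <num> - <desc>" or "not ok <num> - <desc>". */
static int sketch_tap_result(char *buf, size_t len, unsigned int num,
			     int passed, const char *desc)
{
	return snprintf(buf, len, "%s %u - %s\n",
			passed ? "ok" : "not ok", num, desc);
}

/*
 * Print a complete TAP report for nr tests and return the number of
 * failures, so that a caller can derive its exit status from it.
 */
static unsigned int sketch_tap_report(FILE *out, const int *passed,
				      const char *const *names,
				      unsigned int nr)
{
	char line[256];
	unsigned int i, failures = 0;

	fputs("TAP version 13\n", out);
	sketch_tap_plan(line, sizeof(line), nr);
	fputs(line, out);

	for (i = 0; i < nr; i++) {
		sketch_tap_result(line, sizeof(line), i + 1, passed[i], names[i]);
		fputs(line, out);
		if (!passed[i])
			failures++;
	}
	return failures;
}
```

Informational messages would go on lines starting with "#", which TAP consumers treat as diagnostics rather than results.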
    • selftests/mm: transhuge-stress: conform to TAP format output · c811b0ce
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-12-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • selftests/mm: thuge-gen: conform to TAP format output · b38bd9b2
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Also remove unneeded logging which isn't enabled.  Skip a hugepage size if
      it has too few free pages, to avoid unnecessary failures.  For example,
      some systems may not have a free 1GB hugepage, so skip the 1GB size in
      this test instead of failing the entire test.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-11-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b38bd9b2
    • Muhammad Usama Anjum's avatar
      selftests/mm: split_huge_page_test: conform test to TAP format output · 73588704
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-9-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      73588704
    • Muhammad Usama Anjum's avatar
      selftests/mm: mremap_dontunmap: conform test to TAP format output · a0d47057
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-8-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a0d47057
    • Muhammad Usama Anjum's avatar
      selftests/mm: mrelease_test: conform test to TAP format output · 746f356f
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-7-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      746f356f
    • Muhammad Usama Anjum's avatar
      selftests/mm: mlock2-tests: conform test to TAP format output · 65c89684
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages. 
      I've done some cleanups as well.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-6-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      65c89684
    • Muhammad Usama Anjum's avatar
      selftests/mm: mlock-random-test: conform test to TAP format output · 244ae271
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-5-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      244ae271
    • Muhammad Usama Anjum's avatar
      selftests/mm: map_populate: conform test to TAP format output · 7ef98513
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages. 
      Minor cleanups have also been included.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-4-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7ef98513
    • Muhammad Usama Anjum's avatar
      selftests/mm: map_hugetlb: conform test to TAP format output · d1e7bf2c
      Muhammad Usama Anjum authored
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages.
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-3-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d1e7bf2c
    • Muhammad Usama Anjum's avatar
      selftests/mm: map_fixed_noreplace: conform test to TAP format output · 4838cf70
      Muhammad Usama Anjum authored
      Patch series "conform tests to TAP format output", v2.
      
      
      This patch (of 12):
      
      Conform the layout, informational and status messages to TAP.  No
      functional change is intended other than the layout of output messages. 
      While at it, convert commenting style from // to /**/.
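
As a rough illustration of the output layout these conversions target, the sketch below builds a minimal TAP 13 report. It is a Python model, not the C selftest code; the function name `tap_report` and the test names are hypothetical, and the kernel's own kselftest helpers may emit additional lines.

```python
def tap_report(results):
    """Build TAP 13 lines from (name, status, note) tuples.

    status is "pass", "fail", or "skip" (hypothetical labels for this sketch).
    """
    lines = ["TAP version 13", f"1..{len(results)}"]  # version line, then the plan
    for num, (name, status, note) in enumerate(results, start=1):
        if status == "pass":
            lines.append(f"ok {num} {name}")
        elif status == "skip":
            # A skipped test is reported as "ok" with a SKIP directive.
            lines.append(f"ok {num} {name} # SKIP {note}")
        else:
            lines.append(f"not ok {num} {name}")
    return lines
```

Calling `print("\n".join(tap_report(results)))` yields output that a TAP consumer can parse line by line, which is the point of conforming the tests to one format.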
      
      Link: https://lkml.kernel.org/r/20240202113119.2047740-1-usama.anjum@collabora.com
      Link: https://lkml.kernel.org/r/20240202113119.2047740-2-usama.anjum@collabora.com
      Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4838cf70
    • Suren Baghdasaryan's avatar
      userfaultfd: handle zeropage moves by UFFDIO_MOVE · eb1521da
      Suren Baghdasaryan authored
      Current implementation of UFFDIO_MOVE fails to move zeropages and returns
      EBUSY when it encounters one.  We can handle them by mapping a zeropage at
      the destination and clearing the mapping at the source.  This is done both
      for ordinary and for huge zeropages.
      
      Link: https://lkml.kernel.org/r/20240131175618.2417291-1-surenb@google.com
      Signed-off-by: Suren Baghdasaryan <surenb@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/r/202401300107.U8iMAkTl-lkp@intel.com/
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport (IBM) <rppt@kernel.org>
      Cc: Nicolas Geoffray <ngeoffray@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: ZhangPeng <zhangpeng362@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eb1521da
    • Daniel Gomez's avatar
      XArray: add cmpxchg order test · e777ae44
      Daniel Gomez authored
      XArray multi-index entries do not keep track of the order stored once the
      entry is being marked as used with cmpxchg (conditionally replaced with
      NULL).  Add a test to check the order is actually lost.  The test also
      verifies the order and entries for all the tied indexes before and after
      the NULL replacement with xa_cmpxchg.
      
      Add another entry at 1 << order that keeps the node around and the order
      information for the NULL-entry after xa_cmpxchg.
      
      Link: https://lkml.kernel.org/r/20240131225125.1370598-3-mcgrof@kernel.org
      Signed-off-by: Daniel Gomez <da.gomez@samsung.com>
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e777ae44
    • Luis Chamberlain's avatar
      test_xarray: add tests for advanced multi-index use · a60cc288
      Luis Chamberlain authored
      Patch series "test_xarray: advanced API multi-index tests", v2.
      
      This is a respin of the test_xarray multi-index tests [0] which use and
      demonstrate the advanced API which is used by the page cache.  This should
      let folks more easily follow how we use multi-index to support for example
      a min order later in the page cache.  It also lets us grow the selftests
      to mimic more of what we do in the page cache.
      
      
      This patch (of 2):
      
      The multi index selftests are great but they don't replicate how we deal
      with the page cache exactly, which makes it a bit hard to follow as the
      page cache uses the advanced API.
      
      Add tests which use the advanced API, mimicking what we do in the page
      cache.  While at it, extend the example to do what is needed for min
      order support.
      
      [mcgrof@kernel.org: fix soft lockup for advanced-api tests]
        Link: https://lkml.kernel.org/r/20240216194329.840555-1-mcgrof@kernel.org
      [akpm@linux-foundation.org: s/i/loops/, make non-static]
      [akpm@linux-foundation.org: restore static storage for loop counter]
      Link: https://lkml.kernel.org/r/20240131225125.1370598-1-mcgrof@kernel.org
      Link: https://lkml.kernel.org/r/20240131225125.1370598-2-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Tested-by: Daniel Gomez <da.gomez@samsung.com>
      Cc: Darrick J. Wong <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Pankaj Raghav <p.raghav@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a60cc288
    • Anshuman Khandual's avatar
      mm/cma: don't treat bad input arguments for cma_alloc() as its failure · d818c98a
      Anshuman Khandual authored
      Invalid cma_alloc() input scenarios, including excess allocation
      requests, should neither be counted as CMA_ALLOC_FAIL nor update
      'cma->nr_pages_failed' when CONFIG_CMA_SYSFS is enabled.  This also drops
      the 'out' jump label, which has become redundant.
      
      Link: https://lkml.kernel.org/r/20240201023714.3871061-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d818c98a
    • Christophe Leroy's avatar
      mm: ptdump: add check_wx_pages debugfs attribute · 565474af
      Christophe Leroy authored
      Add a readable attribute in debugfs to trigger a W^X pages check at any
      time.
      
      To trigger the test, just read /sys/kernel/debug/check_wx_pages.  It will
      report FAILED if the test failed, SUCCESS otherwise.
      
      Detailed results are written to dmesg.
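
A minimal sketch of how a script might consume this attribute, assuming the file content is exactly the word SUCCESS or FAILED, optionally followed by a newline. The helper name `wx_check_passed` is hypothetical, and reading debugfs normally requires root and a mounted debugfs.

```python
from pathlib import Path

# Path taken from the commit message above.
DEFAULT_PATH = "/sys/kernel/debug/check_wx_pages"


def wx_check_passed(path=DEFAULT_PATH):
    """Read the attribute (which triggers the check) and return True on SUCCESS."""
    result = Path(path).read_text().strip()
    if result not in ("SUCCESS", "FAILED"):
        # Assumption: any other content is unexpected and worth surfacing.
        raise ValueError(f"unexpected content in {path}: {result!r}")
    return result == "SUCCESS"
```

On a FAILED result the offending mappings would still have to be looked up in dmesg, since the attribute itself only reports the overall outcome.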
      
      Link: https://lkml.kernel.org/r/e947fb1a9f3f5466344823e532d343ff194ae03d.1706610398.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Phong Tran <tranmanphong@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      565474af
    • Christophe Leroy's avatar
      mm: ptdump: have ptdump_check_wx() return bool · 6cdc82db
      Christophe Leroy authored
      Have ptdump_check_wx() return true when the check is successful or false
      otherwise.
      
      [akpm@linux-foundation.org: fix a couple of build issues (x86_64 allmodconfig)]
      Link: https://lkml.kernel.org/r/7943149fe955458cb7b57cd483bf41a3aad94684.1706610398.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Phong Tran <tranmanphong@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6cdc82db
    • Christophe Leroy's avatar
      powerpc,s390: ptdump: define ptdump_check_wx() regardless of CONFIG_DEBUG_WX · 592e15f6
      Christophe Leroy authored
      The following patch will use ptdump_check_wx() regardless of
      CONFIG_DEBUG_WX, so define it at all times on powerpc and s390 just like
      other architectures.  However, keep the WARN_ON_ONCE() only when
      CONFIG_DEBUG_WX is set.
      
      Link: https://lkml.kernel.org/r/07bfb04c7fec58e84413e91d2533581be357a696.1706610398.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Phong Tran <tranmanphong@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      592e15f6
    • Christophe Leroy's avatar
      arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WX · a5e8131a
      Christophe Leroy authored
      All architectures using the core ptdump functionality also implement
      CONFIG_DEBUG_WX, and they all do it more or less the same way, with a
      function called debug_checkwx() that is called by mark_rodata_ro(), which
      is a substitute for ptdump_check_wx() when CONFIG_DEBUG_WX is set and a
      no-op otherwise.
      
      Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call
      debug_checkwx() immediately after calling mark_rodata_ro() instead of
      calling it at the end of every mark_rodata_ro().
      
      On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX
      before calling debug_checkwx().  Now the check is inside the callee
      ptdump_walk_pgd_level_checkwx().
      
      On powerpc_64, mark_rodata_ro() bails out early before calling
      ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature.  The check
      is now also done in ptdump_check_wx() as it is called outside
      mark_rodata_ro().
      
      Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Phong Tran <tranmanphong@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a5e8131a
    • Christophe Leroy's avatar
      arm: ptdump: rename CONFIG_DEBUG_WX to CONFIG_ARM_DEBUG_WX · a90f0a02
      Christophe Leroy authored
      Patch series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages
      debugfs attribute", v2.
      
      This series refactors CONFIG_DEBUG_WX for the 5 architectures implementing
      CONFIG_GENERIC_PTDUMP.
      
      First rename stuff in ARM which uses similar names while not implementing
      CONFIG_GENERIC_PTDUMP.
      
      Then define a generic version of debug_checkwx() that calls
      ptdump_check_wx() when CONFIG_DEBUG_WX is set.  Call it immediately after
      calling mark_rodata_ro() instead of calling it at the end of every
      mark_rodata_ro().
      
      Then implement a debugfs attribute that can be used to trigger a W^X test
      at any time, regardless of CONFIG_DEBUG_WX.
      
      
      This patch (of 5):
      
      CONFIG_DEBUG_WX is a core option defined in mm/Kconfig.debug.
      
      To avoid any future conflict, rename the ARM version to
      CONFIG_ARM_DEBUG_WX.
      
      Link: https://lore.kernel.org/lkml/20200422152656.GF676@willie-the-truck/T/#m802eaf33efd6f8d575939d157301b35ac0d4a64f
      Link: https://github.com/KSPP/linux/issues/35
      Link: https://lkml.kernel.org/r/cover.1706610398.git.christophe.leroy@csgroup.eu
      Link: https://lkml.kernel.org/r/fa297aa90caeb61eee2b70c6c5897a2ab58a9562.1706610398.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org>
      Cc: Borislav Petkov (AMD) <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Phong Tran <tranmanphong@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Alexandre Ghiti <alexghiti@rivosinc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a90f0a02
    • Gregory Price's avatar
      mm/mempolicy: protect task interleave functions with tsk->mems_allowed_seq · 274519ed
      Gregory Price authored
      In the event of rebind, pol->nodemask can change at the same time as an
      allocation occurs.  We can detect this with tsk->mems_allowed_seq and
      prevent a miscount or an allocation failure from occurring.
      
      The same thing happens in the allocators to detect failure, but this can
      prevent spurious failures in a much smaller critical section.
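
The retry pattern described here can be modelled as below. This is a single-threaded Python sketch of a sequence counter, not the kernel's seqcount implementation: the class `SeqCount`, the function `pick_node`, and the `between` hook (used only to inject a rebind mid-read) are all hypothetical names for illustration.

```python
class SeqCount:
    """Toy sequence counter: odd while a write is in progress, even otherwise."""

    def __init__(self):
        self.sequence = 0

    def write(self, update):
        self.sequence += 1  # odd: write in progress
        update()
        self.sequence += 1  # even: write complete

    def read_begin(self):
        return self.sequence

    def read_retry(self, start):
        # Retry if the read began mid-write or a write happened since.
        return start % 2 == 1 or self.sequence != start


def pick_node(policy, seq, between=None):
    """Return the first node of the nodemask, retrying if it changed mid-read."""
    while True:
        start = seq.read_begin()
        nodes = sorted(policy["nodemask"])  # snapshot taken inside the read section
        if between is not None:
            hook, between = between, None  # run the injected writer only once
            hook()
        if not seq.read_retry(start):
            return nodes[0]
```

If a rebind replaces the nodemask between the snapshot and the retry check, the stale snapshot is discarded and the read is repeated, so the caller never acts on a nodemask that changed underneath it.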
      
      [gourry.memverge@gmail.com: weighted interleave checks wrong parameter]
        Link: https://lkml.kernel.org/r/20240206192853.3589-1-gregory.price@memverge.com
      Link: https://lkml.kernel.org/r/20240202170238.90004-5-gregory.price@memverge.com
      Signed-off-by: Gregory Price <gregory.price@memverge.com>
      Suggested-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hasan Al Maruf <Hasan.Maruf@amd.com>
      Cc: Honggyu Kim <honggyu.kim@sk.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
      Cc: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      274519ed
    • Gregory Price's avatar
      mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving · fa3bea4e
      Gregory Price authored
      When a system has multiple NUMA nodes and it becomes bandwidth hungry,
      using the current MPOL_INTERLEAVE could be a wise option.
      
      However, if those NUMA nodes consist of different types of memory such as
      socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based
      interleave policy does not optimally distribute data to make use of their
      different bandwidth characteristics.
      
      Instead, interleave is more effective when the allocation policy follows
      each NUMA node's bandwidth weight rather than a simple 1:1 distribution.
      
      This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
      enabling weighted interleave between NUMA nodes.  Weighted interleave
      allows for proportional distribution of memory across multiple NUMA nodes,
      preferably apportioned to match the bandwidth of each node.
      
      For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
      with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate weight
      distribution is (2:1).
      
      Weights for each node can be assigned via the new sysfs extension:
      /sys/kernel/mm/mempolicy/weighted_interleave/
      
      For now, the default value of all nodes will be `1`, which matches the
      behavior of standard 1:1 round-robin interleave.  An extension will be
      added in the future to allow default values to be registered at kernel and
      device bringup time.
      
      The policy allocates a number of pages equal to the set weights.  For
      example, if the weights are (2,1), then 2 pages will be allocated on node0
      for every 1 page allocated on node1.
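
The (2,1) example can be traced with a small model. The sketch below is a Python illustration of weighted round-robin under the assumption that every weight is at least 1 and that nodes are visited in ascending node-id order; `weighted_interleave_sequence` is a hypothetical name, not a kernel function.

```python
def weighted_interleave_sequence(weights, count):
    """Return the node chosen for each of `count` page allocations.

    weights maps node id -> pages to allocate on that node per round.
    """
    if not weights or any(w < 1 for w in weights.values()):
        raise ValueError("every node needs a weight of at least 1")
    nodes = sorted(weights)
    order = []
    idx = 0
    remaining = weights[nodes[0]]  # pages left on the current node this round
    while len(order) < count:
        if remaining == 0:
            # Weight exhausted: move to the next node and reload its weight.
            idx = (idx + 1) % len(nodes)
            remaining = weights[nodes[idx]]
        order.append(nodes[idx])
        remaining -= 1
    return order
```

With `weights = {0: 2, 1: 1}`, six allocations land on nodes 0, 0, 1, 0, 0, 1, which matches the 2:1 ratio described above.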
      
      The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
      and mbind(2).
      
      Some high level notes about the pieces of weighted interleave:
      
      current->il_prev:
          Tracks the node previously allocated from.
      
      current->il_weight:
          The active weight of the current node (current->il_prev).
          When this reaches 0, current->il_prev is set to the next node
          and current->il_weight is set to the next weight.
      
      weighted_interleave_nodes:
          Counts the number of allocations as they occur, and applies the
          weight for the current node.  When the weight reaches 0, switch
          to the next node.  Operates only on task->mempolicy.
      
      weighted_interleave_nid:
          Gets the total weight of the nodemask as well as each individual
          node weight, then calculates the node based on the given index.
          Operates on VMA policies.
      
      bulk_array_weighted_interleave:
          Gets the total weight of the nodemask as well as each individual
          node weight, then calculates the number of "interleave rounds" as
          well as any delta ("partial round").  Calculates the number of
          pages for each node and allocates them.
      
          If a node was scheduled for interleave via interleave_nodes, the
          current weight will be allocated first.
      
          Operates only on the task->mempolicy.
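
The index-based and bulk calculations in the notes above can be sketched as follows. This is a simplified Python model, assuming weights of at least 1 and ascending node order; it deliberately ignores the carried-over il_weight that the bulk path allocates first, and the names `node_for_index` and `bulk_split` are hypothetical.

```python
def node_for_index(weights, ilx):
    """Map an interleave index to a node using the total weight of the nodemask."""
    nodes = sorted(weights)
    total = sum(weights[n] for n in nodes)
    target = ilx % total  # position within one full round
    for n in nodes:
        if target < weights[n]:
            return n
        target -= weights[n]


def bulk_split(weights, nr_pages):
    """Split nr_pages across nodes: full rounds plus a partial round (delta)."""
    nodes = sorted(weights)
    total = sum(weights[n] for n in nodes)
    rounds, delta = divmod(nr_pages, total)
    counts = {}
    for n in nodes:
        extra = min(delta, weights[n])  # this node's share of the partial round
        counts[n] = rounds * weights[n] + extra
        delta -= extra
    return counts
```

For weights (2,1) and 10 pages, there are 3 full rounds and a partial round of 1 page, giving 7 pages on node0 and 3 on node1.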
      
      One piece of complexity is the interaction between a recent refactor which
      split the logic to acquire the "ilx" (interleave index) of an allocation
      and the actual application of the interleave.  If a call to
      alloc_pages_mpol() were made with a weighted-interleave policy and ilx set
      to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA
      policy - violating the description above.
      
      An inspection of all callers of alloc_pages_mpol() shows that all external
      callers set ilx to `0`, an index value, or will call get_vma_policy() to
      acquire the ilx.
      
      For example, mm/shmem.c may call into alloc_pages_mpol.  The call stacks
      all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This enforces the
      `weighted_interleave_nodes()` and `weighted_interleave_nid()` policy
      requirements (task/vma respectively).
      
      Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com
      Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
      Signed-off-by: Gregory Price <gregory.price@memverge.com>
      Co-developed-by: Rakie Kim <rakie.kim@sk.com>
      Signed-off-by: Rakie Kim <rakie.kim@sk.com>
      Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
      Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
      Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
      Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa3bea4e
    • Gregory Price's avatar
      mm/mempolicy: refactor a read-once mechanism into a function for re-use · 9685e6e3
      Gregory Price authored
      Move the use of barrier() to force policy->nodemask onto the stack into a
      function `read_once_policy_nodemask` so that it may be re-used.
      
      Link: https://lkml.kernel.org/r/20240202170238.90004-3-gregory.price@memverge.com
      Signed-off-by: Gregory Price <gregory.price@memverge.com>
      Suggested-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hasan Al Maruf <Hasan.Maruf@amd.com>
      Cc: Honggyu Kim <honggyu.kim@sk.com>
      Cc: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rakie Kim <rakie.kim@sk.com>
      Cc: Ravi Jonnalagadda <ravis.opensrc@micron.com>
      Cc: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9685e6e3
    • Rakie Kim's avatar
      mm/mempolicy: implement the sysfs-based weighted_interleave interface · dce41f5a
      Rakie Kim authored
      Patch series "mm/mempolicy: weighted interleave mempolicy and sysfs
      extension", v5.
      
      Weighted interleave is a new interleave policy intended to make use of
      heterogeneous memory environments appearing with CXL.
      
      The existing interleave mechanism does an even round-robin distribution of
      memory across all nodes in a nodemask, while weighted interleave
      distributes memory across nodes according to a provided weight.  (Weight =
      # of page allocations per round)
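
      The per-round weighting described above can be sketched in user-space
      C as follows.  This is an illustrative sketch only, not the kernel
      code in mm/mempolicy.c: `struct wi_state`, `wi_init()` and
      `wi_next_node()` are hypothetical names, and every weight is assumed
      to be at least 1.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /* Hypothetical interleave state, for illustration only. */
      struct wi_state {
              size_t cur_node;    /* index of the node currently being filled */
              unsigned remaining; /* pages still owed to cur_node this round */
      };

      /* Arrange for the first call to wi_next_node() to start at node 0. */
      void wi_init(struct wi_state *st, size_t nr_nodes)
      {
              assert(nr_nodes > 0);
              st->cur_node = nr_nodes - 1;
              st->remaining = 0;
      }

      /*
       * Return the node index for the next page allocation.
       * weights[i] is the number of pages node i receives per round.
       */
      size_t wi_next_node(struct wi_state *st, const unsigned *weights,
                          size_t nr_nodes)
      {
              assert(nr_nodes > 0);

              if (st->remaining == 0) {
                      /* Current node's share is used up: move to the next node. */
                      st->cur_node = (st->cur_node + 1) % nr_nodes;
                      assert(weights[st->cur_node] >= 1);
                      st->remaining = weights[st->cur_node];
              }
              st->remaining--;
              return st->cur_node;
      }
      ```

      With weights {5, 1}, twelve consecutive calls return node 0 ten times
      and node 1 twice.  With all weights set to 1 the sketch degenerates to
      plain round-robin.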
      
      Weighted interleave is intended to reduce average latency when bandwidth
      is pressured - therefore increasing total throughput.
      
      In other words: It allows greater use of the total available bandwidth in
      a heterogeneous hardware environment (different hardware provides
      different bandwidth capacity).
      
      As bandwidth is pressured, latency increases - first linearly and then
      exponentially.  By keeping bandwidth usage distributed according to
      available bandwidth, we therefore can reduce the average latency of a
      cacheline fetch.
      
      A good explanation of the bandwidth vs latency response curve:
      https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/
      
      From the article:
      ```
      Constant region:
          The latency response is fairly constant for the first 40%
          of the sustained bandwidth.
      Linear region:
          In between 40% to 80% of the sustained bandwidth, the
          latency response increases almost linearly with the bandwidth
          demand of the system due to contention overhead by numerous
          memory requests.
      Exponential region:
          Between 80% to 100% of the sustained bandwidth, the memory
          latency is dominated by the contention latency which can be
          as much as twice the idle latency or more.
      Maximum sustained bandwidth :
          Is 65% to 75% of the theoretical maximum bandwidth.
      ```
      
      As a general rule of thumb:
      * If bandwidth usage is low, latency does not increase. It is
        optimal to place data in the nearest (lowest latency) device.
      * If bandwidth usage is high, latency increases. It is optimal
        to place data such that bandwidth use is optimized per-device.
      
      This is the top-line goal: provide users a mechanism to target the
      "maximum sustained bandwidth" of each hardware component in a
      heterogeneous memory system.
      
      
      For example, the stream benchmark demonstrates that 1:1 (default)
      interleave is actively harmful, while weighted interleave can be
      beneficial.  Default interleave distributes data such that too much
      pressure is placed on devices with lower available bandwidth.
      
      Stream Benchmark (vs DRAM, 1 Socket + 1 CXL Device)
      Default interleave : -78% (slower than DRAM)
      Global weighting   : -6% to +4% (workload dependent)
      Targeted weights   : +2.5% to +4% (consistently better than DRAM)
      
      Global means the task-policy was set (set_mempolicy), while targeted means
      VMA policies were set (mbind2).  We see weighted interleave is not always
      beneficial when applied globally, but is always beneficial when applied to
      bandwidth-driving memory regions.
      
      
      There are 4 patches in this set:
      1) Implement system-global interleave weights as sysfs extension
         in mm/mempolicy.c.  These weights are RCU protected, and a
         default weight set is provided (all weights are 1 by default).
      
         In future work, we intend to expose an interface for HMAT/CDAT
         code to set reasonable default values based on the memory
         configuration of the system discovered at boot/hotplug.
      
      2) A mild refactor of some interleave-logic for re-use in the
         new weighted interleave logic.
      
      3) MPOL_WEIGHTED_INTERLEAVE extension for set_mempolicy/mbind
      
      4) Protect interleave logic (weighted and normal) with the
         mems_allowed seq cookie.  If the nodemask changes while
         accessing it during a rebind, just retry the access.
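
      The retry-on-change idea in (4) resembles a sequence-counter read
      loop.  The sketch below is a user-space illustration using C11
      atomics, not the kernel's mems_allowed code: `struct guarded_mask`,
      `mask_rebind()` and `mask_read_stable()` are hypothetical names.

      ```c
      #include <assert.h>
      #include <stdatomic.h>
      #include <stddef.h>

      /* Hypothetical shared state: a node mask guarded by a sequence counter. */
      struct guarded_mask {
              atomic_uint seq;            /* even: stable, odd: update in progress */
              _Atomic unsigned long mask; /* bit i set means node i is allowed */
      };

      /* Writer side: make seq odd, update the mask, make seq even again. */
      void mask_rebind(struct guarded_mask *g, unsigned long new_mask)
      {
              assert(g != NULL);
              atomic_fetch_add(&g->seq, 1);
              atomic_store(&g->mask, new_mask);
              atomic_fetch_add(&g->seq, 1);
      }

      /* Reader side: take a cookie, read the mask, retry if the cookie changed. */
      unsigned long mask_read_stable(struct guarded_mask *g)
      {
              unsigned int cookie;
              unsigned long snapshot;

              assert(g != NULL);
              do {
                      /* Wait until no update is in progress (even counter). */
                      while ((cookie = atomic_load(&g->seq)) & 1u)
                              ;
                      snapshot = atomic_load(&g->mask);
              } while (atomic_load(&g->seq) != cookie);

              return snapshot;
      }
      ```

      A reader that overlaps an update sees a different counter value on the
      second load and simply repeats the read, instead of acting on a mask
      that changed underneath it.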
      
      Included below is some performance and LTP test information,
      and a sample numactl branch which can be used for testing.
      
      = Performance summary =
      (tests may have different configurations, see extended info below)
      1) MLC (W2) : +38% over DRAM. +264% over default interleave.
         MLC (W5) : +40% over DRAM. +226% over default interleave.
      2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
      3) XSBench  : +19% over DRAM. +47% over default interleave.
      
      = LTP Testing Summary =
      existing mempolicy & mbind tests: pass
      mempolicy & mbind + weighted interleave (global weights): pass
      
      = version history
      v5:
      - style fixes
      - mems_allowed cookie protection to detect rebind issues,
        prevents spurious allocation failures and/or mis-allocations
      - sparse warning fixes related to __rcu on local variables
      
      =====================================================================
      Performance tests - MLC
      From - Ravi Jonnalagadda <ravis.opensrc@micron.com>
      
      Hardware: Single-socket, multiple CXL memory expanders.
      
      Workload:                               W2
      Data Signature:                         2:1 read:write
      DRAM only bandwidth (GBps):             298.8
      DRAM + CXL (default interleave) (GBps): 113.04
      DRAM + CXL (weighted interleave)(GBps): 412.5
      Gain over DRAM only:                    1.38x
      Gain over default interleave:           3.65x
      
      Workload:                               W5
      Data Signature:                         1:1 read:write
      DRAM only bandwidth (GBps):             273.2
      DRAM + CXL (default interleave) (GBps): 117.23
      DRAM + CXL (weighted interleave)(GBps): 382.7
      Gain over DRAM only:                    1.4x
      Gain over default interleave:           3.26x
      
      =====================================================================
      Performance test - Stream
      From - Gregory Price <gregory.price@memverge.com>
      
      Hardware: Single socket, single CXL expander
      numactl extension: https://github.com/gmprice/numactl/tree/weighted_interleave_master
      
      Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
      Default interleave : -78% (slower than DRAM)
      Global weighting   : -6% to +4% (workload dependent)
      mbind2 weights     : +2.5% to +4% (consistently better than DRAM)
      
      dram only:
      numactl --cpunodebind=1 --membind=1 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
      Function     Direction    BestRateMBs     AvgTime      MinTime      MaxTime
      Copy:        0->0            200923.2     0.032662     0.031853     0.033301
      Scale:       0->0            202123.0     0.032526     0.031664     0.032970
      Add:         0->0            208873.2     0.047322     0.045961     0.047884
      Triad:       0->0            208523.8     0.047262     0.046038     0.048414
      
      CXL-only:
      numactl --cpunodebind=1 -w --membind=2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
      Copy:        0->0             22209.7     0.288661     0.288162     0.289342
      Scale:       0->0             22288.2     0.287549     0.287147     0.288291
      Add:         0->0             24419.1     0.393372     0.393135     0.393735
      Triad:       0->0             24484.6     0.392337     0.392083     0.394331
      
      Based on the above, the optimal weights are ~9:1
      echo 9 > /sys/kernel/mm/mempolicy/weighted_interleave/node1
      echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
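
      One way to arrive at a ratio such as ~9:1 is to express each node's
      measured bandwidth as a rounded multiple of the slowest node's
      bandwidth.  The sketch below shows that arithmetic; `bw_to_weights()`
      is a hypothetical helper and this rounding rule is an assumption, not
      necessarily how the weights above were chosen.

      ```c
      #include <assert.h>
      #include <stddef.h>

      /*
       * Fill weights[] with each node's bandwidth expressed as a rounded
       * multiple of the slowest node's bandwidth (minimum weight 1).
       */
      void bw_to_weights(const double *bw_mbs, unsigned *weights, size_t n)
      {
              double min_bw;

              assert(n > 0);
              min_bw = bw_mbs[0];
              for (size_t i = 1; i < n; i++)
                      if (bw_mbs[i] < min_bw)
                              min_bw = bw_mbs[i];
              assert(min_bw > 0.0);

              for (size_t i = 0; i < n; i++) {
                      /* Adding 0.5 before truncation rounds to nearest. */
                      unsigned w = (unsigned)(bw_mbs[i] / min_bw + 0.5);

                      weights[i] = w ? w : 1;
              }
      }
      ```

      Fed the Copy rates above (200923.2 MB/s for DRAM, 22209.7 MB/s for
      CXL), this yields weights 9 and 1.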
      
      default interleave:
      numactl --cpunodebind=1 --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
      Copy:        0->0             44666.2     0.143671     0.143285     0.144174
      Scale:       0->0             44781.6     0.143256     0.142916     0.143713
      Add:         0->0             48600.7     0.197719     0.197528     0.197858
      Triad:       0->0             48727.5     0.197204     0.197014     0.197439
      
      global weighted interleave:
      numactl --cpunodebind=1 -w --interleave=1,2 ./stream_c.exe --ntimes 100 --array-size 400M --malloc
      Copy:        0->0            190085.9     0.034289     0.033669     0.034645
      Scale:       0->0            207677.4     0.031909     0.030817     0.033061
      Add:         0->0            202036.8     0.048737     0.047516     0.053409
      Triad:       0->0            217671.5     0.045819     0.044103     0.046755
      
      targeted regions w/ global weights (modified stream to mbind2 malloc'd regions):
      numactl --cpunodebind=1 --membind=1 ./stream_c.exe -b --ntimes 100 --array-size 400M --malloc
      Copy:        0->0            205827.0     0.031445     0.031094     0.031984
      Scale:       0->0            208171.8     0.031320     0.030744     0.032505
      Add:         0->0            217352.0     0.045087     0.044168     0.046515
      Triad:       0->0            216884.8     0.045062     0.044263     0.046982
      
      =====================================================================
      Performance tests - XSBench
      From - Hyeongtak Ji <hyeongtak.ji@sk.com>
      
      Hardware: Single socket, Single CXL memory Expander
      
      NUMA node 0: 56 logical cores, 128 GB memory
      NUMA node 2: 96 GB CXL memory
      Threads:     56
      Lookups:     170,000,000
      
      Summary: +19% over DRAM. +47% over default interleave.
      
      1. dram only
      $ numactl -m 0 ./XSBench -s XL -p 5000000
      Runtime:     36.235 seconds
      Lookups/s:   4,691,618
      
      2. default interleave
      $ numactl -i 0,2 ./XSBench -s XL -p 5000000
      Runtime:     55.243 seconds
      Lookups/s:   3,077,293
      
      3. weighted interleave
      $ numactl -w -i 0,2 ./XSBench -s XL -p 5000000
      Runtime:     29.262 seconds
      Lookups/s:   5,809,513
      
      =====================================================================
      LTP Tests: https://github.com/gmprice/ltp/tree/mempolicy2
      
      = Existing tests
      set_mempolicy, get_mempolicy, mbind
      
      MPOL_WEIGHTED_INTERLEAVE was added manually to test basic functionality,
      but the tests were not adjusted for weighting.  The weights were left at
      1, the default, so the policy should behave the same as MPOL_INTERLEAVE
      if the logic is correct.
      
      == set_mempolicy01 : passed   18, failed   0
      == set_mempolicy02 : passed   10, failed   0
      == set_mempolicy03 : passed   64, failed   0
      == set_mempolicy04 : passed   32, failed   0
      == set_mempolicy05 - n/a on non-x86
      == set_mempolicy06 : passed   10, failed   0
         this is set_mempolicy02 + MPOL_WEIGHTED_INTERLEAVE
      == set_mempolicy07 : passed   32, failed   0
         set_mempolicy04 + MPOL_WEIGHTED_INTERLEAVE
      == get_mempolicy01 : passed   12, failed   0
         change: added MPOL_WEIGHTED_INTERLEAVE
      == get_mempolicy02 : passed   2, failed   0
      == mbind01 : passed   15, failed   0
         added MPOL_WEIGHTED_INTERLEAVE
      == mbind02 : passed   4, failed   0
         added MPOL_WEIGHTED_INTERLEAVE
      == mbind03 : passed   16, failed   0
         added MPOL_WEIGHTED_INTERLEAVE
      == mbind04 : passed   48, failed   0
         added MPOL_WEIGHTED_INTERLEAVE
      
      =====================================================================
      numactl (set_mempolicy) w/ global weighting test
      numactl fork: https://github.com/gmprice/numactl/tree/weighted_interleave_master
      
      command: numactl -w --interleave=0,1 ./eatmem
      
      result (weights 1:1):
      0176a000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=32897 N1=32896 kernelpagesize_kB=4
      7fceeb9ff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=32768 N1=32769 kernelpagesize_kB=4
      50% distribution is correct
      
      result (weights 5:1):
      01b14000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=54828 N1=10965 kernelpagesize_kB=4
      7f47a1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=54614 N1=10923 kernelpagesize_kB=4
      16.666% distribution is correct
      
      result (weights 1:5):
      01f07000 weighted interleave:0-1 heap anon=65793 dirty=65793 active=0 N0=10966 N1=54827 kernelpagesize_kB=4
      7f17b1dff000 weighted interleave:0-1 anon=65537 dirty=65537 active=0 N0=10923 N1=54614 kernelpagesize_kB=4
      16.666% distribution is correct
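
      The expected split can be checked with simple arithmetic: a node with
      weight w out of a total weight W should receive roughly total * w / W
      pages.  The helper below is a hypothetical sketch of that check
      (`expected_pages()` is not part of the test program or the kernel).

      ```c
      #include <assert.h>
      #include <stddef.h>

      /*
       * Number of pages, rounded down, that node `node` should receive out
       * of `total` pages when pages are split in proportion to weights[].
       */
      unsigned long expected_pages(unsigned long total, const unsigned *weights,
                                   size_t n, size_t node)
      {
              unsigned long sum = 0;

              assert(node < n);
              for (size_t i = 0; i < n; i++)
                      sum += weights[i];
              assert(sum > 0);

              return total * weights[node] / sum;
      }
      ```

      For the 65793-page heap mapping with weights 5:1 this gives 54827 and
      10965 pages, against the observed N0=54828 and N1=10965.  The minority
      node holds 1/6 of the pages, which is the 16.666% quoted above.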
      
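      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define TOTAL_BYTES (1024*1024*256)
      #define PAGE_BYTES  4096

      /* Allocations are intentionally never freed so the mappings stay resident. */
      int main(void)
      {
              /* One large allocation, touched in full. */
              char *mem = malloc(TOTAL_BYTES);
              if (!mem)
                      return 1;
              memset(mem, 1, TOTAL_BYTES);
              /* Then one page-sized allocation per page, each touched once. */
              for (int i = 0; i < TOTAL_BYTES / PAGE_BYTES; i++) {
                      mem = malloc(PAGE_BYTES);
                      if (!mem)
                              return 1;
                      mem[0] = 1;
              }
              printf("done\n");
              getchar();      /* keep the process alive so its mappings can be inspected */
              return 0;
      }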
      
      
      This patch (of 4):
      
      This patch provides a way to set interleave weight information under sysfs
      at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
      
      The sysfs structure is designed as follows.
      
        $ tree /sys/kernel/mm/mempolicy/
        /sys/kernel/mm/mempolicy/ [1]
        └── weighted_interleave [2]
            ├── node0 [3]
            └── node1
      
      Each file above can be explained as follows.
      
      [1] mm/mempolicy: configuration interface for mempolicy subsystem
      
      [2] weighted_interleave/: config interface for weighted interleave policy
      
      [3] weighted_interleave/nodeN: weight for nodeN
      
      If a node value is set to `0`, the system-default value will be used.
      As of this patch, the system-default for all nodes is always 1.
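
      Setting a weight from a program amounts to writing a decimal value to
      the per-node file.  The sketch below is a minimal user-space example;
      `set_node_weight()` is a hypothetical helper, and the base directory
      is passed in as a parameter so the sketch can be exercised against
      any directory.

      ```c
      #include <assert.h>
      #include <stddef.h>
      #include <stdio.h>

      /*
       * Write `weight` as decimal text to <base>/node<N>.
       * Returns 0 on success and -1 on any failure.
       */
      int set_node_weight(const char *base, int node, unsigned weight)
      {
              char path[256];
              FILE *f;
              int len, ok;

              assert(base != NULL);

              len = snprintf(path, sizeof(path), "%s/node%d", base, node);
              if (len < 0 || (size_t)len >= sizeof(path))
                      return -1;      /* formatting error or path too long */

              f = fopen(path, "w");
              if (!f)
                      return -1;

              ok = fprintf(f, "%u\n", weight) > 0;
              if (fclose(f) != 0)
                      ok = 0;         /* the write may only fail at flush time */

              return ok ? 0 : -1;
      }
      ```

      For example, `set_node_weight("/sys/kernel/mm/mempolicy/weighted_interleave", 1, 9)`
      would attempt the same write as `echo 9 > .../node1`, and typically
      needs sufficient privileges to succeed.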
      
      Link: https://lkml.kernel.org/r/20240202170238.90004-1-gregory.price@memverge.com
      Link: https://lkml.kernel.org/r/20240202170238.90004-2-gregory.price@memverge.com
      Suggested-by: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Rakie Kim <rakie.kim@sk.com>
      Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
      Co-developed-by: Gregory Price <gregory.price@memverge.com>
      Signed-off-by: Gregory Price <gregory.price@memverge.com>
      Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Gregory Price <gourry.memverge@gmail.com>
      Cc: Hasan Al Maruf <Hasan.Maruf@amd.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dce41f5a
    • Yajun Deng's avatar
      mm/mmap: use SZ_{8K, 128K} helper macro · 9c793854
      Yajun Deng authored
      Use SZ_{8K, 128K} helper macro instead of the number in init_user_reserve
      and reserve_mem_notifier. This is more readable.
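
      As a rough illustration of the readability gain, the sketch below
      defines the two macros with the values they have in the kernel's
      linux/sizes.h and uses one of them as a cap.  `cap_reserve_kbytes()`
      is a hypothetical helper, and the shift expressions stand in for the
      replaced literals as an assumption; this is not the code touched by
      the patch.

      ```c
      #include <assert.h>

      #define SZ_8K   0x00002000
      #define SZ_128K 0x00020000

      /* The named sizes equal the shift expressions they are easier to read than. */
      static_assert(SZ_8K == (1UL << 13), "SZ_8K is 8192");
      static_assert(SZ_128K == (1UL << 17), "SZ_128K is 131072");

      /* Hypothetical helper: 1/32 of the free amount, capped at SZ_128K. */
      unsigned long cap_reserve_kbytes(unsigned long free_kbytes)
      {
              unsigned long want = free_kbytes / 32;
              unsigned long cap = (unsigned long)SZ_128K;

              return want < cap ? want : cap;
      }
      ```

      A reader can see the cap at a glance from `SZ_128K`, whereas a bare
      shift or decimal literal has to be worked out by hand.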
      
      Link: https://lkml.kernel.org/r/20240131031913.2058597-1-yajun.deng@linux.dev
      Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      9c793854