1. 17 Sep, 2024 4 commits
    • mm: allow THP orders for PFNMAPs · 5dd40721
      Peter Xu authored
      This enables PFNMAPs to be mapped at either the pmd or the pud layer.
      Generalize the dax case into vma_is_special_huge() so as to cover both.
      Meanwhile, rename the macro to THP_ORDERS_ALL_SPECIAL.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: mark special bits for huge pfn mappings when inject · 3c8e44c9
      Peter Xu authored
      We need these special bits to be around on pfnmaps.  Mark them properly
      for the !devmap case, reflecting that there's no struct page backing the
      entry.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-4-peterx@redhat.com
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: drop is_huge_zero_pud() · ef713ec3
      Peter Xu authored
      It has constantly returned false since 2017.  One assertion was added in
      2019, but it should never have triggered; IOW, what is checked should be
      asserted instead.
      
      If a huge zero pud hasn't existed for 7 years, maybe it's a good idea to
      remove the helper and only add it back when one arrives.
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: introduce ARCH_SUPPORTS_HUGE_PFNMAP and special bits to pmd/pud · 6857be5f
      Peter Xu authored
      Patch series "mm: Support huge pfnmaps", v2.
      
      Overview
      ========
      
      This series implements huge pfnmap support for mm in general.  Huge
      pfnmap allows e.g. VM_PFNMAP vmas to be mapped at either the PMD or PUD
      level, similar to what we already do with dax / thp / hugetlb to benefit
      from TLB hits.  Now we extend that idea to PFN mappings, e.g. PCI MMIO
      BARs, which can grow as large as 8GB or even bigger.
      
      Currently, only x86_64 (1G+2M) and arm64 (2M) are supported.  The last
      patch (from Alex Williamson) will be the first user of huge pfnmap,
      enabling the vfio-pci driver to fault in huge pfn mappings.
      
      Implementation
      ==============
      
      In reality, it's relatively simple to add such support compared to many
      other types of mappings, because of PFNMAP's special property that
      there's no vmemmap backing it, so most of the kernel routines on huge
      mappings should simply already fail for them, like GUP or old-school
      follow_page() (which was recently rewritten into the folio_walk* APIs by
      David).
      
      One tricky part is that PUD handling is still immature in generic paths
      here and there, as DAX is so far the only user.  This patchset adds the
      2nd user.  Hugetlb could be a 3rd user if the hugetlb unification work
      goes smoothly, but that is to be discussed later.
      
      The other tricky part is how to let gup-fast work for such huge mappings
      even though there's no direct way of knowing whether an entry is a normal
      page or an MMIO mapping.  This series chose to keep the pte_special
      solution: it reuses the same idea by setting a special bit on pfnmap
      PMDs/PUDs so that gup-fast can identify them and fail properly.
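
      As a rough illustration of the special-bit idea (a userspace sketch, not
      kernel code), the check a lockless walker has to make could look like
      the following.  The type mock_pmd_t, the bit positions and the helper
      mock_gup_fast_pmd_ok() are all hypothetical; real entry layouts are
      arch specific.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a huge page-table entry; not the kernel's pmd_t. */
typedef struct { uint64_t val; } mock_pmd_t;

/* Assumed bit layout for this sketch only; real layouts are per-arch. */
#define MOCK_PMD_PRESENT (UINT64_C(1) << 0)
#define MOCK_PMD_SPECIAL (UINT64_C(1) << 9)

/* Mark an entry as "special", i.e. not backed by a struct page. */
static inline mock_pmd_t mock_pmd_mkspecial(mock_pmd_t pmd)
{
	pmd.val |= MOCK_PMD_SPECIAL;
	return pmd;
}

static inline bool mock_pmd_special(mock_pmd_t pmd)
{
	return (pmd.val & MOCK_PMD_SPECIAL) != 0;
}

/*
 * Models the decision of a lockless walker: returns true if the entry may
 * be treated as page-backed memory, false if the fast path must give up.
 */
static inline bool mock_gup_fast_pmd_ok(mock_pmd_t pmd)
{
	if (!(pmd.val & MOCK_PMD_PRESENT))
		return false;	/* nothing mapped here */
	if (mock_pmd_special(pmd))
		return false;	/* pfnmap entry: no struct page to pin */
	return true;
}
```

      The point of the sketch is only that the walker needs nothing beyond the
      entry itself to tell a pfnmap apart from a normal huge page.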
      
      Along the way, we'll also notice that the major pgtable pfn walker,
      follow_pte(), will need to retire soon because it only works with ptes.
      A new, simple set of APIs is introduced (the follow_pfnmap* API) that can
      do whatever follow_pte() already does and can also process huge pfnmaps.
      Half of this series is about that and about converting all existing
      pfnmap walkers to use the new API properly.  Hopefully the new API is
      also cleaner in that it avoids exposing e.g. pgtable lock details to the
      callers, so that it can be used in an even more straightforward way.
      
      Three new options are introduced for huge pfnmap:
      
        - ARCH_SUPPORTS_HUGE_PFNMAP
      
          Arch developers will need to select this option in the arch's
          Kconfig when huge pfnmap is supported.  After this patchset is
          applied, both x86_64 and arm64 will start to enable it by default.
      
        - ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
      
          These options let driver developers identify whether the current
          arch / config supports huge pfnmaps, so that they can decide whether
          to use the huge pfnmap APIs to inject them.  Refer to the last
          vfio-pci patch from Alex for how to use them properly in a device
          driver.
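
      A minimal sketch of how a driver fault handler might gate on these
      options follows.  It is userspace C, not the vfio-pci code: the
      SKETCH_* macros stand in for the CONFIG_ARCH_SUPPORTS_PMD_PFNMAP /
      CONFIG_ARCH_SUPPORTS_PUD_PFNMAP symbols, and the order values assume
      4 KiB base pages with 512 entries per table level.

```c
/* Assumption: 4 KiB base pages, 512 entries per page-table level. */
#define SKETCH_PMD_ORDER 9u
#define SKETCH_PUD_ORDER 18u

/*
 * Stand-ins for the Kconfig symbols.  A real build would get them from the
 * kernel configuration; here they are set by hand to model a config that
 * supports PMD pfnmaps but not PUD pfnmaps.
 */
#define SKETCH_SUPPORTS_PMD_PFNMAP 1
/* #define SKETCH_SUPPORTS_PUD_PFNMAP 1 */

enum sketch_fault_result {
	SKETCH_HANDLED,		/* a mapping of the requested order was injected */
	SKETCH_FALLBACK,	/* caller should retry with a smaller order */
};

/* Hypothetical handler: decide whether a fault of this order can be served. */
static enum sketch_fault_result sketch_huge_fault(unsigned int order)
{
	switch (order) {
	case 0:
		return SKETCH_HANDLED;	/* PTE-sized mappings always work */
#ifdef SKETCH_SUPPORTS_PMD_PFNMAP
	case SKETCH_PMD_ORDER:
		return SKETCH_HANDLED;
#endif
#ifdef SKETCH_SUPPORTS_PUD_PFNMAP
	case SKETCH_PUD_ORDER:
		return SKETCH_HANDLED;
#endif
	default:
		return SKETCH_FALLBACK;
	}
}
```

      With the PUD option left undefined, a request for order 18 falls through
      to the fallback path, which is the behavior a driver would want on a
      config without PUD pfnmap support.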
      
      So after the whole set is applied, and if one enables some dynamic debug
      lines in the vfio-pci core files, we should observe things like:
      
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
        vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
      
      In this specific case, it says that vfio-pci faults in PMDs properly for a
      few BAR0 offsets.
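
      The arithmetic behind that reading can be sketched as below, assuming
      4 KiB base pages and that the printed page offset is in units of base
      pages (both are assumptions of this sketch, and the helper names are
      made up for illustration).

```c
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u	/* assumption: 4 KiB base pages */

/* Number of base pages covered by a mapping of the given order. */
static inline uint64_t sketch_pages_for_order(unsigned int order)
{
	return UINT64_C(1) << order;
}

/* Size in bytes of a mapping of the given order. */
static inline uint64_t sketch_bytes_for_order(unsigned int order)
{
	return sketch_pages_for_order(order) << SKETCH_PAGE_SHIFT;
}

/* Byte offset into the BAR for a page offset as printed in the log. */
static inline uint64_t sketch_pgoff_to_bytes(uint64_t pgoff)
{
	return pgoff << SKETCH_PAGE_SHIFT;
}
```

      Under those assumptions, order 9 covers 512 base pages, i.e. 2 MiB, and
      the page offsets 0x0, 0x200 and 0x400 are 512 pages apart, so each log
      line corresponds to one consecutive 2 MiB chunk of BAR 0.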
      
      Patch Layout
      ============
      
      Patch 1:         Introduce the new options mentioned above for huge PFNMAPs
      Patch 2:         A tiny cleanup
      Patch 3-8:       Preparation patches for huge pfnmap (including
                       introducing the special bit for pmd/pud)
      Patch 9-16:      Introduce follow_pfnmap*() API, use it everywhere, and
                       then drop follow_pte() API
      Patch 17:        Add huge pfnmap support for x86_64
      Patch 18:        Add huge pfnmap support for arm64
      Patch 19:        Add vfio-pci support for all kinds of huge pfnmaps (Alex)
      
      TODO
      ====
      
      More architectures / More page sizes
      ------------------------------------
      
      Currently only x86_64 (2M+1G) and arm64 (2M) are supported.  There seems
      to be a plan to support arm64 1G later on top of this series [2].
      
      Any arch will need to first support THP / THP_1G, then provide a special
      bit in pmds/puds to support huge pfnmaps.
      
      remap_pfn_range() support
      -------------------------
      
      Currently, remap_pfn_range() still only maps PTEs.  With the new option,
      remap_pfn_range() can logically start to inject either PMDs or PUDs when
      the alignment requirements match on the VAs.
      
      When the support is there, it should silently benefit all drivers that
      use remap_pfn_range() in their mmap() handlers, giving a better TLB hit
      rate and overall faster MMIO accesses, similar to processor accesses on
      hugepages.
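
      To make the alignment requirement concrete, here is a userspace sketch of
      how a mapping level could be picked.  It is not what remap_pfn_range()
      does today (it only maps PTEs); the shifts assume x86_64-like sizes
      (4 KiB / 2 MiB / 1 GiB), and the extra conditions on the physical address
      and remaining length are assumptions of this sketch beyond the VA
      alignment mentioned above.

```c
#include <stdint.h>

/* Assumed x86_64-like geometry: 4 KiB pages, 2 MiB PMDs, 1 GiB PUDs. */
#define SKETCH_PAGE_SHIFT 12u
#define SKETCH_PMD_SHIFT  21u
#define SKETCH_PUD_SHIFT  30u

enum sketch_level { SKETCH_PTE, SKETCH_PMD, SKETCH_PUD };

/*
 * A huge entry of size (1 << shift) fits only if the virtual address and the
 * physical address are both aligned to that size and at least that many
 * bytes remain to be mapped.
 */
static inline int sketch_fits(uint64_t va, uint64_t pa, uint64_t len,
			      unsigned int shift)
{
	uint64_t size = UINT64_C(1) << shift;
	uint64_t mask = size - 1;

	return (va & mask) == 0 && (pa & mask) == 0 && len >= size;
}

/* Pick the largest level usable at this point of the mapping. */
static inline enum sketch_level sketch_pick_level(uint64_t va, uint64_t pfn,
						  uint64_t len)
{
	uint64_t pa = pfn << SKETCH_PAGE_SHIFT;

	if (sketch_fits(va, pa, len, SKETCH_PUD_SHIFT))
		return SKETCH_PUD;
	if (sketch_fits(va, pa, len, SKETCH_PMD_SHIFT))
		return SKETCH_PMD;
	return SKETCH_PTE;
}
```

      A VA that is off by even one base page forces the sketch back to PTEs,
      which is why the choice of VA matters for drivers hoping to benefit.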
      
      More driver support
      -------------------
      
      VFIO is so far the only consumer of huge pfnmaps after this series is
      applied.  Besides the generic remap_pfn_range() optimization above, a
      device driver can also try to optimize its mmap() for a better VA
      alignment for either PMD or PUD sizes.  This may, iiuc, normally require
      userspace changes, as the driver doesn't normally decide the VA at which
      a BAR is mapped.  But I don't think I know all the drivers well enough
      to see the full picture.
      
      Credit goes to Alex for help testing the GPU/NIC use cases above.
      
      [0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
      [1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
      [2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
      
      
      This patch (of 19):
      
      This patch introduces the option to add a special bit, as already exists
      for ptes, to pmds/puds.  Archs can start to define pmd_special /
      pud_special when supported by selecting the new option.  Per-arch
      support will be added later.
      
      Before that, create fallbacks for these helpers so that they are always
      available.
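
      The fallback pattern can be sketched in userspace C as below.  The
      sketch_* names and the SKETCH_ARCH_HAS_PMD_SPECIAL macro are
      hypothetical stand-ins (for the helpers and for the Kconfig symbol), and
      the bit position is made up; only the shape of the #ifdef is the point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a pmd entry; not the kernel's pmd_t. */
typedef struct { uint64_t val; } sketch_pmd_t;

/*
 * Define SKETCH_ARCH_HAS_PMD_SPECIAL to model an arch that selected the
 * option and supplies its own helpers.  It is left undefined here, so the
 * fallbacks below are the ones that get compiled.
 */
#ifdef SKETCH_ARCH_HAS_PMD_SPECIAL
#define SKETCH_PMD_SPECIAL_BIT (UINT64_C(1) << 9)	/* made-up bit */

static inline bool sketch_pmd_special(sketch_pmd_t pmd)
{
	return (pmd.val & SKETCH_PMD_SPECIAL_BIT) != 0;
}

static inline sketch_pmd_t sketch_pmd_mkspecial(sketch_pmd_t pmd)
{
	pmd.val |= SKETCH_PMD_SPECIAL_BIT;
	return pmd;
}
#else
/* Fallbacks: no entry is ever special, and marking one is a no-op. */
static inline bool sketch_pmd_special(sketch_pmd_t pmd)
{
	(void)pmd;
	return false;
}

static inline sketch_pmd_t sketch_pmd_mkspecial(sketch_pmd_t pmd)
{
	return pmd;
}
#endif
```

      With fallbacks like these, generic code can call the helpers
      unconditionally and the compiler drops the special-entry branches on
      configs that do not support them.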
      
      Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Niklas Schnelle <schnelle@linux.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Ryan Roberts <ryan.roberts@arm.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 09 Sep, 2024 36 commits