    mm/sparse-vmemmap: add a pgmap argument to section activation · e3246d8f
    Joao Martins authored
    Patch series "sparse-vmemmap: memory savings for compound devmaps (device-dax)", v9.
    
    This series minimizes 'struct page' overhead by pursuing an approach
    similar to Muchun Song's series "Free some vmemmap pages of hugetlb page"
    (now merged since v5.14), but applied to devmap with @vmemmap_shift
    (device-dax).
    
    The original idea of vmemmap deduplication (already used in HugeTLB) is
    to reuse/deduplicate tail page vmemmap areas, in particular the area which
    only describes tail pages.  So a vmemmap page describes 64 struct pages, and
    the first page for a given ZONE_DEVICE vmemmap would contain the head page
    and 63 tail pages.  The second vmemmap page would contain only tail pages,
    and that's what gets reused across the rest of the subsection/section. 
    The bigger the page size, the bigger the savings (2M hpage -> save 6
    vmemmap pages; 1G hpage -> save 4094 vmemmap pages).  
    
    This is done for PMEM /specifically only/ on device-dax configured
    namespaces, not fsdax.  In other words, a devmap with a @vmemmap_shift.
    
    In terms of savings, per 1TB of memory, the struct page cost would go down
    with compound devmap:
    
    * with 2M pages we lose 4G instead of 16G (0.39% instead of 1.5% of
      total memory)
    
    * with 1G pages we lose 40MB instead of 16G (0.0014% instead of 1.5% of
      total memory)
    
    The series is mostly summed up by patch 4, and to summarize what the
    series does:
    
    Patches 1 - 3: Minor cleanups in preparation for patch 4.  Move the very
    nice docs of hugetlb_vmemmap.c into a Documentation/vm/ entry.
    
    Patch 4: Patch 4 is the one that takes care of the struct page savings
    (also referred to here as tail-page/vmemmap deduplication).  Much like
    Muchun's series, we reuse the second PTE tail page vmemmap areas across a
    given @vmemmap_shift.  One important difference, though, is that contrary
    to the hugetlbfs series, there's no vmemmap for the area because we are
    late-populating it as opposed to remapping a system-ram range.  IOW no
    freeing of pages of already initialized vmemmap like the case for
    hugetlbfs, which greatly simplifies the logic (besides not being
    arch-specific).  The altmap case is unchanged and still goes via
    vmemmap_populate().  Also adjust the newly added docs to the device-dax
    case.
    
    [Note that device-dax is still a little behind HugeTLB in terms of
    savings.  I have an additional simple patch that reuses the head vmemmap
    page too, as a follow-up.  That will double the savings and namespaces
    initialization.]
    
    Patch 5: Initialize fewer struct pages depending on the page size with
    DRAM backed struct pages -- because fewer pages are unique and most are
    tail pages (with bigger vmemmap_shift).
    
        NVDIMM namespace bootstrap improves from ~268-358 ms to
        ~80-110/<1ms on 128G NVDIMMs with 2M and 1G respectively.  And the
        capacity needed for struct pages will be 3.8x / 1071x smaller for 2M
        and 1G respectively.  Tested on x86 with 1.5TB of pmem (including
        pinning, and RDMA registration/deregistration scalability with 2M MRs)
    
    
    This patch (of 5):
    
    In support of using compound pages for devmap mappings, plumb the pgmap
    down to the vmemmap_populate implementation.  Note that while altmap is
    retrievable from pgmap the memory hotplug code passes altmap without
    pgmap[*], so both need to be independently plumbed.
    
    So in addition to @altmap, pass @pgmap to sparse section populate
    functions namely:
    
    	sparse_add_section
    	  section_activate
    	    populate_section_memmap
    	      __populate_section_memmap
    
    Passing @pgmap allows __populate_section_memmap() both to fetch the
    vmemmap_shift for which memmap metadata is created and to let
    sparse-vmemmap fetch pgmap ranges to correlate with a given section and
    pick whether to just reuse tail pages from previously onlined sections.
    
    While at it, fix the kdoc for @altmap for sparse_add_section().
    
    [*] https://lore.kernel.org/linux-mm/20210319092635.6214-1-osalvador@suse.de/
    
    Link: https://lkml.kernel.org/r/20220420155310.9712-1-joao.m.martins@oracle.com
    Link: https://lkml.kernel.org/r/20220420155310.9712-2-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>