    mm/page_alloc: split prep_compound_page into head and tail subparts · 5b24eeef
    Joao Martins authored
    Patch series "mm, device-dax: Introduce compound pages in devmap", v7.
    
    This series converts device-dax to use compound pages, and moves away
    from the 'struct page per basepage on PMD/PUD' that is done today.
    
    Doing so
     1) unlocks a few noticeable improvements on unpin_user_pages() and
        makes the device-dax+altmap case 4x faster in pinning (numbers
        below and in last patch)
     2) as mentioned in various other threads it's one important step
        towards cleaning up ZONE_DEVICE refcounting.
    
    I've split the compound pages on devmap part from the rest based on
    recent discussions on devmap pending and future work planned[5][6].
    There is consensus that device-dax should be using compound pages to
    represent its PMD/PUDs just like HugeTLB and THP, and that leads to less
    specialization of the dax parts.  I will pursue the rest of the work in
    parallel once this part is merged, in particular the GUP-{slow,fast}
    improvements [7] and the tail struct page deduplication memory savings
    part[8].
    
    To summarize what the series does:
    
    Patch 1: Prepare hwpoisoning to work with dax compound pages.
    
    Patches 2-3: Split the current utility function of prep_compound_page()
    into head and tail and use those two helpers where appropriate to take
    advantage of caches being warm after __init_single_page().  This is used
    when initializing zone device when we bring up device-dax namespaces.
    
    Patches 4-10: Add devmap support for compound pages in device-dax.
    memmap_init_zone_device() initializes its metadata as compound pages, and
    it introduces a new devmap property known as vmemmap_shift which
    outlines how the vmemmap is structured (defaults to base pages as done
    today).  The property describes the page order of the metadata
    essentially.  While at it do a few cleanups in device-dax in patches
    5-9.  Finally enable device-dax usage of devmap @vmemmap_shift to a
    value based on its own @align property.  @vmemmap_shift returns 0 by
    default (which is today's case of base pages in devmap, like fsdax or
    the others) and the usage of compound devmap is optional.  Starting with
    device-dax (*not* fsdax) we enable it by default.  There are a few
    pinning improvements, particularly on the unpinning case and altmap, as
    well as unpin_user_page_range_dirty_lock() being just as effective as
    THP/hugetlb[0] pages.
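
    As a rough illustration of how an alignment could translate into a
    page order, here is a minimal userspace sketch.  It assumes 4 KiB base
    pages and a power-of-two @align; sketch_vmemmap_shift() and
    SKETCH_PAGE_SHIFT are hypothetical names used only for this example
    and are not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB base pages, as on x86. */
#define SKETCH_PAGE_SHIFT 12UL

/*
 * Hypothetical helper: page order (log2 of the number of base pages)
 * covered by one compound page when the device alignment is `align`
 * bytes.  Returns 0 when `align` is a single base page or smaller,
 * i.e. the "base pages" default.
 */
static unsigned int sketch_vmemmap_shift(unsigned long align)
{
	unsigned long nr_pages = align >> SKETCH_PAGE_SHIFT;
	unsigned int order = 0;

	/* The sketch only handles non-zero power-of-two alignments. */
	assert(align != 0 && (align & (align - 1)) == 0);

	/* Integer log2 of nr_pages. */
	while (nr_pages > 1) {
		nr_pages >>= 1;
		order++;
	}
	return order;
}
```

    Under these assumptions a 2M alignment gives an order of 9 and a 1G
    alignment gives an order of 18, while a base-page alignment yields 0.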
    
        $ gup_test -f /dev/dax1.0 -m 16384 -r 10 -S -a -n 512 -w
        (pin_user_pages_fast 2M pages) put:~71 ms -> put:~22 ms
        [altmap]
        (pin_user_pages_fast 2M pages) get:~524ms put:~525 ms -> get: ~127ms put:~71ms
    
        $ gup_test -f /dev/dax1.0 -m 129022 -r 10 -S -a -n 512 -w
        (pin_user_pages_fast 2M pages) put:~513 ms -> put:~188 ms
        [altmap with -m 127004]
        (pin_user_pages_fast 2M pages) get:~4.1 secs put:~4.12 secs -> get:~1sec put:~563ms
    
    Tested on x86 with 1TB+ of pmem (alongside registering it with RDMA with
    and without altmap), alongside gup_test selftests with dynamic dax
    regions and static dax regions.  Coupled with ndctl unit tests for
    dynamic dax devices that exercise all of this.  Note, for dynamic dax
    regions I had to revert commit 8aa83e63 ("x86/setup: Call
    early_reserve_memory() earlier"); it is a known issue that this commit
    broke efi_fake_mem=.
    
    This patch (of 11):
    
    Split the utility function prep_compound_page() into head and tail
    counterparts, and use them accordingly.
    
    This is in preparation for sharing the storage for compound page
    metadata.
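
    The shape of the split can be sketched in userspace as below.  This
    is a simplified model, not the kernel code: struct sketch_page and
    the sketch_prep_*() helpers are hypothetical names, and the fields
    stand in for the much richer state kept in struct page.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct page; fields are illustrative only. */
struct sketch_page {
	int is_head;               /* 1 on the head page, 0 elsewhere */
	struct sketch_page *head;  /* back-pointer on tail pages, else NULL */
	unsigned int order;        /* meaningful on the head page only */
};

/* Head part: record the per-compound-page metadata once. */
static void sketch_prep_head(struct sketch_page *page, unsigned int order)
{
	page->is_head = 1;
	page->head = NULL;
	page->order = order;
}

/* Tail part: link a single tail page back to its head. */
static void sketch_prep_tail(struct sketch_page *head, size_t tail_idx)
{
	struct sketch_page *p = head + tail_idx;

	assert(tail_idx > 0);  /* index 0 is the head itself */
	p->is_head = 0;
	p->head = head;
	p->order = 0;
}

/* The original single entry point, now expressed via the two helpers. */
static void sketch_prep_compound(struct sketch_page *page, unsigned int order)
{
	size_t nr_pages = (size_t)1 << order;

	for (size_t i = 1; i < nr_pages; i++)
		sketch_prep_tail(page, i);
	sketch_prep_head(page, order);
}
```

    With the tail step exposed separately, a caller that already walks
    every page to initialize it can prepare each tail in the same pass
    instead of looping over the range a second time.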
    
    Link: https://lkml.kernel.org/r/20211202204422.26777-1-joao.m.martins@oracle.com
    Link: https://lkml.kernel.org/r/20211202204422.26777-3-joao.m.martins@oracle.com
    Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
    Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
    Reviewed-by: Dan Williams <dan.j.williams@intel.com>
    Reviewed-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Vishal Verma <vishal.l.verma@intel.com>
    Cc: Dave Jiang <dave.jiang@intel.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Jason Gunthorpe <jgg@ziepe.ca>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Jane Chu <jane.chu@oracle.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>