    mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse · 7d8faaf1
    Zach O'Keefe authored
    This idea was introduced by David Rientjes[1].
    
    Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
    a synchronous collapse of memory at their own expense.
    
    The benefits of this approach are:
    
    * CPU is charged to the process that wants to spend the cycles for the
      THP
    * Avoid unpredictable timing of khugepaged collapse
    
    Semantics
    
    This call is independent of the system-wide THP sysfs settings, but will
    fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
    multiple VMAs, the semantics of the collapse over each VMA is independent
    from the others.  This implies a hugepage cannot cross a VMA boundary.  If
    collapse of a given hugepage-aligned/sized region fails, the operation may
    continue to attempt collapsing the remainder of memory specified.
    
    The memory ranges provided must be page-aligned, but are not required to
    be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
    start/end of the range will be clamped to the first/last hugepage-aligned
    address covered by said range.  The memory ranges must span at least one
    hugepage-sized region.
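
    The clamping rule above is plain rounding arithmetic.  A minimal
    userspace sketch, assuming 2 MiB PMD-sized hugepages (true on x86-64
    with 4 KiB base pages, not universal); HPAGE_SIZE, struct hpage_range
    and clamp_to_hpages() are hypothetical names for illustration and are
    not part of this patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized hugepages (x86-64 with 4 KiB base pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)

static_assert((HPAGE_SIZE & (HPAGE_SIZE - 1)) == 0,
	      "hugepage size must be a power of two");

/* Hypothetical helper type: the hugepage-aligned part of a range. */
struct hpage_range {
	uintptr_t start;	/* first hugepage-aligned address in the range */
	uintptr_t end;		/* end of the last full hugepage (exclusive) */
};

/*
 * Round start up and end down to hugepage boundaries.  Returns 0 and
 * fills *out when at least one full hugepage is covered, else -1.
 * Address overflow is not handled in this sketch.
 */
static int clamp_to_hpages(uintptr_t start, size_t len,
			   struct hpage_range *out)
{
	uintptr_t end = start + len;
	uintptr_t hstart = (start + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
	uintptr_t hend = end & ~(HPAGE_SIZE - 1);

	if (hstart >= hend)
		return -1;	/* range spans no full hugepage */

	out->start = hstart;
	out->end = hend;
	return 0;
}
```

    A page-aligned range that straddles hugepage boundaries is therefore
    reduced to the full hugepages inside it, and a range covering no full
    hugepage is rejected.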
    
    All non-resident pages covered by the range will first be
    swapped/faulted-in, before being internally copied onto a freshly
    allocated hugepage.  Unmapped pages will have their data directly
    initialized to 0 in the new hugepage.  However, for every eligible
    hugepage aligned/sized region to-be collapsed, at least one page must
    currently be backed by memory (a PMD covering the address range must
    already exist).
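
    From userspace, the backing requirement could be exercised roughly as
    follows.  This is a sketch under assumptions: 2 MiB hugepages, and
    MADV_COLLAPSE defined as 25 (the asm-generic value) when the libc
    headers predate it.  collapse_sparse_region() is a hypothetical
    helper name.  On kernels without MADV_COLLAPSE the call is expected
    to fail with EINVAL:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: asm-generic value */
#endif

/* Assumption: 2 MiB PMD-sized hugepages. */
#define HPAGE_SIZE ((size_t)2 << 20)

/*
 * Map anonymous memory, touch a single native page so that a PMD
 * covering the hugepage exists, then request a synchronous collapse.
 * Returns 0 on success or the errno value of the failing call.
 */
static int collapse_sparse_region(void)
{
	size_t len = 2 * HPAGE_SIZE;	/* room for one aligned hugepage */
	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *hpage;
	int err;

	if (map == MAP_FAILED)
		return errno;

	/* First hugepage-aligned address inside the mapping. */
	hpage = (char *)(((uintptr_t)map + HPAGE_SIZE - 1) &
			 ~(uintptr_t)(HPAGE_SIZE - 1));

	hpage[0] = 'x';		/* back one page; the rest stay unmapped */

	err = madvise(hpage, HPAGE_SIZE, MADV_COLLAPSE) ? errno : 0;

	/* Contents are the same whether or not the collapse succeeded. */
	assert(hpage[0] == 'x');
	assert(hpage[HPAGE_SIZE - 1] == 0);

	munmap(map, len);
	return err;
}
```

    The single write is what satisfies the "at least one page" rule;
    the never-touched pages read back as zero either way.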
    
    Allocation for the new hugepage may enter direct reclaim and/or
    compaction, regardless of VMA flags.  When the system has multiple NUMA
    nodes, the hugepage will be allocated from the node providing the most
    native pages.  This operation operates on the current state of the
    specified process and makes no persistent changes or guarantees on how
    pages will be mapped, constructed, or faulted in the future.
    
    Return Value
    
    If all hugepage-sized/aligned regions covered by the provided range were
    either successfully collapsed, or were already PMD-mapped THPs, this
    operation will be deemed successful.  On success, process_madvise(2)
    returns the number of bytes advised, and madvise(2) returns 0.  Otherwise, -1
    is returned and errno is set to indicate the error for the most-recently
    attempted hugepage collapse.  Note that many failures might have occurred,
    since the operation may continue to collapse in the event a single
    hugepage-sized/aligned region fails.
    
    	ENOMEM	Memory allocation failed or VMA not found
    	EBUSY	Memcg charging failed
    	EAGAIN	Required resource temporarily unavailable.  Trying again
    		might succeed.
    	EINVAL	Other error: No PMD found, subpage doesn't have Present
    		bit set, "Special" page not backed by struct page, VMA
    		incorrectly sized, address not page-aligned, ...
    
    Most notable here are ENOMEM and EBUSY (new to madvise), which are
    intended to provide the caller with actionable feedback so they may take
    an appropriate fallback measure.
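
    One way a caller might act on these codes is sketched below.  The
    enum, its values, and collapse_action_for() are hypothetical; the
    fallback policy is an illustrative assumption, not something this
    patch prescribes.  Since errno only reflects the most recently
    attempted hugepage, the result is a hint for the range as a whole:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical caller-side policy, not defined by the kernel. */
enum collapse_action {
	COLLAPSE_DONE,		/* range is backed by PMD-mapped THPs */
	COLLAPSE_RETRY,		/* transient failure, retrying may succeed */
	COLLAPSE_BACK_OFF,	/* allocation or memcg charge failed */
	COLLAPSE_GIVE_UP,	/* range not eligible, keep native pages */
};

/*
 * Map the result of madvise(addr, len, MADV_COLLAPSE) to an action.
 * ret is the madvise(2) return value, err the errno observed after it.
 */
static enum collapse_action collapse_action_for(int ret, int err)
{
	assert(ret == 0 || ret == -1);	/* madvise(2) returns 0 or -1 */

	if (ret == 0)
		return COLLAPSE_DONE;

	switch (err) {
	case EAGAIN:
		return COLLAPSE_RETRY;
	case ENOMEM:	/* may also mean the VMA was not found */
	case EBUSY:
		return COLLAPSE_BACK_OFF;
	default:	/* EINVAL and anything unexpected */
		return COLLAPSE_GIVE_UP;
	}
}
```

    A caller would typically invoke this right after the madvise(2)
    call, passing the saved errno.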
    
    Use Cases
    
    Immediate users of this new functionality are malloc() implementations
    that manage memory in hugepage-sized chunks, but sometimes subrelease
    memory back to the system in native-sized chunks via MADV_DONTNEED,
    zapping the PMD.  Later, when the memory is hot, the implementation could
    madvise(MADV_COLLAPSE) to re-back the memory by THPs to regain hugepage
    coverage and dTLB performance.  TCMalloc is such an implementation that
    could benefit from this[2].
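
    The subrelease/re-collapse cycle could look roughly like this in an
    allocator.  This is a sketch only: struct chunk and the chunk_*()
    helpers are hypothetical and do not reflect TCMalloc's actual code;
    2 MiB hugepages and MADV_COLLAPSE == 25 (asm-generic) are assumed:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumption: asm-generic value */
#endif

/* Assumption: 2 MiB PMD-sized hugepages. */
#define HPAGE_SIZE ((size_t)2 << 20)

/* Hypothetical allocator chunk: one hugepage-aligned, hugepage-sized area. */
struct chunk {
	void *map;		/* raw mapping, kept for munmap() */
	size_t map_len;
	char *base;		/* hugepage-aligned start of the chunk */
};

/* Reserve a chunk; returns 0 on success, -1 on failure. */
static int chunk_alloc(struct chunk *c)
{
	c->map_len = 2 * HPAGE_SIZE;
	c->map = mmap(NULL, c->map_len, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (c->map == MAP_FAILED)
		return -1;
	c->base = (char *)(((uintptr_t)c->map + HPAGE_SIZE - 1) &
			   ~(uintptr_t)(HPAGE_SIZE - 1));
	return 0;
}

/* Subrelease one native page of the chunk back to the system. */
static int chunk_subrelease(struct chunk *c, size_t page_idx)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);

	return madvise(c->base + page_idx * psz, psz, MADV_DONTNEED);
}

/* Once the chunk is hot again, ask for THP backing; 0 or errno. */
static int chunk_recollapse(struct chunk *c)
{
	return madvise(c->base, HPAGE_SIZE, MADV_COLLAPSE) ? errno : 0;
}
```

    If chunk_recollapse() fails, the chunk simply stays mapped by native
    pages, so the allocator can retry later or leave it as is.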
    
    Only privately-mapped anon memory is supported for now, but additional
    support for file, shmem, and HugeTLB high-granularity mappings[2] is
    expected.  File and tmpfs/shmem support would permit:
    
    * Backing executable text by THPs.  Current support provided by
      CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
      might impair services from serving at their full rated load after
      (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
      immediately realize iTLB performance prevent page sharing and demand
      paging, and losing either increases the steady-state memory footprint.  With
      MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
      and lower RAM footprints.
    * Backing guest memory by hugepages after the memory contents have been
      migrated in native-page-sized chunks to a new host, in a
      userfaultfd-based live-migration stack.
    
    [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
    [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
    
    [jrdr.linux@gmail.com: avoid possible memory leak in failure path]
      Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
    [zokeefe@google.com: add missing kfree() to madvise_collapse()]
      Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
      Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
    [zokeefe@google.com: delay computation of hpage boundaries until use]
      Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
    Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
    Signed-off-by: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
    Suggested-by: David Rientjes <rientjes@google.com>
    Cc: Alex Shi <alex.shi@linux.alibaba.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Arnd Bergmann <arnd@arndb.de>
    Cc: Axel Rasmussen <axelrasmussen@google.com>
    Cc: Chris Kennelly <ckennelly@google.com>
    Cc: Chris Zankel <chris@zankel.net>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Helge Deller <deller@gmx.de>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
    Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Matt Turner <mattst88@gmail.com>
    Cc: Max Filippov <jcmvbkbc@gmail.com>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Minchan Kim <minchan@kernel.org>
    Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
    Cc: Pavel Begunkov <asml.silence@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
    Cc: SeongJae Park <sj@kernel.org>
    Cc: Song Liu <songliubraving@fb.com>
    Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zi Yan <ziy@nvidia.com>
    Cc: Dan Carpenter <dan.carpenter@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>