    mm: support multi-size THP numa balancing
    Anonymous page allocation already supports multi-size THP (mTHP), but
    NUMA balancing still prohibits mTHP migration even when the mapping is
    exclusive, which is unreasonable.
    
    Allow scanning mTHP:
    Commit 859d4adc ("mm: numa: do not trap faults on shared data section
    pages") skips NUMA migration of shared CoW pages to avoid migrating
    shared data segments. In addition, commit 80d47f5d ("mm: don't try to
    NUMA-migrate COW pages that have other uses") changed the check to use
    page_count() to avoid migrating GUP pages, which also skips mTHP NUMA
    scanning. In theory, we can use folio_maybe_dma_pinned() to detect the
    GUP case; although a GUP race remains, that issue seems to have been
    resolved by commit 80d47f5d. Meanwhile, use folio_likely_mapped_shared()
    to skip shared CoW pages, even though it is not a precise sharer count.
    To check whether a folio is shared, ideally we would verify that every
    page is mapped by the same process, but doing that seems expensive, and
    using the estimated mapcount appears to work when running the autonuma
    benchmark.
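
    A minimal sketch of the check described above, assuming the upstream
    folio helpers folio_likely_mapped_shared() and folio_maybe_dma_pinned();
    the wrapper folio_should_skip_numa_scan() is an illustrative name only,
    and the actual diff places these checks directly in the scanning path:

    #include <linux/mm.h>

    /*
     * Illustrative helper (not the actual diff): should NUMA balancing
     * skip scanning/migrating this folio?
     */
    static bool folio_should_skip_numa_scan(struct folio *folio)
    {
            /*
             * Skip likely-shared CoW folios. This is an estimate rather
             * than a precise sharer count, but it appears to work well
             * enough for the autonuma benchmark.
             */
            if (folio_likely_mapped_shared(folio))
                    return true;

            /*
             * Skip folios that may be DMA-pinned via GUP; migrating them
             * would pull the physical page out from under the pin.
             */
            if (folio_maybe_dma_pinned(folio))
                    return true;

            return false;
    }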
    
    Allow migrating mTHP:
    As mentioned in the previous thread [1], large folios (including THP)
    are more susceptible to false-sharing issues among threads than 4K base
    pages, leading to folios ping-ponging back and forth during NUMA
    balancing, which is currently not easy to resolve. Therefore, as a
    start to supporting mTHP NUMA balancing, follow the PMD-mapped THP's
    strategy: reuse the 2-stage filter in should_numa_migrate_memory() to
    check whether the mTHP is being heavily contended among threads (by
    checking the CPU id and pid of the last access), which avoids false
    sharing to some degree (see the sketch after this paragraph). Likewise,
    restore all PTE mappings of a large folio upon its first hint page
    fault, again following the PMD-mapped THP's strategy. In the future,
    the NUMA balancing algorithm can be further optimized to avoid the
    false-sharing issue with large folios as much as possible.
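
    A hedged sketch of the reused 2-stage filter, simplified from the logic
    in should_numa_migrate_memory(); cpu_pid_to_cpupid(),
    folio_xchg_last_cpupid(), cpupid_pid_unset() and cpupid_match_pid() are
    existing kernel helpers, while mthp_two_stage_filter() is a
    hypothetical name and the real code performs additional node and
    fault-group checks:

    #include <linux/mm.h>
    #include <linux/sched.h>

    /*
     * Simplified 2-stage filter (sketch, not the actual diff): stash the
     * CPU id and pid of this hint fault in the folio, and only allow
     * migration when the previous hint fault came from the same task.
     * A folio bouncing between threads (false sharing) therefore stays
     * put.
     */
    static bool mthp_two_stage_filter(struct folio *folio, int dst_cpu)
    {
            int this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
            int last_cpupid = folio_xchg_last_cpupid(folio, this_cpupid);

            /* Stage 1: no history yet, record this access and wait. */
            if (cpupid_pid_unset(last_cpupid))
                    return false;

            /* Stage 2: same task twice in a row, treat as private. */
            return cpupid_match_pid(current, last_cpupid);
    }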
    
    Performance data:
    Machine environment: 2 nodes, 128 cores, Intel(R) Xeon(R) Platinum
    Base: 2024-03-25 mm-unstable branch
    Test: autonuma-benchmark with mTHP enabled
    
    mTHP:16K
    Benchmark            Base       Patched
    numa01               224.70     143.48
    numa01_THREAD_ALLOC  118.05      47.43
    numa02                13.45       9.29
    numa02_SMT            14.80       7.50
    
    mTHP:64K
    Benchmark            Base       Patched
    numa01               216.15     114.40
    numa01_THREAD_ALLOC  115.35      47.41
    numa02                13.24       9.25
    numa02_SMT            14.67       7.34
    
    mTHP:128K
    Benchmark            Base       Patched
    numa01               205.13     144.45
    numa01_THREAD_ALLOC  112.93      41.88
    numa02                13.16       9.18
    numa02_SMT            14.81       7.49
    
    [1] https://lore.kernel.org/all/20231117100745.fnpijbk4xgmals3k@techsingularity.net/
    
    [baolin.wang@linux.alibaba.com: v3]
      Link: https://lkml.kernel.org/r/c33a5c0b0a0323b1f8ed53772f50501f4b196e25.1712132950.git.baolin.wang@linux.alibaba.com
    Link: https://lkml.kernel.org/r/d28d276d599c26df7f38c9de8446f60e22dd1950.1711683069.git.baolin.wang@linux.alibaba.com
    Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Ryan Roberts <ryan.roberts@arm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>