    mm: fix up numa read-only thread grouping logic · 53da3bc2
    Linus Torvalds authored
    Dave Chinner reported that commit 4d942466 ("mm: convert
    p[te|md]_mknonnuma and remaining page table manipulations") slowed down
    his xfsrepair test enormously.  In particular, it was using more system
    time due to extra TLB flushing.
    
    The ultimate reason turns out to be how the change to use the regular
    page table accessor functions broke the NUMA grouping logic.  The old
    special mknuma/mknonnuma code accessed the page table present bit and
    the magic NUMA bit directly, while the new code just changes the page
    protections using PROT_NONE and the regular vma protections.
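
    As a rough sketch of that difference (simplified, and not the exact code
    in either kernel version - the helper names are real, the surrounding
    context is not):

    	/* Old world: a dedicated helper that only toggles the present and
    	 * _PAGE_NUMA bits, leaving the rest of the pte alone. */
    	pte = pte_mknuma(pte);

    	/* New world: the NUMA hinting entry comes from the generic
    	 * protection-change path, which rebuilds the protection bits
    	 * from the target pgprot. */
    	pte = pte_modify(pte, PAGE_NONE);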
    
    That sounds equivalent, and from a fault standpoint it really is, but a
    subtle side effect is that the *other* protection bits of the page table
    entries also change.  And the code to decide how to group the NUMA
    entries together used the writable bit to decide whether a particular
    page was likely to be shared read-only or not.
    
    And with the change to make the NUMA handling use the regular permission
    setting functions, that writable bit was basically always cleared for
    private mappings due to COW.  So even if the page actually ends up being
    written to in the end, the NUMA balancing would act as if it was always
    shared RO.
    
    This code is a heuristic anyway, so the fix - at least for now - is to
    check whether the page is dirty rather than writable, since the dirty
    bit is not affected by protection changes.
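
    Concretely, the grouping decision is a single test in the NUMA fault
    path (shown here in simplified pte form; the huge page path does the
    same thing with the pmd helpers):

    	/* Before: misfires, because the PROT_NONE/COW protection updates
    	 * clear the writable bit even for pages that do get written to. */
    	if (!pte_write(pte))
    		flags |= TNF_NO_GROUP;

    	/* After: approximate "shared read-only" with the dirty bit, which
    	 * protection changes leave alone. */
    	if (!pte_dirty(pte))
    		flags |= TNF_NO_GROUP;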
    
    NOTE! This also adds a FIXME comment to revisit this issue:
    
    Not only should we probably revisit the whole "is this a shared
    read-only page" heuristic (we might want to take the vma permissions
    into account and base this more on those than the per-page ones, and
    also look at whether the particular access that triggers it is a write
    or not), but the whole COW issue shows that we should think about the
    NUMA fault handling some more.
    
    For example, maybe we should do the early-COW thing that a regular fault
    does.  Or maybe we should accept that while using the same bits as
    PROTNONE was a good thing (and got rid of the special NUMA bit), we
    might still want to just preserve the other protection bits across NUMA
    faulting.
    
    Those are bigger questions, left for later.  This just fixes up the
    heuristic so that it at least approximates working again.  More analysis
    and work needed.
    Reported-by: Dave Chinner <david@fromorbit.com>
    Tested-by: Mel Gorman <mgorman@suse.de>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Ingo Molnar <mingo@kernel.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>