    mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes · b27abacc
    Dave Hansen authored
    Patch series "Introduce multi-preference mempolicy", v7.
    
    This patch series introduces the concept of the MPOL_PREFERRED_MANY
    mempolicy.  This mempolicy mode can be used with either the
    set_mempolicy(2) or mbind(2) interfaces.  Like the MPOL_PREFERRED
    interface, it allows an application to set a preference for nodes which
    will fulfil memory allocation requests.  Unlike the MPOL_PREFERRED mode,
    it takes a set of nodes.  Like the MPOL_BIND interface, it works over a
    set of nodes.  Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the
    OOM killer if those preferred nodes are not available.
    
    Along with these patches are patches for libnuma, numactl, numademo, and
    memhog.  They still need some polish, but can be found here:
    https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
    This allows a new usage: `numactl -P 0,3,4`
    
    The goal of the new mode is to enable some use-cases when using tiered memory
    usage models which I've lovingly named.
    
    1a. The Hare - The interconnect is fast enough to meet bandwidth and
        latency requirements allowing preference to be given to all nodes with
        "fast" memory.
    1b. The Indiscriminate Hare - An application knows it wants fast
        memory (or perhaps slow memory), but doesn't care which node it runs
        on.  The application can prefer a set of nodes and then xpu bind to
        the local node (cpu, accelerator, etc).  This reverses how nodes are
        chosen today, where the kernel attempts to use memory local to the CPU
        whenever possible.  Instead, this will attempt to use the accelerator
        local to the memory.
    2.  The Tortoise - The administrator (or the application itself) is
        aware it only needs slow memory, and so can prefer that.
    
    Much of this is almost achievable with the bind interface, but the bind
    interface suffers from an inability to fall back to another set of nodes
    if binding fails to all nodes in the nodemask.
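
    The difference in fallback behaviour can be illustrated with a toy
    model.  This is only a sketch: toy_pick_node() and the toy_policy enum
    are hypothetical names, and the scan is in plain index order, whereas
    the kernel walks its own zonelists.  It assumes fewer nodes than there
    are bits in an unsigned long.

```c
#include <stddef.h>

/* Toy policy kinds for this sketch only; not the kernel's MPOL_* values. */
enum toy_policy { TOY_BIND, TOY_PREFERRED_MANY };

/*
 * Pick a node for one allocation in a toy model.
 * free_pages[i] is the number of free pages on node i; mask has bit i set
 * for every node named in the policy.  Returns the chosen node, or -1 when
 * the allocation cannot be satisfied.
 */
static int toy_pick_node(enum toy_policy policy, unsigned long mask,
                         const unsigned long *free_pages, size_t nr_nodes)
{
    size_t i;

    /* First pass: only nodes inside the policy's nodemask. */
    for (i = 0; i < nr_nodes; i++)
        if ((mask & (1UL << i)) && free_pages[i] > 0)
            return (int)i;

    /* Bind-like behaviour: never leave the nodemask. */
    if (policy == TOY_BIND)
        return -1;

    /* Prefer-many-like behaviour: fall back to any node with memory. */
    for (i = 0; i < nr_nodes; i++)
        if (free_pages[i] > 0)
            return (int)i;

    return -1;
}
```

    With nodes 0 and 1 preferred but empty, the bind-like policy fails
    while the prefer-many-like policy lands on another node.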
    
    Like MPOL_BIND, a nodemask is given.  Inherently this removes ordering
    from the preference.
    
    > /* Set first two nodes as preferred in an 8 node system. */
    > const unsigned long nodes = 0x3;
    > set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
    
    > /* Mimic interleave policy, but have fallback. */
    > const unsigned long nodes = 0xaa;
    > set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8);
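
    A fuller userspace sketch of the snippets above follows.  The helper
    names nodemask_from_ids() and prefer_nodes() are hypothetical, and the
    numeric value 5 for the new mode is an assumption about the uapi enum
    that should be checked against the installed headers.  The raw syscall
    is used so that the sketch does not depend on a patched libnuma.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: numeric value of MPOL_PREFERRED_MANY in the uapi enum. */
#define SKETCH_MPOL_PREFERRED_MANY 5

/*
 * Build a single-word nodemask with one bit per listed node ID.
 * IDs that do not fit in an unsigned long are skipped.
 */
static unsigned long nodemask_from_ids(const int *ids, size_t n)
{
    unsigned long mask = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (ids[i] >= 0 && (size_t)ids[i] < 8 * sizeof(mask))
            mask |= 1UL << ids[i];
    return mask;
}

/*
 * Ask the kernel to prefer the nodes in mask for the calling thread.
 * Returns 0 on success, or -1 with errno set (for example EINVAL on a
 * kernel that does not accept the mode).
 */
static long prefer_nodes(unsigned long mask)
{
    return syscall(SYS_set_mempolicy, SKETCH_MPOL_PREFERRED_MANY,
                   &mask, 8 * sizeof(mask));
}
```

    A caller would typically treat a failure of prefer_nodes() as
    non-fatal and continue with the default policy.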
    
    Some internal discussion took place around the interface. There are two
    alternatives which we have discussed, plus one I stuck in:
    
    1. Ordered list of nodes.  Currently it's believed that the added
       complexity is not needed for expected use cases.
    2. A flag for bind to allow falling back to other nodes.  This
       confuses the notion of binding and is less flexible than the current
       solution.
    3. Create flags or new modes that help with some ordering.  This
       offers both a friendlier API as well as a solution for more customized
       usage.  It's unknown if it's worth the complexity to support this.
       Here is sample code for how this might work:
    
    > // Prefer specific nodes for something wacky
    > set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);
    >
    > // Default
    > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
    > // which is the same as
    > set_mempolicy(MPOL_DEFAULT, NULL, 0);
    >
    > // The Hare
    > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);
    >
    > // The Tortoise
    > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);
    >
    > // Prefer the fast memory of the first two sockets
    > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);
    >
    
    This patch (of 5):
    
    The NUMA APIs currently allow passing in a "preferred node" as a single
    bit set in a nodemask.  If more than one bit is set, bits after the first
    are ignored.
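
    The existing single-node behaviour can be modelled in userspace as
    taking the lowest set bit of the mask.  This is an illustrative sketch
    only: effective_preferred_node() is a hypothetical helper, not a
    kernel function, and it returns -1 for an empty mask simply to mean
    "no explicit preference".

```c
/*
 * Return the node that a single-preference policy would honour for the
 * given mask: the lowest-numbered set bit.  All higher bits are ignored.
 * Returns -1 when the mask is empty.
 */
static int effective_preferred_node(unsigned long mask)
{
    int node = 0;

    if (mask == 0)
        return -1;

    /* Shift until the lowest set bit reaches position 0. */
    while (!(mask & 1UL)) {
        mask >>= 1;
        node++;
    }
    return node;
}
```

    For a mask of 0x6 (nodes 1 and 2) only node 1 takes effect, which is
    the limitation the new mode is meant to lift.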
    
    This single node is generally OK for location-based NUMA where memory
    being allocated will eventually be operated on by a single CPU.  However,
    in systems with multiple memory types, folks want to target a *type* of
    memory instead of a location.  For instance, someone might want some
    high-bandwidth memory but does not care about the CPU next to which it is
    allocated.  Or, they want a cheap, high capacity allocation and want to
    target all NUMA nodes which have persistent memory in volatile mode.  In
    both of these cases, the application wants to target a *set* of nodes, but
    does not want strict MPOL_BIND behavior as that could lead to OOM killer
    or SIGSEGV.
    
    So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes
    requirement.  This is not a pie-in-the-sky dream for an API.  This was a
    response to a specific ask of more than one group at Intel.  Specifically:
    
    1. There are existing libraries that target memory types such as
       https://github.com/memkind/memkind.  These are known to suffer from
       SIGSEGVs when memory is low on targeted memory "kinds" that span more
       than one node.  The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an
       example of this.
    
    2. Volatile-use persistent memory users want to have a memory policy
       which is targeted at either "cheap and slow" (PMEM) or "expensive and
       fast" (DRAM).  However, they do not want to experience allocation
       failures when the targeted type is unavailable.
    
    3. Allocate-then-run.  Generally, we let the process scheduler decide
       on which physical CPU to run a task.  That location provides a default
       allocation policy, and memory availability is not generally considered
       when placing tasks.  For situations where memory is valuable and
       constrained, some users want to allocate memory first, *then* allocate
       close compute resources to the allocation.  This is the reverse of the
       normal (CPU) model.  Accelerators such as GPUs that operate on
       core-mm-managed memory are interested in this model.
    
    A check is added in sanitize_mpol_flags() to reject the 'prefer_many'
    policy for now.  It will be removed in a later patch once all
    implementations for 'prefer_many' are ready, as suggested by Michal Hocko.
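
    The effect of such a check can be sketched as a userspace model.  This
    is not the kernel's code: sk_sanitize_mode() and every SK_* name are
    hypothetical, and the enum order and flag mask are assumptions meant
    to mirror the uapi mempolicy header.

```c
#include <errno.h>

/* Assumption: same order as the uapi mode enum, with the new mode last. */
enum {
    SK_MPOL_DEFAULT,
    SK_MPOL_PREFERRED,
    SK_MPOL_BIND,
    SK_MPOL_INTERLEAVE,
    SK_MPOL_LOCAL,
    SK_MPOL_PREFERRED_MANY,
    SK_MPOL_MAX
};

/* Assumption: optional mode flags occupy the top three bits of 16. */
#define SK_MPOL_MODE_FLAGS 0xe000

/*
 * Split the user-supplied value into mode and flags, then reject unknown
 * modes and, for now, the prefer-many mode itself.
 * Returns 0 if the mode is accepted, -EINVAL otherwise.
 */
static int sk_sanitize_mode(int *mode, unsigned short *flags)
{
    *flags = (unsigned short)(*mode & SK_MPOL_MODE_FLAGS);
    *mode &= ~SK_MPOL_MODE_FLAGS;

    if ((unsigned int)*mode >= (unsigned int)SK_MPOL_MAX ||
        *mode == SK_MPOL_PREFERRED_MANY)
        return -EINVAL;
    return 0;
}
```

    Under this model, userspace asking for the new mode would see EINVAL
    until the temporary rejection is dropped.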
    
    [mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling]
    
    Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com
    Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
    Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com
    
    Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
    Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Signed-off-by: Feng Tang <feng.tang@intel.com>
    Cc: Michal Hocko <mhocko@kernel.org>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Randy Dunlap <rdunlap@infradead.org>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andi Kleen <ak@linux.intel.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Huang Ying <ying.huang@intel.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>