Commit fa3bea4e authored by Gregory Price, committed by Andrew Morton

mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving

When a system has multiple NUMA nodes and becomes bandwidth hungry, using
the current MPOL_INTERLEAVE can be a wise option.

However, if those NUMA nodes consist of different types of memory such as
socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin based
interleave policy does not optimally distribute data to make use of their
different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows
each NUMA node's bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes.  Weighted interleave
allows for proportional distribution of memory across multiple NUMA nodes,
preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0) and 2 memory nodes (0,1) with
bandwidths of (100GB/s, 50GB/s) respectively, the appropriate weight
distribution is (2:1).

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches the
behavior of standard 1:1 round-robin interleave.  An extension will be
added in the future to allow default values to be registered at kernel and
device bringup time.
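
As a rough illustration, the 2:1 weights from the earlier example could be
set from userspace as in the sketch below.  The per-node `nodeN` file names
are assumed from the companion sysfs patch in this series rather than
defined by this patch itself:

    /*
     * Hedged sketch, not part of this patch: set per-node interleave
     * weights by writing the sysfs files.  The "nodeN" file names are
     * an assumption based on the companion sysfs patch.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static int set_node_weight(int node, unsigned int weight)
    {
            char path[128];
            FILE *f;

            snprintf(path, sizeof(path),
                     "/sys/kernel/mm/mempolicy/weighted_interleave/node%d",
                     node);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%u\n", weight);
            return fclose(f);
    }

    int main(void)
    {
            /* 100GB/s DRAM node vs 50GB/s CXL node -> 2:1 weights */
            if (set_node_weight(0, 2) || set_node_weight(1, 1)) {
                    perror("set_node_weight");
                    return EXIT_FAILURE;
            }
            return EXIT_SUCCESS;
    }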

The policy allocates a number of pages equal to the set weights.  For
example, if the weights are (2,1), then 2 pages will be allocated on node0
for every 1 page allocated on node1.

The new flag MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
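
As a hedged usage sketch, a task-wide policy could be selected as below.
It assumes a libnuma `<numaif.h>` that may not yet define
MPOL_WEIGHTED_INTERLEAVE, so the value is taken from the uapi enum
ordering added by this patch (build with -lnuma):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <numaif.h>

    #ifndef MPOL_WEIGHTED_INTERLEAVE
    #define MPOL_WEIGHTED_INTERLEAVE 6  /* follows MPOL_PREFERRED_MANY in the uapi enum */
    #endif

    int main(void)
    {
            /* weighted interleave across nodes 0 and 1 for this task */
            unsigned long nodemask = (1UL << 0) | (1UL << 1);

            if (set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
                              sizeof(nodemask) * 8)) {
                    fprintf(stderr, "set_mempolicy: %s\n", strerror(errno));
                    return 1;
            }
            /* subsequent page allocations follow the sysfs weights */
            return 0;
    }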

Some high level notes about the pieces of weighted interleave:

current->il_prev:
    Tracks the node previously allocated from.

current->il_weight:
    The active weight of the current node (current->il_prev)
    When this reaches 0, current->il_prev is set to the next node
    and current->il_weight is set to the next weight.

weighted_interleave_nodes:
    Counts the number of allocations as they occur, and applies the
    weight for the current node.  When the weight reaches 0, switch
    to the next node.  Operates only on task->mempolicy (see the sketch
    after these notes).

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.
    Operates on VMA policies.

bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was scheduled for interleave via interleave_nodes, the
    current weight will be allocated first.

    Operates only on task->mempolicy.
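
The node-selection logic described in these notes can be sketched in
simplified userspace-style C as below.  This is illustrative only, not the
kernel code: the weight table, nodemask representation, and helpers are
stand-ins for the real mempolicy internals, and the real functions also
perform the allocation rather than just returning a node id.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NUMNODES 8

    /* stand-in for the sysfs-backed weight table; default weight is 1 */
    static unsigned char iw_table[MAX_NUMNODES] = { 1, 1, 1, 1, 1, 1, 1, 1 };

    struct task_il_state {
            short il_prev;           /* node previously allocated from */
            unsigned char il_weight; /* remaining weight on il_prev */
    };

    /* wrap-around scan for the next node allowed by the nodemask */
    static int next_node_in(int node, const bool *nodes)
    {
            for (int i = 1; i <= MAX_NUMNODES; i++) {
                    int n = (node + i) % MAX_NUMNODES;
                    if (nodes[n])
                            return n;
            }
            return node;
    }

    /* task-policy path: consume il_weight, then advance to the next node */
    static int weighted_interleave_nodes(struct task_il_state *t,
                                         const bool *nodes)
    {
            int node = t->il_prev;

            if (t->il_weight == 0) {
                    node = next_node_in(node, nodes);
                    t->il_prev = node;
                    t->il_weight = iw_table[node];
            }
            t->il_weight--;
            return node;
    }

    /* VMA-policy path: map an interleave index onto the weighted sequence */
    static int weighted_interleave_nid(const bool *nodes, unsigned long ilx)
    {
            unsigned long total = 0;
            int nid;

            for (nid = 0; nid < MAX_NUMNODES; nid++)
                    if (nodes[nid])
                            total += iw_table[nid];

            ilx %= total;  /* position within one full round of weights */
            for (nid = 0; nid < MAX_NUMNODES; nid++) {
                    if (!nodes[nid])
                            continue;
                    if (ilx < iw_table[nid])
                            return nid;
                    ilx -= iw_table[nid];
            }
            return -1;  /* not reached for a non-empty nodemask */
    }

    int main(void)
    {
            bool nodes[MAX_NUMNODES] = { true, true };
            struct task_il_state t = { .il_prev = 1, .il_weight = 0 };

            iw_table[0] = 2;
            iw_table[1] = 1;

            /* prints 0 0 1 0 0 1 for 2:1 weights */
            for (int i = 0; i < 6; i++)
                    printf("%d ", weighted_interleave_nodes(&t, nodes));
            printf("| nid(4)=%d\n", weighted_interleave_nid(nodes, 4));
            return 0;
    }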

One piece of complexity comes from a recent refactor that split the logic
to acquire the "ilx" (interleave index) of an allocation from the actual
application of the interleave.  If a call to
alloc_pages_mpol() were made with a weighted-interleave policy and ilx set
to NO_INTERLEAVE_INDEX, weighted_interleave_nodes() would operate on a VMA
policy - violating the description above.

An inspection of all callers of alloc_pages_mpol() shows that all external
callers set ilx to `0`, an index value, or will call get_vma_policy() to
acquire the ilx.

For example, mm/shmem.c may call into alloc_pages_mpol().  The call stacks
all set (pgoff_t ilx) or end up in `get_vma_policy()`.  This enforces the
`weighted_interleave_nodes()` and `weighted_interleave_nid()` policy
requirements (task/vma respectively).
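
Continuing the simplified sketch above, the enforced split can be pictured
as a small dispatch on the ilx value.  The function below is hypothetical
and the NO_INTERLEAVE_INDEX value is an assumption for illustration:

    #define NO_INTERLEAVE_INDEX (-1UL)  /* assumed sentinel value */

    /*
     * Hypothetical dispatch (not the actual kernel function): task
     * policies arrive with ilx == NO_INTERLEAVE_INDEX and take the
     * stateful per-task path; VMA policies arrive with a real index
     * (set by callers or derived via get_vma_policy()) and take the
     * stateless index-based path.
     */
    static int weighted_interleave_node(struct task_il_state *t,
                                        const bool *nodes, unsigned long ilx)
    {
            if (ilx == NO_INTERLEAVE_INDEX)
                    return weighted_interleave_nodes(t, nodes);  /* task */
            return weighted_interleave_nid(nodes, ilx);          /* VMA */
    }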

Link: https://lkml.kernel.org/r/20240202170238.90004-4-gregory.price@memverge.com
Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
parent 9685e6e3
@@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
    can fall back to all existing numa nodes.  This is effectively
    MPOL_PREFERRED allowed for a mask rather than a single node.

MPOL_WEIGHTED_INTERLEAVE
    This mode operates the same as MPOL_INTERLEAVE, except that
    interleaving behavior is executed based on weights set in
    /sys/kernel/mm/mempolicy/weighted_interleave/

    Weighted interleave allocates pages on nodes according to a
    weight.  For example if nodes [0,1] are weighted [5,2], 5 pages
    will be allocated on node0 for every 2 pages allocated on node1.

NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
......
@@ -1259,6 +1259,7 @@ struct task_struct {
    /* Protected by alloc_lock: */
    struct mempolicy *mempolicy;
    short il_prev;
    u8 il_weight;
    short pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCING
......
@@ -23,6 +23,7 @@ enum {
    MPOL_INTERLEAVE,
    MPOL_LOCAL,
    MPOL_PREFERRED_MANY,
    MPOL_WEIGHTED_INTERLEAVE,
    MPOL_MAX,   /* always last member of enum */
};
......