Commit cb5e4376 authored by Mike Rapoport, committed by Jonathan Corbet

docs/vm: numa_memory_policy.txt: convert to ReST format

Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
parent 16f9f7f9
.. _numa_memory_policy:

===================
Linux Memory Policy
===================

What is Linux Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system.  Linux has
...@@ -9,35 +15,36 @@ document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/cgroup-v1/cpusets.txt``),
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes.  Memory policies are a
programming interface that a NUMA-aware application can take advantage of.  When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority.  See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
    This policy is "hard coded" into the kernel.  It is the policy
    that governs all page allocations that aren't controlled by
    one of the more specific policy scopes discussed below.  When
    the system is "up and running", the system default policy will
    use "local allocation" described below.  However, during boot
    up, the system default policy will be set to interleave
    allocations across all nodes with "sufficient" memory, so as
    not to overload the initial boot node with boot-time
    allocations.

Task/Process Policy
    This is an optional, per-task policy.  When defined for a
    specific task, this policy controls all page allocations made
    by or on behalf of the task that aren't controlled by a more
    specific scope.  If a task does not define a task policy, then
    all page allocations that would have been controlled by the
    task policy "fall back" to the System Default Policy.

    The task policy applies to the entire address space of a task.  Thus,
    it is inheritable, and indeed is inherited, across both fork()
...@@ -58,56 +65,66 @@ most general to most specific:
    changes its task policy remain where they were allocated based on
    the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
    A "VMA" or "Virtual Memory Area" refers to a range of a task's
    virtual address space.  A task may define a specific policy for a range
    of its virtual address space.  See the Memory Policy APIs section,
    below, for an overview of the mbind() system call used to set a VMA
    policy.

    A VMA policy will govern the allocation of pages that back
    this region of the address space.  Any regions of the task's
    address space that don't have an explicit VMA policy will fall
    back to the task policy, which may itself fall back to the
    System Default Policy.

    VMA policies have a few complicating details:

    * VMA policy applies ONLY to anonymous pages.  These include
      pages allocated for anonymous segments, such as the task
      stack and heap, and any regions of the address space
      mmap()ed with the MAP_ANONYMOUS flag.  If a VMA policy is
      applied to a file mapping, it will be ignored if the mapping
      used the MAP_SHARED flag.  If the file mapping used the
      MAP_PRIVATE flag, the VMA policy will only be applied when
      an anonymous page is allocated on an attempt to write to the
      mapping--i.e., at Copy-On-Write.

    * VMA policies are shared between all tasks that share a
      virtual address space--a.k.a. threads--independent of when
      the policy is installed; and they are inherited across
      fork().  However, because VMA policies refer to a specific
      region of a task's address space, and because the address
      space is discarded and recreated on exec*(), VMA policies
      are NOT inheritable across exec().  Thus, only NUMA-aware
      applications may use VMA policies.

    * A task may install a new VMA policy on a sub-range of a
      previously mmap()ed region.  When this happens, Linux splits
      the existing virtual memory area into 2 or 3 VMAs, each with
      its own policy.

    * By default, VMA policy applies only to pages allocated after
      the policy is installed.  Any pages already faulted into the
      VMA range remain where they were allocated based on the
      policy at the time they were allocated.  However, since
      2.6.16, Linux supports page migration via the mbind() system
      call, so that page contents can be moved to match a newly
      installed policy.

Shared Policy
    Conceptually, shared policies apply to "memory objects" mapped
    shared into one or more tasks' distinct address spaces.  An
    application installs shared policies the same way as VMA
    policies--using the mbind() system call specifying a range of
    virtual addresses that map the shared object.  However, unlike
    VMA policies, which can be considered to be an attribute of a
    range of a task's address space, shared policies apply
    directly to the shared object.  Thus, all tasks that attach to
    the object share the policy, and all pages allocated for the
    shared object, by any task, will obey the shared policy.

    As of 2.6.22, only shared memory segments, created by shmget() or
    mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy.  When shared
...@@ -118,11 +135,12 @@ most general to most specific:
    Although hugetlbfs segments now support lazy allocation, their support
    for shared policy has not been completed.

    As mentioned above in :ref:`VMA policies <vma_policy>`,
    allocations of page cache pages for regular files mmap()ed
    with MAP_SHARED ignore any VMA policy installed on the virtual
    address range backed by the shared file mapping.  Rather,
    shared page cache pages, including pages backing private
    mappings that have not yet been written by the task, follow
    task policy, if any, else System Default Policy.

    The shared policy infrastructure supports different policies on subset
...@@ -135,24 +153,27 @@ most general to most specific:
    one or more ranges of the region.
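
The fallback chain between scopes described above can be sketched in a few
lines of C.  This is a simplified illustration, not kernel code: ``struct
policy``, ``system_default``, ``effective_policy()`` and ``fallback_demo()``
are hypothetical names, the mode values are arbitrary, and shared policy is
left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct mempolicy. */
struct policy {
    int mode;   /* illustrative value only */
};

/* The system default policy always exists ("hard coded"). */
static const struct policy system_default = { 0 };

/*
 * Pick the policy that governs an allocation: the most specific scope
 * that is set wins; NULL means "not set, fall back to the next scope".
 */
const struct policy *effective_policy(const struct policy *vma_pol,
                                      const struct policy *task_pol)
{
    if (vma_pol != NULL)
        return vma_pol;          /* VMA policy is the most specific */
    if (task_pol != NULL)
        return task_pol;         /* otherwise the task policy */
    return &system_default;      /* otherwise the system default */
}

/* Usage example: walks through each step of the fallback chain. */
void fallback_demo(void)
{
    struct policy vma = { 2 }, task = { 3 };

    assert(effective_policy(&vma, &task) == &vma);
    assert(effective_policy(NULL, &task) == &task);
    assert(effective_policy(NULL, NULL) == &system_default);
}
```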

Components of Memory Policies
-----------------------------

A Linux memory policy consists of a "mode", optional mode flags, and
an optional set of nodes.  The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy.  Details of this structure will be
discussed in context, below, as required to explain the behavior.

Linux memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
    This mode is only used in the memory policy APIs.  Internally,
    MPOL_DEFAULT is converted to the NULL memory policy in all
    policy scopes.  Any existing non-default policy will simply be
    removed when MPOL_DEFAULT is specified.  As a result,
    MPOL_DEFAULT means "fall back to the next most specific policy
    scope."

    For example, a NULL or default task policy will fall back to the
    system default policy.  A NULL or default vma policy will fall
...@@ -164,57 +185,63 @@ Components of Memory Policies
    It is an error for the set of nodes specified for this policy to
    be non-empty.

MPOL_BIND
    This mode specifies that memory must come from the set of
    nodes specified by the policy.  Memory will be allocated from
    the node in the set with sufficient free memory that is
    closest to the node where the allocation takes place.

MPOL_PREFERRED
    This mode specifies that the allocation should be attempted
    from the single node specified in the policy.  If that
    allocation fails, the kernel will search other nodes, in order
    of increasing distance from the preferred node based on
    information provided by the platform firmware.

    Internally, the Preferred policy uses a single node--the
    preferred_node member of struct mempolicy.  When the internal
    mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
    and the policy is interpreted as local allocation.  "Local"
    allocation policy can be viewed as a Preferred policy that
    starts at the node containing the CPU where the allocation
    takes place.

    It is possible for the user to specify that local allocation
    is always preferred by passing an empty nodemask with this
    mode.  If an empty nodemask is passed, the policy cannot use
    the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
    described below.

MPOL_INTERLEAVE
    This mode specifies that page allocations be interleaved, on a
    page granularity, across the nodes specified in the policy.
    This mode also behaves slightly differently, based on the
    context where it is used:

    For allocation of anonymous pages and shared memory pages,
    Interleave mode indexes the set of nodes specified by the
    policy using the page offset of the faulting address into the
    segment [VMA] containing the address modulo the number of
    nodes specified by the policy.  It then attempts to allocate a
    page, starting at the selected node, as if the node had been
    specified by a Preferred policy or had been selected by a
    local allocation.  That is, allocation will follow the per
    node zonelist.

    For allocation of page cache pages, Interleave mode indexes
    the set of nodes specified by the policy using a node counter
    maintained per task.  This counter wraps around to the lowest
    specified node after it reaches the highest specified node.
    This will tend to spread the pages out over the nodes
    specified by the policy based on the order in which they are
    allocated, rather than based on any page offset into an
    address range or file.  During system boot up, the temporary
    interleaved system default policy works in this mode.
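
The node selection for anonymous and shared memory pages under Interleave
mode (page offset modulo the number of nodes in the policy) can be
illustrated with a small user-space sketch.  It assumes the node set fits in
a single ``unsigned long`` bit mask; ``interleave_node()`` is a hypothetical
helper written for this example and is not the kernel's implementation.

```c
#include <assert.h>

/*
 * Return the node id chosen for the page at index page_off within its
 * segment, given a bit mask of policy nodes (bit n set = node n is in
 * the policy).  Returns -1 only if the mask is inconsistent.
 */
int interleave_node(unsigned long nodemask, unsigned long page_off)
{
    const int nbits = (int)(8 * sizeof(nodemask));
    unsigned long nnodes = 0;
    unsigned long target;
    int node;

    /* Count the nodes named by the policy. */
    for (node = 0; node < nbits; node++)
        if (nodemask & (1UL << node))
            nnodes++;
    assert(nnodes > 0);   /* interleave needs at least one node */

    /* Index into the set: page offset modulo number of nodes. */
    target = page_off % nnodes;

    /* Walk the set bits until the target-th one is reached. */
    for (node = 0; node < nbits; node++) {
        if (nodemask & (1UL << node)) {
            if (target == 0)
                return node;
            target--;
        }
    }
    return -1;
}
```

With a policy covering nodes 0, 2 and 3 (mask ``0xD``), consecutive page
offsets 0, 1, 2, 3 map to nodes 0, 2, 3, 0.  In the kernel the selected node
is only the starting point; the allocation then follows that node's zonelist.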

Linux memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
    This flag specifies that the nodemask passed by
    the user should not be remapped if the task or VMA's set of allowed
    nodes changes after the memory policy has been defined.
...@@ -242,7 +269,8 @@ Components of Memory Policies
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

MPOL_F_RELATIVE_NODES
    This flag specifies that the nodemask passed
    by the user will be mapped relative to the set of the task or VMA's
    set of allowed nodes.  The kernel stores the user-passed nodemask,
    and if the allowed nodes change, then that original nodemask will
...@@ -292,7 +320,8 @@ Components of Memory Policies
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field.  Internal interfaces, mpol_get()/mpol_put() increment and
...@@ -360,60 +389,62 @@ follows:
or by prefaulting the entire shared memory region into memory and locking
it down.  However, this might not be appropriate for all applications.

Memory Policy APIs
==================

Linux supports 3 system calls for controlling memory policy.  These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   The headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel.  The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

        long set_mempolicy(int mode, const unsigned long *nmask,
                                        unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'.  'nmask' points to a bit mask of node ids containing at least
'maxnode' ids.  Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.
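
As a rough, Linux-only illustration of combining a mode with a mode flag, the
sketch below invokes the system call through ``syscall(2)`` so that it builds
without the separate NUMA development headers.  The ``EX_``-prefixed macros
repeat values from <linux/mempolicy.h>; ``nodemask_add()`` and
``interleave_task_over()`` are hypothetical helpers, and the single-word node
mask is an assumption that only suits small node counts.  The exact meaning
of 'maxnode' is described in set_mempolicy(2).

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Values as defined in <linux/mempolicy.h>, repeated for this sketch. */
#define EX_MPOL_INTERLEAVE      3
#define EX_MPOL_F_STATIC_NODES  (1 << 15)

/* Return mask with the bit for node id `node` set. */
unsigned long nodemask_add(unsigned long mask, unsigned int node)
{
    assert(node < 8 * sizeof(mask));
    return mask | (1UL << node);
}

/*
 * Ask the kernel to interleave the calling task's future allocations
 * across the nodes in `mask`, without remapping the mask if the set of
 * allowed nodes later changes.  Returns 0 on success, or -1 with errno
 * set (for example when the kernel or a sandbox rejects the call).
 */
long interleave_task_over(unsigned long mask)
{
    return syscall(SYS_set_mempolicy,
                   (long)(EX_MPOL_INTERLEAVE | EX_MPOL_F_STATIC_NODES),
                   &mask,
                   (unsigned long)(8 * sizeof(mask)));
}
```

A caller might build the mask with ``nodemask_add(nodemask_add(0, 0), 1)``
and pass it to ``interleave_task_over()``, checking the return value and
``errno`` since the call can fail on systems without NUMA support.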

Get [Task] Memory Policy or Related Information::

        long get_mempolicy(int *mode,
                           const unsigned long *nmask, unsigned long maxnode,
                           void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.

Install VMA/Shared Policy for a Range of Task's Address Space::

        long mbind(void *start, unsigned long len, int mode,
                   const unsigned long *nmask, unsigned long maxnode,
                   unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments.  Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:
...@@ -428,8 +459,10 @@ containing the memory policy system call wrappers.  Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above.  For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
......