Commit 78567e2b authored by Linus Torvalds

Merge tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset isolation improvements

 - cpuset cgroup1 support is split into its own file behind the new
   config option CONFIG_CPUSETS_V1. This makes cpuset the second
   controller, after memcg, to make cgroup1 support optional

 - Handling of unavailable v1 controllers improved during cgroup1 mount
   operations

 - union_find applied to cpuset, making the code simpler and more
   efficient

 - Reduce spurious events in pids.events

 - Cleanups and other misc changes

 - Contains a merge of cgroup/for-6.11-fixes to receive cpuset fixes
   that further changes build upon

* tag 'cgroup-for-6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (34 commits)
  cgroup: Do not report unavailable v1 controllers in /proc/cgroups
  cgroup: Disallow mounting v1 hierarchies without controller implementation
  cgroup/cpuset: Expose cpuset filesystem with cpuset v1 only
  cgroup/cpuset: Move cpu.h include to cpuset-internal.h
  cgroup/cpuset: add sefltest for cpuset v1
  cgroup/cpuset: guard cpuset-v1 code under CONFIG_CPUSETS_V1
  cgroup/cpuset: rename functions shared between v1 and v2
  cgroup/cpuset: move v1 interfaces to cpuset-v1.c
  cgroup/cpuset: move validate_change_legacy to cpuset-v1.c
  cgroup/cpuset: move legacy hotplug update to cpuset-v1.c
  cgroup/cpuset: add callback_lock helper
  cgroup/cpuset: move memory_spread to cpuset-v1.c
  cgroup/cpuset: move relax_domain_level to cpuset-v1.c
  cgroup/cpuset: move memory_pressure to cpuset-v1.c
  cgroup/cpuset: move common code to cpuset-internal.h
  cgroup/cpuset: introduce cpuset-v1.c
  selftest/cgroup: Make test_cpuset_prs.sh deal with pre-isolated CPUs
  cgroup/cpuset: Account for boot time isolated CPUs
  cgroup/cpuset: remove use_parent_ecpus of cpuset
  cgroup/cpuset: remove fetch_xcpus
  ...
parents 2f27fce6 af000ce8
......@@ -533,10 +533,12 @@ cgroup namespace on namespace creation.
Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
achieved by not granting access to these files. For the second, the
kernel rejects writes to all files other than "cgroup.procs" and
"cgroup.subtree_control" on a namespace root from inside the
namespace.
achieved by not granting access to these files. For the second, files
outside the namespace should be hidden from the delegatee by the means
of at least mount namespacing, and the kernel rejects writes to all
files on a namespace root from inside the cgroup namespace, except for
those files listed in "/sys/kernel/cgroup/delegate" (including
"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.).
The end results are equivalent for both delegation types. Once
delegated, the user can build sub-hierarchy under the directory,
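The namespace-root write rule described in this hunk can be mirrored from userspace by checking a file name against the contents of "/sys/kernel/cgroup/delegate". The following is a minimal sketch only: the helper name delegate_list_contains() is made up for illustration, and the code assumes the file lists one name per line.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return true if 'name' appears as a whole line in 'list'.
 * Assumption: 'list' holds the text of /sys/kernel/cgroup/delegate,
 * with one delegatable file name per line.
 */
bool delegate_list_contains(const char *list, const char *name)
{
	size_t name_len = strlen(name);
	const char *line = list;

	while (*line) {
		const char *end = strchr(line, '\n');
		size_t len = end ? (size_t)(end - line) : strlen(line);

		/* Compare whole lines so "cgroup.thread" does not match "cgroup.threads". */
		if (len == name_len && memcmp(line, name, len) == 0)
			return true;
		if (!end)
			break;
		line = end + 1;
	}
	return false;
}
```

A caller would read the sysfs file into a buffer and pass it as 'list'. A name that is not found corresponds to a file whose writes the kernel rejects on a namespace root.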
......@@ -981,6 +983,14 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
nr_subsys_<cgroup_subsys>
Total number of live cgroup subsystems (e.g. memory
cgroup) at and beneath the current cgroup.
nr_dying_subsys_<cgroup_subsys>
Total number of dying cgroup subsystems (e.g. memory
cgroup) at and beneath the current cgroup.
cgroup.freeze
A read-write single value file which exists on non-root cgroups.
Allowed values are "0" and "1". The default is "0".
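A userspace reader could pick the new nr_subsys_<cgroup_subsys> and nr_dying_subsys_<cgroup_subsys> counters out of "cgroup.stat" roughly as sketched below. The helper name cgroup_stat_lookup() is hypothetical, and the code assumes each line has the form "<key> <integer>".

```c
#include <stdio.h>
#include <string.h>

/*
 * Look up 'key' in the text of a cgroup.stat file and store its value
 * in '*value'. Assumption: every line is "<key> <integer>".
 * Returns 0 on success, -1 if the key is missing or malformed.
 */
int cgroup_stat_lookup(const char *text, const char *key, long *value)
{
	size_t key_len = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* Require a space after the key so "nr_subsys" does not match "nr_subsys_memory". */
		if (strncmp(line, key, key_len) == 0 && line[key_len] == ' ')
			return sscanf(line + key_len, "%ld", value) == 1 ? 0 : -1;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

For example, a monitoring tool could compare nr_dying_subsys_memory across reads to spot memory cgroups that linger after deletion.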
......@@ -2940,8 +2950,8 @@ Deprecated v1 Core Features
- "cgroup.clone_children" is removed.
- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
at the root instead.
- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
"cgroup.stat" files at the root instead.
Issues with v1 and Rationales for v2
......
......@@ -49,6 +49,7 @@ Library functionality that is used throughout the kernel.
wrappers/atomic_t
wrappers/atomic_bitops
floating-point
union_find
Low level entry and exit
========================
......
.. SPDX-License-Identifier: GPL-2.0
====================
Union-Find in Linux
====================
:Date: June 21, 2024
:Author: Xavier <xavier_qy@163.com>
What is union-find, and what is it used for?
------------------------------------------------
Union-find is a data structure used to handle the merging and querying
of disjoint sets. The primary operations supported by union-find are:
Initialization: Resetting each element as an individual set, with
each set's initial parent node pointing to itself.
Find: Determine which set a particular element belongs to, usually by
returning a “representative element” of that set. This operation
is used to check if two elements are in the same set.
Union: Merge two sets into one.
As a data structure used to maintain sets (groups), union-find is commonly
utilized to solve problems related to offline queries, dynamic connectivity,
and graph theory. It is also a key component in Kruskal's algorithm for
computing the minimum spanning tree, which is crucial in scenarios like
network routing. Consequently, union-find is widely used. Additionally,
union-find has applications in symbolic computation, register allocation,
and more.
Space Complexity: O(n), where n is the number of nodes.
Time Complexity: Using path compression can reduce the time complexity of
the find operation, and using union by rank can reduce the time complexity
of the union operation. Together, these optimizations reduce the amortized
time complexity of each find and union operation to O(α(n)), where α(n) is
the inverse Ackermann function. This can be roughly considered constant
time for practical purposes.
This document covers use of the Linux union-find implementation. For more
information on the nature and implementation of union-find, see:
Wikipedia entry on union-find
https://en.wikipedia.org/wiki/Disjoint-set_data_structure
Linux implementation of union-find
-----------------------------------
Linux's union-find implementation resides in the file "lib/union_find.c".
To use it, "#include <linux/union_find.h>".
The union-find data structure is defined as follows::
struct uf_node {
struct uf_node *parent;
unsigned int rank;
};
In this structure, parent points to the parent node of the current node; a
root node points to itself. The rank field is an upper bound on the height of
the tree rooted at the node. During a union operation, the tree with the
smaller rank is attached under the tree with the larger rank to maintain
balance.
Initializing union-find
-----------------------
You can initialize a node either statically, with the UF_INIT_NODE() macro, or
at run time, with uf_node_init(). Both set the parent pointer to the node
itself and the rank to 0.
Example::
struct uf_node my_node = UF_INIT_NODE(my_node);
or
uf_node_init(&my_node);
Find the Root Node of union-find
--------------------------------
This operation is mainly used to determine whether two nodes belong to the same
set in the union-find. If they have the same root, they are in the same set.
During the find operation, path compression is performed to improve the
efficiency of subsequent find operations.
Example::
int connected;
struct uf_node *root1 = uf_find(&node_1);
struct uf_node *root2 = uf_find(&node_2);
if (root1 == root2)
connected = 1;
else
connected = 0;
Union Two Sets in union-find
----------------------------
To union two sets, uf_union() first finds their respective root nodes and then
links the root with the smaller rank under the root with the larger rank. If
the ranks are equal, the second root is linked under the first and the rank of
the first root is incremented.
Example::
uf_union(&node_1, &node_2);
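The three operations documented above can be tried outside the kernel. The following is a userspace sketch, not the kernel build: the structure and functions are copied from this commit's lib/union_find.c so the example compiles on its own, while struct item, count_sets() and example_count() are hypothetical names added for illustration.

```c
#include <stddef.h>

/* Userspace copy of the kernel structure, for illustration only. */
struct uf_node {
	struct uf_node *parent;
	unsigned int rank;
};

static void uf_node_init(struct uf_node *node)
{
	node->parent = node;
	node->rank = 0;
}

static struct uf_node *uf_find(struct uf_node *node)
{
	while (node->parent != node) {
		struct uf_node *parent = node->parent;

		node->parent = parent->parent;
		node = parent;
	}
	return node;
}

static void uf_union(struct uf_node *node1, struct uf_node *node2)
{
	struct uf_node *root1 = uf_find(node1);
	struct uf_node *root2 = uf_find(node2);

	if (root1 == root2)
		return;
	if (root1->rank < root2->rank) {
		root1->parent = root2;
	} else if (root1->rank > root2->rank) {
		root2->parent = root1;
	} else {
		root2->parent = root1;
		root1->rank++;
	}
}

/* Hypothetical user structure that embeds a union-find node. */
struct item {
	int id;
	struct uf_node node;
};

/* Count the disjoint sets among 'n' items: one per node that is its own root. */
size_t count_sets(struct item *items, size_t n)
{
	size_t i, sets = 0;

	for (i = 0; i < n; i++)
		if (uf_find(&items[i].node) == &items[i].node)
			sets++;
	return sets;
}

/* Five items merged as {0,1,2} and {3,4}; returns the number of sets. */
size_t example_count(void)
{
	struct item items[5];
	int i;

	for (i = 0; i < 5; i++) {
		items[i].id = i;
		uf_node_init(&items[i].node);
	}
	uf_union(&items[0].node, &items[1].node);
	uf_union(&items[1].node, &items[2].node);
	uf_union(&items[3].node, &items[4].node);
	return count_sets(items, 5);
}
```

Embedding the node in a larger structure is the same pattern the cpuset code uses in this commit, where struct cpuset gains a struct uf_node member.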
......@@ -49,6 +49,7 @@
generic-radix-tree
packing
this_cpu_ops
union_find
=======
......
.. SPDX-License-Identifier: GPL-2.0
.. include:: ../disclaimer-zh_CN.rst
:Original: Documentation/core-api/union_find.rst
=============================
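Union-Find in Linux
=============================
:Date: June 21, 2024
:Author: Xavier <xavier_qy@163.com>
What is union-find, and what is it used for?
--------------------------------------------
Union-find is a data structure for handling the merging and querying of disjoint sets. The main operations it supports are:
Initialization: make each element its own set, with each set's initial parent node pointing to itself.
Find: determine which set an element belongs to, usually by returning a "representative element" of that set. This operation is
used to decide whether two elements are in the same set.
Union: merge two sets into one.
As a data structure for maintaining sets (groups), union-find is commonly used for offline queries, dynamic connectivity, and
graph-theory problems. It is also the key to Kruskal's algorithm for computing minimum spanning trees; because minimum spanning
trees matter in scenarios such as network routing, union-find is widely used. It also has applications in symbolic computation,
register allocation, and other areas.
Space complexity: O(n), where n is the number of nodes.
Time complexity: path compression reduces the cost of the find operation and union by rank reduces the cost of the union
operation, so that the average time complexity of each find and union operation is only O(α(n)), where α(n) is the inverse
Ackermann function. Union-find operations can roughly be regarded as taking constant time.
This document covers how to use the Linux union-find implementation. For more on the nature and implementation of union-find, see:
Wikipedia entry on union-find
https://en.wikipedia.org/wiki/Disjoint-set_data_structure
Linux implementation of union-find
----------------------------------
Linux's union-find implementation is in the file "lib/union_find.c". To use it,
"#include <linux/union_find.h>".
The union-find data structure is defined as follows::
struct uf_node {
struct uf_node *parent;
unsigned int rank;
};
Here parent is the parent node of the current node and rank is the height of the current tree. During a union, the node with the
smaller rank is attached under the node with the larger rank to improve balance.
Initializing union-find
-----------------------
Initialization can be done either statically or through the initialization interface. It points the parent pointer at the node
itself and sets rank to 0.
Example::
struct uf_node my_node = UF_INIT_NODE(my_node);
uf_node_init(&my_node);
Finding the root node of union-find
-----------------------------------
This is mainly used to determine whether two nodes belong to the same set: if their roots are the same, they are in the same set.
Paths are compressed during the find to make later finds faster.
Example::
int connected;
struct uf_node *root1 = uf_find(&node_1);
struct uf_node *root2 = uf_find(&node_2);
if (root1 == root2)
connected = 1;
else
connected = 0;
Merging two union-find sets
---------------------------
To merge two intersecting union-find sets, their respective root nodes are found first, and then, according to the ranks of the
roots, the smaller one is linked under the larger one.
Example::
uf_union(&node_1, &node_2);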
......@@ -5736,9 +5736,12 @@ S: Maintained
T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
F: Documentation/admin-guide/cgroup-v1/cpusets.rst
F: include/linux/cpuset.h
F: kernel/cgroup/cpuset-internal.h
F: kernel/cgroup/cpuset-v1.c
F: kernel/cgroup/cpuset.c
F: tools/testing/selftests/cgroup/test_cpuset.c
F: tools/testing/selftests/cgroup/test_cpuset_prs.sh
F: tools/testing/selftests/cgroup/test_cpuset_v1_base.sh
CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
M: Johannes Weiner <hannes@cmpxchg.org>
......@@ -23606,6 +23609,15 @@ F: drivers/cdrom/cdrom.c
F: include/linux/cdrom.h
F: include/uapi/linux/cdrom.h
UNION-FIND
M: Xavier <xavier_qy@163.com>
L: linux-kernel@vger.kernel.org
S: Maintained
F: Documentation/core-api/union_find.rst
F: Documentation/translations/zh_CN/core-api/union_find.rst
F: include/linux/union_find.h
F: lib/union_find.c
UNIVERSAL FLASH STORAGE HOST CONTROLLER DRIVER
R: Alim Akhtar <alim.akhtar@samsung.com>
R: Avri Altman <avri.altman@wdc.com>
......
......@@ -210,6 +210,14 @@ struct cgroup_subsys_state {
* fields of the containing structure.
*/
struct cgroup_subsys_state *parent;
/*
* Keep track of total numbers of visible descendant CSSes.
* The total number of dying CSSes is tracked in
* css->cgroup->nr_dying_subsys[ssid].
* Protected by cgroup_mutex.
*/
int nr_descendants;
};
/*
......@@ -470,6 +478,12 @@ struct cgroup {
/* Private pointers for each registered subsystem */
struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
/*
* Keep track of total number of dying CSSes at and below this cgroup.
* Protected by cgroup_mutex.
*/
int nr_dying_subsys[CGROUP_SUBSYS_COUNT];
struct cgroup_root *root;
/*
......
......@@ -99,6 +99,7 @@ static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
const struct task_struct *tsk2);
#ifdef CONFIG_CPUSETS_V1
#define cpuset_memory_pressure_bump() \
do { \
if (cpuset_memory_pressure_enabled) \
......@@ -106,6 +107,9 @@ extern int cpuset_mems_allowed_intersects(const struct task_struct *tsk1,
} while (0)
extern int cpuset_memory_pressure_enabled;
extern void __cpuset_memory_pressure_bump(void);
#else
static inline void cpuset_memory_pressure_bump(void) { }
#endif
extern void cpuset_task_status_allowed(struct seq_file *m,
struct task_struct *task);
......@@ -113,7 +117,6 @@ extern int proc_cpuset_show(struct seq_file *m, struct pid_namespace *ns,
struct pid *pid, struct task_struct *tsk);
extern int cpuset_mem_spread_node(void);
extern int cpuset_slab_spread_node(void);
static inline int cpuset_do_page_mem_spread(void)
{
......@@ -246,11 +249,6 @@ static inline int cpuset_mem_spread_node(void)
return 0;
}
static inline int cpuset_slab_spread_node(void)
{
return 0;
}
static inline int cpuset_do_page_mem_spread(void)
{
return 0;
......
......@@ -1243,7 +1243,6 @@ struct task_struct {
/* Sequence number to catch updates: */
seqcount_spinlock_t mems_allowed_seq;
int cpuset_mem_spread_rotor;
int cpuset_slab_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS
/* Control Group info protected by css_set_lock: */
......
/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __LINUX_UNION_FIND_H
#define __LINUX_UNION_FIND_H
/**
* union_find.h - union-find data structure implementation
*
* This header provides functions and structures to implement the union-find
* data structure. The union-find data structure is used to manage disjoint
* sets and supports efficient union and find operations.
*
* See Documentation/core-api/union_find.rst for documentation and samples.
*/
struct uf_node {
struct uf_node *parent;
unsigned int rank;
};
/* This macro is used for static initialization of a union-find node. */
#define UF_INIT_NODE(node) {.parent = &(node), .rank = 0}
/**
* uf_node_init - Initialize a union-find node
* @node: pointer to the union-find node to be initialized
*
* This function sets the parent of the node to itself and
* initializes its rank to 0.
*/
static inline void uf_node_init(struct uf_node *node)
{
node->parent = node;
node->rank = 0;
}
/* find the root of a node */
struct uf_node *uf_find(struct uf_node *node);
/* Merge the sets that contain node1 and node2 */
void uf_union(struct uf_node *node1, struct uf_node *node2);
#endif /* __LINUX_UNION_FIND_H */
......@@ -1143,6 +1143,19 @@ config CPUSETS
Say N if unsure.
config CPUSETS_V1
bool "Legacy cgroup v1 cpusets controller"
depends on CPUSETS
default n
help
Legacy cgroup v1 cpusets controller which has been deprecated by
cgroup v2 implementation. The v1 is there for legacy applications
which haven't migrated to the new cgroup v2 interface yet. If you
do not have any such application then you are completely fine leaving
this option disabled.
Say N if unsure.
config PROC_PID_CPUSET
bool "Include legacy /proc/<pid>/cpuset file"
depends on CPUSETS
......
......@@ -5,5 +5,6 @@ obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
obj-$(CONFIG_CGROUP_PIDS) += pids.o
obj-$(CONFIG_CGROUP_RDMA) += rdma.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CPUSETS_V1) += cpuset-v1.o
obj-$(CONFIG_CGROUP_MISC) += misc.o
obj-$(CONFIG_CGROUP_DEBUG) += debug.o
......@@ -46,6 +46,12 @@ bool cgroup1_ssid_disabled(int ssid)
return cgroup_no_v1_mask & (1 << ssid);
}
static bool cgroup1_subsys_absent(struct cgroup_subsys *ss)
{
/* Check also dfl_cftypes for file-less controllers, i.e. perf_event */
return ss->legacy_cftypes == NULL && ss->dfl_cftypes;
}
/**
* cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
* @from: attach to all cgroups of a given task
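The effect of cgroup1_subsys_absent() can be seen with a small truth table. The sketch below uses simplified stand-ins for the kernel types (only the fields the predicate reads), and the controller names, v1_available_mask() and example_mask() are hypothetical. It keeps controllers that have v1 files or no files at all, and drops those that only have v2 files.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields used here. */
struct cftype {
	const char *name;
};

struct cgroup_subsys {
	const char *legacy_name;
	struct cftype *legacy_cftypes;	/* v1 interface files, or NULL */
	struct cftype *dfl_cftypes;	/* v2 interface files, or NULL */
};

/* Same predicate as in the hunk above. */
bool cgroup1_subsys_absent(const struct cgroup_subsys *ss)
{
	return ss->legacy_cftypes == NULL && ss->dfl_cftypes;
}

/* Build a bitmask of controllers that still have a v1 implementation. */
unsigned int v1_available_mask(const struct cgroup_subsys *subsys, size_t n)
{
	unsigned int mask = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (!cgroup1_subsys_absent(&subsys[i]))
			mask |= 1u << i;
	return mask;
}

/* Three made-up controllers covering the interesting cases. */
unsigned int example_mask(void)
{
	static struct cftype files[] = { { "dummy" } };
	const struct cgroup_subsys subsys[] = {
		{ "both",     files, files },	/* v1 and v2 files: available */
		{ "v2only",   NULL,  files },	/* only v2 files: absent on v1 */
		{ "fileless", NULL,  NULL  },	/* no files at all: still available */
	};

	return v1_available_mask(subsys, 3);
}
```

The "v2only" case corresponds to a controller such as cpuset built without CONFIG_CPUSETS_V1, while the "fileless" case is why the predicate also checks dfl_cftypes.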
......@@ -675,11 +681,14 @@ int proc_cgroupstats_show(struct seq_file *m, void *v)
* cgroup_mutex contention.
*/
for_each_subsys(ss, i)
for_each_subsys(ss, i) {
if (cgroup1_subsys_absent(ss))
continue;
seq_printf(m, "%s\t%d\t%d\t%d\n",
ss->legacy_name, ss->root->hierarchy_id,
atomic_read(&ss->root->nr_cgrps),
cgroup_ssid_enabled(i));
}
return 0;
}
......@@ -932,7 +941,8 @@ int cgroup1_parse_param(struct fs_context *fc, struct fs_parameter *param)
if (ret != -ENOPARAM)
return ret;
for_each_subsys(ss, i) {
if (strcmp(param->key, ss->legacy_name))
if (strcmp(param->key, ss->legacy_name) ||
cgroup1_subsys_absent(ss))
continue;
if (!cgroup_ssid_enabled(i) || cgroup1_ssid_disabled(i))
return invalfc(fc, "Disabled controller '%s'",
......@@ -1024,7 +1034,8 @@ static int check_cgroupfs_options(struct fs_context *fc)
mask = ~((u16)1 << cpuset_cgrp_id);
#endif
for_each_subsys(ss, i)
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i))
if (cgroup_ssid_enabled(i) && !cgroup1_ssid_disabled(i) &&
!cgroup1_subsys_absent(ss))
enabled |= 1 << i;
ctx->subsys_mask &= enabled;
......
......@@ -2331,7 +2331,7 @@ static struct file_system_type cgroup2_fs_type = {
.fs_flags = FS_USERNS_MOUNT,
};
#ifdef CONFIG_CPUSETS
#ifdef CONFIG_CPUSETS_V1
static const struct fs_context_operations cpuset_fs_context_ops = {
.get_tree = cgroup1_get_tree,
.free = cgroup_fs_context_free,
......@@ -3669,12 +3669,40 @@ static int cgroup_events_show(struct seq_file *seq, void *v)
static int cgroup_stat_show(struct seq_file *seq, void *v)
{
struct cgroup *cgroup = seq_css(seq)->cgroup;
struct cgroup_subsys_state *css;
int dying_cnt[CGROUP_SUBSYS_COUNT];
int ssid;
seq_printf(seq, "nr_descendants %d\n",
cgroup->nr_descendants);
/*
* Show the number of live and dying csses associated with each of the
* non-inhibited cgroup subsystems that are bound to cgroup v2.
*
* Without proper lock protection, racing is possible. So the
* numbers may not be consistent when that happens.
*/
rcu_read_lock();
for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
dying_cnt[ssid] = -1;
if ((BIT(ssid) & cgrp_dfl_inhibit_ss_mask) ||
(cgroup_subsys[ssid]->root != &cgrp_dfl_root))
continue;
css = rcu_dereference_raw(cgroup->subsys[ssid]);
dying_cnt[ssid] = cgroup->nr_dying_subsys[ssid];
seq_printf(seq, "nr_subsys_%s %d\n", cgroup_subsys[ssid]->name,
css ? (css->nr_descendants + 1) : 0);
}
seq_printf(seq, "nr_dying_descendants %d\n",
cgroup->nr_dying_descendants);
for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++) {
if (dying_cnt[ssid] >= 0)
seq_printf(seq, "nr_dying_subsys_%s %d\n",
cgroup_subsys[ssid]->name, dying_cnt[ssid]);
}
rcu_read_unlock();
return 0;
}
......@@ -4096,7 +4124,7 @@ static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
* If namespaces are delegation boundaries, disallow writes to
* files in a non-init namespace root from inside the namespace
* except for the files explicitly marked delegatable -
* cgroup.procs and cgroup.subtree_control.
* e.g. cgroup.procs, cgroup.threads and cgroup.subtree_control.
*/
if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
......@@ -5424,6 +5452,8 @@ static void css_release_work_fn(struct work_struct *work)
list_del_rcu(&css->sibling);
if (ss) {
struct cgroup *parent_cgrp;
/* css release path */
if (!list_empty(&css->rstat_css_node)) {
cgroup_rstat_flush(cgrp);
......@@ -5433,6 +5463,21 @@ static void css_release_work_fn(struct work_struct *work)
cgroup_idr_replace(&ss->css_idr, NULL, css->id);
if (ss->css_released)
ss->css_released(css);
cgrp->nr_dying_subsys[ss->id]--;
/*
* When a css is released and ready to be freed, its
* nr_descendants must be zero. However, the corresponding
* cgrp->nr_dying_subsys[ss->id] may not be 0 if a subsystem
* is activated and deactivated multiple times with one or
* more of its previous activations leaving behind dying csses.
*/
WARN_ON_ONCE(css->nr_descendants);
parent_cgrp = cgroup_parent(cgrp);
while (parent_cgrp) {
parent_cgrp->nr_dying_subsys[ss->id]--;
parent_cgrp = cgroup_parent(parent_cgrp);
}
} else {
struct cgroup *tcgrp;
......@@ -5517,8 +5562,11 @@ static int online_css(struct cgroup_subsys_state *css)
rcu_assign_pointer(css->cgroup->subsys[ss->id], css);
atomic_inc(&css->online_cnt);
if (css->parent)
if (css->parent) {
atomic_inc(&css->parent->online_cnt);
while ((css = css->parent))
css->nr_descendants++;
}
}
return ret;
}
......@@ -5540,6 +5588,16 @@ static void offline_css(struct cgroup_subsys_state *css)
RCU_INIT_POINTER(css->cgroup->subsys[ss->id], NULL);
wake_up_all(&css->cgroup->offline_waitq);
css->cgroup->nr_dying_subsys[ss->id]++;
/*
* Parent css and cgroup cannot be freed until after the freeing
* of child css, see css_free_rwork_fn().
*/
while ((css = css->parent)) {
css->nr_descendants--;
css->cgroup->nr_dying_subsys[ss->id]++;
}
}
/**
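The counter updates in online_css(), offline_css() and css_release_work_fn() can be followed with a toy model. This is a simplified sketch under stated assumptions: it merges the css and its cgroup into one hypothetical struct node, tracks a single subsystem, and ignores locking.

```c
/* Toy model: one node stands for a css and its cgroup, one subsystem only. */
struct node {
	struct node *parent;
	int nr_descendants;	/* live descendants, not counting the node itself */
	int nr_dying;		/* dying csses at and below this node */
};

/* Mirrors online_css(): every ancestor gains one live descendant. */
void node_online(struct node *n)
{
	struct node *p;

	for (p = n->parent; p; p = p->parent)
		p->nr_descendants++;
}

/* Mirrors offline_css(): the css moves from "live" to "dying". */
void node_offline(struct node *n)
{
	struct node *p;

	n->nr_dying++;
	for (p = n->parent; p; p = p->parent) {
		p->nr_descendants--;
		p->nr_dying++;
	}
}

/* Mirrors the release path: the dying count drops at the node and above. */
void node_release(struct node *n)
{
	struct node *p;

	for (p = n; p; p = p->parent)
		p->nr_dying--;
}
```

With a chain root, a, b, bringing a and b online gives root two live descendants; taking b offline moves one unit from the live to the dying count at each ancestor, and releasing b clears it again. cgroup_stat_show() reports the live figure as nr_descendants + 1 so that the css itself is included.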
......@@ -6178,7 +6236,7 @@ int __init cgroup_init(void)
WARN_ON(register_filesystem(&cgroup_fs_type));
WARN_ON(register_filesystem(&cgroup2_fs_type));
WARN_ON(!proc_create_single("cgroups", 0, NULL, proc_cgroupstats_show));
#ifdef CONFIG_CPUSETS
#ifdef CONFIG_CPUSETS_V1
WARN_ON(register_filesystem(&cpuset_fs_type));
#endif
......
/* SPDX-License-Identifier: GPL-2.0-or-later */
#ifndef __CPUSET_INTERNAL_H
#define __CPUSET_INTERNAL_H
#include <linux/cgroup.h>
#include <linux/cpu.h>
#include <linux/cpumask.h>
#include <linux/cpuset.h>
#include <linux/spinlock.h>
#include <linux/union_find.h>
/* See "Frequency meter" comments, below. */
struct fmeter {
int cnt; /* unprocessed events count */
int val; /* most recent output value */
time64_t time; /* clock (secs) when val computed */
spinlock_t lock; /* guards read or write of above */
};
/*
* Invalid partition error code
*/
enum prs_errcode {
PERR_NONE = 0,
PERR_INVCPUS,
PERR_INVPARENT,
PERR_NOTPART,
PERR_NOTEXCL,
PERR_NOCPUS,
PERR_HOTPLUG,
PERR_CPUSEMPTY,
PERR_HKEEPING,
PERR_ACCESS,
};
/* bits in struct cpuset flags field */
typedef enum {
CS_ONLINE,
CS_CPU_EXCLUSIVE,
CS_MEM_EXCLUSIVE,
CS_MEM_HARDWALL,
CS_MEMORY_MIGRATE,
CS_SCHED_LOAD_BALANCE,
CS_SPREAD_PAGE,
CS_SPREAD_SLAB,
} cpuset_flagbits_t;
/* The various types of files and directories in a cpuset file system */
typedef enum {
FILE_MEMORY_MIGRATE,
FILE_CPULIST,
FILE_MEMLIST,
FILE_EFFECTIVE_CPULIST,
FILE_EFFECTIVE_MEMLIST,
FILE_SUBPARTS_CPULIST,
FILE_EXCLUSIVE_CPULIST,
FILE_EFFECTIVE_XCPULIST,
FILE_ISOLATED_CPULIST,
FILE_CPU_EXCLUSIVE,
FILE_MEM_EXCLUSIVE,
FILE_MEM_HARDWALL,
FILE_SCHED_LOAD_BALANCE,
FILE_PARTITION_ROOT,
FILE_SCHED_RELAX_DOMAIN_LEVEL,
FILE_MEMORY_PRESSURE_ENABLED,
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
} cpuset_filetype_t;
struct cpuset {
struct cgroup_subsys_state css;
unsigned long flags; /* "unsigned long" so bitops work */
/*
* On default hierarchy:
*
* The user-configured masks can only be changed by writing to
* cpuset.cpus and cpuset.mems, and won't be limited by the
* parent masks.
*
* The effective masks are the real masks that apply to the tasks
* in the cpuset. They may be changed if the configured masks are
* changed or hotplug happens.
*
* effective_mask == configured_mask & parent's effective_mask,
* and if it ends up empty, it will inherit the parent's mask.
*
*
* On legacy hierarchy:
*
* The user-configured masks are always the same as the effective masks.
*/
/* user-configured CPUs and Memory Nodes allowed to tasks */
cpumask_var_t cpus_allowed;
nodemask_t mems_allowed;
/* effective CPUs and Memory Nodes allowed to tasks */
cpumask_var_t effective_cpus;
nodemask_t effective_mems;
/*
* Exclusive CPUs dedicated to current cgroup (default hierarchy only)
*
* The effective_cpus of a valid partition root comes solely from its
* effective_xcpus and some of the effective_xcpus may be distributed
* to sub-partitions below & hence excluded from its effective_cpus.
* For a valid partition root, its effective_cpus have no relationship
* with cpus_allowed unless its exclusive_cpus isn't set.
*
* This value will only be set if either exclusive_cpus is set or
* when this cpuset becomes a local partition root.
*/
cpumask_var_t effective_xcpus;
/*
* Exclusive CPUs as requested by the user (default hierarchy only)
*
* Its value is independent of cpus_allowed and designates the set of
* CPUs that can be granted to the current cpuset or its children when
* it becomes a valid partition root. The effective set of exclusive
* CPUs granted (effective_xcpus) depends on whether those exclusive
* CPUs are passed down by its ancestors and not yet taken up by
* another sibling partition root along the way.
*
* If its value isn't set, it defaults to cpus_allowed.
*/
cpumask_var_t exclusive_cpus;
/*
* These are the old Memory Nodes that tasks took on.
*
* - top_cpuset.old_mems_allowed is initialized to mems_allowed.
* - A new cpuset's old_mems_allowed is initialized when some
* task is moved into it.
* - old_mems_allowed is used in cpuset_migrate_mm() when we change
* cpuset.mems_allowed and have tasks' nodemask updated, and
* then old_mems_allowed is updated to mems_allowed.
*/
nodemask_t old_mems_allowed;
struct fmeter fmeter; /* memory_pressure filter */
/*
* Tasks are being attached to this cpuset. Used to prevent
* zeroing cpus/mems_allowed between ->can_attach() and ->attach().
*/
int attach_in_progress;
/* for custom sched domain */
int relax_domain_level;
/* number of valid local child partitions */
int nr_subparts;
/* partition root state */
int partition_root_state;
/*
* number of SCHED_DEADLINE tasks attached to this cpuset, so that we
* know when to rebuild associated root domain bandwidth information.
*/
int nr_deadline_tasks;
int nr_migrate_dl_tasks;
u64 sum_migrate_dl_bw;
/* Invalid partition error code, not lock protected */
enum prs_errcode prs_err;
/* Handle for cpuset.cpus.partition */
struct cgroup_file partition_file;
/* Remote partition sibling list anchored at remote_children */
struct list_head remote_sibling;
/* Used to merge intersecting subsets for generate_sched_domains */
struct uf_node node;
};
static inline struct cpuset *css_cs(struct cgroup_subsys_state *css)
{
return css ? container_of(css, struct cpuset, css) : NULL;
}
/* Retrieve the cpuset for a task */
static inline struct cpuset *task_cs(struct task_struct *task)
{
return css_cs(task_css(task, cpuset_cgrp_id));
}
static inline struct cpuset *parent_cs(struct cpuset *cs)
{
return css_cs(cs->css.parent);
}
/* convenient tests for these bits */
static inline bool is_cpuset_online(struct cpuset *cs)
{
return test_bit(CS_ONLINE, &cs->flags) && !css_is_dying(&cs->css);
}
static inline int is_cpu_exclusive(const struct cpuset *cs)
{
return test_bit(CS_CPU_EXCLUSIVE, &cs->flags);
}
static inline int is_mem_exclusive(const struct cpuset *cs)
{
return test_bit(CS_MEM_EXCLUSIVE, &cs->flags);
}
static inline int is_mem_hardwall(const struct cpuset *cs)
{
return test_bit(CS_MEM_HARDWALL, &cs->flags);
}
static inline int is_sched_load_balance(const struct cpuset *cs)
{
return test_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
}
static inline int is_memory_migrate(const struct cpuset *cs)
{
return test_bit(CS_MEMORY_MIGRATE, &cs->flags);
}
static inline int is_spread_page(const struct cpuset *cs)
{
return test_bit(CS_SPREAD_PAGE, &cs->flags);
}
static inline int is_spread_slab(const struct cpuset *cs)
{
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}
/**
* cpuset_for_each_child - traverse online children of a cpuset
* @child_cs: loop cursor pointing to the current child
* @pos_css: used for iteration
* @parent_cs: target cpuset to walk children of
*
* Walk @child_cs through the online children of @parent_cs. Must be used
* with RCU read locked.
*/
#define cpuset_for_each_child(child_cs, pos_css, parent_cs) \
css_for_each_child((pos_css), &(parent_cs)->css) \
if (is_cpuset_online(((child_cs) = css_cs((pos_css)))))
/**
* cpuset_for_each_descendant_pre - pre-order walk of a cpuset's descendants
* @des_cs: loop cursor pointing to the current descendant
* @pos_css: used for iteration
* @root_cs: target cpuset to walk descendants of
*
* Walk @des_cs through the online descendants of @root_cs. Must be used
* with RCU read locked. The caller may modify @pos_css by calling
* css_rightmost_descendant() to skip subtree. @root_cs is included in the
* iteration and the first node to be visited.
*/
#define cpuset_for_each_descendant_pre(des_cs, pos_css, root_cs) \
css_for_each_descendant_pre((pos_css), &(root_cs)->css) \
if (is_cpuset_online(((des_cs) = css_cs((pos_css)))))
void rebuild_sched_domains_locked(void);
void cpuset_callback_lock_irq(void);
void cpuset_callback_unlock_irq(void);
void cpuset_update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus);
void cpuset_update_tasks_nodemask(struct cpuset *cs);
int cpuset_update_flag(cpuset_flagbits_t bit, struct cpuset *cs, int turning_on);
ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off);
int cpuset_common_seq_show(struct seq_file *sf, void *v);
/*
* cpuset-v1.c
*/
#ifdef CONFIG_CPUSETS_V1
extern struct cftype cpuset1_files[];
void fmeter_init(struct fmeter *fmp);
void cpuset1_update_task_spread_flags(struct cpuset *cs,
struct task_struct *tsk);
void cpuset1_update_tasks_flags(struct cpuset *cs);
void cpuset1_hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated);
int cpuset1_validate_change(struct cpuset *cur, struct cpuset *trial);
#else
static inline void fmeter_init(struct fmeter *fmp) {}
static inline void cpuset1_update_task_spread_flags(struct cpuset *cs,
struct task_struct *tsk) {}
static inline void cpuset1_update_tasks_flags(struct cpuset *cs) {}
static inline void cpuset1_hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated) {}
static inline int cpuset1_validate_change(struct cpuset *cur,
struct cpuset *trial) { return 0; }
#endif /* CONFIG_CPUSETS_V1 */
#endif /* __CPUSET_INTERNAL_H */
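The tail of this header uses a common kernel pattern: when the option is off, each v1 entry point becomes an empty static inline, so callers need no #ifdef of their own. A minimal standalone sketch of that pattern follows; EXAMPLE_FEATURE_V1, example_validate_change() and example_caller() are hypothetical names standing in for CONFIG_CPUSETS_V1 and the cpuset1_*() helpers.

```c
/*
 * Hypothetical switch standing in for CONFIG_CPUSETS_V1. It is left
 * undefined here, so the stub branch below is the one that is built.
 */
#ifdef EXAMPLE_FEATURE_V1
/* The real implementation would live in a separate source file. */
int example_validate_change(int cur, int trial);
#else
/* Stub: with the feature compiled out there is nothing to validate. */
static inline int example_validate_change(int cur, int trial)
{
	(void)cur;
	(void)trial;
	return 0;
}
#endif

/* The call site looks the same whether or not the feature is built. */
int example_caller(int cur, int trial)
{
	return example_validate_change(cur, trial);
}
```

Because the stub is an inline function rather than a macro, the compiler still type-checks the arguments in both configurations.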
......@@ -244,7 +244,6 @@ static void pids_event(struct pids_cgroup *pids_forking,
struct pids_cgroup *pids_over_limit)
{
struct pids_cgroup *p = pids_forking;
bool limit = false;
/* Only log the first time limit is hit. */
if (atomic64_inc_return(&p->events_local[PIDCG_FORKFAIL]) == 1) {
......@@ -252,20 +251,17 @@ static void pids_event(struct pids_cgroup *pids_forking,
pr_cont_cgroup_path(p->css.cgroup);
pr_cont("\n");
}
cgroup_file_notify(&p->events_local_file);
if (!cgroup_subsys_on_dfl(pids_cgrp_subsys) ||
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS) {
cgroup_file_notify(&p->events_local_file);
return;
}
for (; parent_pids(p); p = parent_pids(p)) {
if (p == pids_over_limit) {
limit = true;
atomic64_inc(&p->events_local[PIDCG_MAX]);
cgroup_file_notify(&p->events_local_file);
}
if (limit)
atomic64_inc(&p->events[PIDCG_MAX]);
atomic64_inc(&pids_over_limit->events_local[PIDCG_MAX]);
cgroup_file_notify(&pids_over_limit->events_local_file);
for (p = pids_over_limit; parent_pids(p); p = parent_pids(p)) {
atomic64_inc(&p->events[PIDCG_MAX]);
cgroup_file_notify(&p->events_file);
}
}
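The counter updates that pids_event() performs after this change can be modelled as follows. This is a sketch under stated assumptions: struct pids_node and model_pids_event() are hypothetical, file notifications are omitted, and the early-return branch (legacy hierarchy, or CGRP_ROOT_PIDS_LOCAL_EVENTS set) is not modelled.

```c
enum { EV_MAX, EV_FORKFAIL, NR_EV };

/* Hypothetical stand-in for struct pids_cgroup: counters only. */
struct pids_node {
	struct pids_node *parent;
	long events[NR_EV];		/* hierarchical counters */
	long events_local[NR_EV];	/* local counters */
};

/*
 * Model of the new accounting: the fork failure is local to the forking
 * cgroup, the limit hit is local to the cgroup whose limit was exceeded,
 * and the hierarchical "max" event propagates from that cgroup upwards.
 */
void model_pids_event(struct pids_node *forking, struct pids_node *over_limit)
{
	struct pids_node *p;

	forking->events_local[EV_FORKFAIL]++;
	over_limit->events_local[EV_MAX]++;

	/* As in the kernel loop, the walk stops before the parentless root. */
	for (p = over_limit; p->parent; p = p->parent)
		p->events[EV_MAX]++;
}
```

In a chain root, a, b, c where c forks and a's limit is hit, only a receives the hierarchical event. Cgroups b and c, which sit between the forking cgroup and the one over its limit, are no longer touched; before the change they were still notified on their events file even though no counter had changed.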
......@@ -276,15 +272,10 @@ static void pids_event(struct pids_cgroup *pids_forking,
*/
static int pids_can_fork(struct task_struct *task, struct css_set *cset)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids, *pids_over_limit;
int err;
if (cset)
css = cset->subsys[pids_cgrp_id];
else
css = task_css_check(current, pids_cgrp_id, true);
pids = css_pids(css);
pids = css_pids(cset->subsys[pids_cgrp_id]);
err = pids_try_charge(pids, 1, &pids_over_limit);
if (err)
pids_event(pids, pids_over_limit);
......@@ -294,14 +285,9 @@ static int pids_can_fork(struct task_struct *task, struct css_set *cset)
static void pids_cancel_fork(struct task_struct *task, struct css_set *cset)
{
struct cgroup_subsys_state *css;
struct pids_cgroup *pids;
if (cset)
css = cset->subsys[pids_cgrp_id];
else
css = task_css_check(current, pids_cgrp_id, true);
pids = css_pids(css);
pids = css_pids(cset->subsys[pids_cgrp_id]);
pids_uncharge(pids, 1);
}
......
......@@ -2311,7 +2311,6 @@ __latent_entropy struct task_struct *copy_process(
#endif
#ifdef CONFIG_CPUSETS
p->cpuset_mem_spread_rotor = NUMA_NO_NODE;
p->cpuset_slab_spread_rotor = NUMA_NO_NODE;
seqcount_spinlock_init(&p->mems_allowed_seq, &p->alloc_lock);
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
......
......@@ -34,7 +34,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
earlycpio.o seq_buf.o siphash.o dec_and_lock.o \
nmi_backtrace.o win_minmax.o memcat_p.o \
buildid.o objpool.o
buildid.o objpool.o union_find.o
lib-$(CONFIG_PRINTK) += dump_stack.o
lib-$(CONFIG_SMP) += cpumask.o
......
// SPDX-License-Identifier: GPL-2.0
#include <linux/union_find.h>
/**
* uf_find - Find the root of a node and perform path compression
* @node: the node to find the root of
*
* This function returns the root of the node by following the parent
* pointers. It also performs path compression, making the tree shallower.
*
* Returns the root node of the set containing node.
*/
struct uf_node *uf_find(struct uf_node *node)
{
struct uf_node *parent;
while (node->parent != node) {
parent = node->parent;
node->parent = parent->parent;
node = parent;
}
return node;
}
/**
* uf_union - Merge two sets, using union by rank
* @node1: the first node
* @node2: the second node
*
* This function merges the sets containing node1 and node2, by comparing
* the ranks to keep the tree balanced.
*/
void uf_union(struct uf_node *node1, struct uf_node *node2)
{
struct uf_node *root1 = uf_find(node1);
struct uf_node *root2 = uf_find(node2);
if (root1 == root2)
return;
if (root1->rank < root2->rank) {
root1->parent = root2;
} else if (root1->rank > root2->rank) {
root2->parent = root1;
} else {
root2->parent = root1;
root1->rank++;
}
}
......@@ -84,6 +84,20 @@ echo member > test/cpuset.cpus.partition
echo "" > test/cpuset.cpus
[[ $RESULT -eq 0 ]] && skip_test "Child cgroups are using cpuset!"
#
# If isolated CPUs have been reserved at boot time (as shown in
# cpuset.cpus.isolated), these isolated CPUs should be outside of CPUs 0-7
# that will be used by this script for testing purpose. If not, some of
# the tests may fail incorrectly. These isolated CPUs will also be removed
# before being compared with the expected results.
#
BOOT_ISOLCPUS=$(cat $CGROUP2/cpuset.cpus.isolated)
if [[ -n "$BOOT_ISOLCPUS" ]]
then
[[ $(echo $BOOT_ISOLCPUS | sed -e "s/[,-].*//") -le 7 ]] &&
skip_test "Pre-isolated CPUs ($BOOT_ISOLCPUS) overlap CPUs to be tested"
echo "Pre-isolated CPUs: $BOOT_ISOLCPUS"
fi
cleanup()
{
online_cpus
......@@ -321,7 +335,7 @@ TEST_MATRIX=(
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
#
-# Incorrect change to cpuset.cpus invalidates partition root
+# Incorrect change to cpuset.cpus[.exclusive] invalidates partition root
#
# Adding CPUs to partition root that are not in parent's
# cpuset.cpus is allowed, but those extra CPUs are ignored.
......@@ -365,6 +379,16 @@ TEST_MATRIX=(
# cpuset.cpus can overlap with sibling cpuset.cpus.exclusive but not subsumed by it
" C0-3 . . C4-5 X5 . . . 0 A1:0-3,B1:4-5"
# Child partition root that try to take all CPUs from parent partition
# with tasks will remain invalid.
" C1-4:P1:S+ P1 . . . . . . 0 A1:1-4,A2:1-4 A1:P1,A2:P-1"
" C1-4:P1:S+ P1 . . . C1-4 . . 0 A1,A2:1-4 A1:P1,A2:P1"
" C1-4:P1:S+ P1 . . T C1-4 . . 0 A1:1-4,A2:1-4 A1:P1,A2:P-1"
# Clearing of cpuset.cpus with a preset cpuset.cpus.exclusive shouldn't
# affect cpuset.cpus.exclusive.effective.
" C1-4:X3:S+ C1:X3 . . . C . . 0 A2:1-4,XA2:3"
# old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate ISOLCPUS
# ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------ --------
# Failure cases:
......@@ -632,7 +656,8 @@ check_cgroup_states()
# Note that isolated CPUs from the sched/domains context include offline
# CPUs as well as CPUs in non-isolated 1-CPU partition. Those CPUs may
# not be included in the cpuset.cpus.isolated control file which contains
-# only CPUs in isolated partitions.
+# only CPUs in isolated partitions as well as those that are isolated at
+# boot time.
#
# $1 - expected isolated cpu list(s) <isolcpus1>{,<isolcpus2>}
# <isolcpus1> - expected sched/domains value
......@@ -659,18 +684,21 @@ check_isolcpus()
fi
#
-# Check the debug isolated cpumask, if present
+# Check cpuset.cpus.isolated cpumask
#
-[[ -f $ISCPUS ]] && {
+if [[ -z "$BOOT_ISOLCPUS" ]]
+then
+ISOLCPUS=$(cat $ISCPUS)
+else
+ISOLCPUS=$(cat $ISCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
+fi
+[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
+# Take a 50ms pause and try again
+pause 0.05
ISOLCPUS=$(cat $ISCPUS)
-[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && {
-# Take a 50ms pause and try again
-pause 0.05
-ISOLCPUS=$(cat $ISCPUS)
-}
-[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
-ISOLCPUS=
}
+[[ "$EXPECT_VAL2" != "$ISOLCPUS" ]] && return 1
+ISOLCPUS=
#
# Use the sched domain in debugfs to check isolated CPUs, if available
......@@ -703,6 +731,9 @@ check_isolcpus()
fi
done
[[ "$ISOLCPUS" = *- ]] && ISOLCPUS=${ISOLCPUS}$LASTISOLCPU
[[ -n "$BOOT_ISOLCPUS" ]] &&
ISOLCPUS=$(echo $ISOLCPUS | sed -e "s/,*$BOOT_ISOLCPUS//")
[[ "$EXPECT_VAL" = "$ISOLCPUS" ]]
}
......@@ -720,7 +751,8 @@ test_fail()
}
#
-# Check to see if there are unexpected isolated CPUs left
+# Check to see if there are unexpected isolated CPUs left beyond the boot
+# time isolated ones.
#
null_isolcpus_check()
{
......
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
#
# Basic test for cpuset v1 interfaces write/read
#
skip_test() {
echo "$1"
echo "Test SKIPPED"
exit 4 # ksft_skip
}
write_test() {
dir=$1
interface=$2
value=$3
expect=$4
original=$(cat $dir/$interface)
echo "testing $interface $value"
echo $value > $dir/$interface
new=$(cat $dir/$interface)
# Compare as strings against the expected column; -ne would treat "0-1" as arithmetic
[[ "$new" != "$expect" ]] && {
echo "$interface write $value failed: new:$new"
exit 1
}
}
[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
# Find cpuset v1 mount point
CPUSET=$(mount -t cgroup | grep cpuset | head -1 | awk '{print $3}')
[[ -n "$CPUSET" ]] || skip_test "cpuset v1 mount point not found!"
#
# Create a test cpuset, read write test
#
TDIR=test$$
[[ -d $CPUSET/$TDIR ]] || mkdir $CPUSET/$TDIR
ITF_MATRIX=(
#interface value expect root_only
'cpuset.cpus 0-1 0-1 0'
'cpuset.mem_exclusive 1 1 0'
'cpuset.mem_exclusive 0 0 0'
'cpuset.mem_hardwall 1 1 0'
'cpuset.mem_hardwall 0 0 0'
'cpuset.memory_migrate 1 1 0'
'cpuset.memory_migrate 0 0 0'
'cpuset.memory_spread_page 1 1 0'
'cpuset.memory_spread_page 0 0 0'
'cpuset.memory_spread_slab 1 1 0'
'cpuset.memory_spread_slab 0 0 0'
'cpuset.mems 0 0 0'
'cpuset.sched_load_balance 1 1 0'
'cpuset.sched_load_balance 0 0 0'
'cpuset.sched_relax_domain_level 2 2 0'
'cpuset.memory_pressure_enabled 1 1 1'
'cpuset.memory_pressure_enabled 0 0 1'
)
run_test()
{
cnt="${ITF_MATRIX[@]}"
for i in "${ITF_MATRIX[@]}" ; do
args=($i)
root_only=${args[3]}
[[ $root_only -eq 1 ]] && {
write_test "$CPUSET" "${args[0]}" "${args[1]}" "${args[2]}"
continue
}
write_test "$CPUSET/$TDIR" "${args[0]}" "${args[1]}" "${args[2]}"
done
}
run_test
rmdir $CPUSET/$TDIR
echo "Test PASSED"
exit 0