Commit adf4bfc4 authored by Linus Torvalds

Merge tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset now support isolated cpus.partition type, which will enable
   dynamic CPU isolation

 - pids.peak added to remember the max number of pids used

 - holes in cgroup namespace plugged

 - internal cleanups

* tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits)
  cgroup: use strscpy() is more robust and safer
  iocost_monitor: reorder BlkgIterator
  cgroup: simplify code in cgroup_apply_control
  cgroup: Make cgroup_get_from_id() prettier
  cgroup/cpuset: remove unreachable code
  cgroup: Remove CFTYPE_PRESSURE
  cgroup: Improve cftype add/rm error handling
  kselftest/cgroup: Add cpuset v2 partition root state test
  cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
  cgroup/cpuset: Relocate a code block in validate_change()
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Relax constraints to partition & cpus changes
  cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
  cgroup/cpuset: Miscellaneous cleanups & add helper functions
  cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
  cgroup: add pids.peak interface for pids controller
  cgroup: Remove data-race around cgrp_dfl_visible
  cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG
  ...
parents 8adc0486 8619e94d
@@ -2190,75 +2190,93 @@ Cpuset Interface Files

        It accepts only the following input values when written to.

-         ========      ================================
-         "root"        a partition root
-         "member"      a non-root member of a partition
-         ========      ================================
+         ==========    =====================================
+         "member"      Non-root member of a partition
+         "root"        Partition root
+         "isolated"    Partition root without load balancing
+         ==========    =====================================
+
+       The root cgroup is always a partition root and its state
+       cannot be changed.  All other non-root cgroups start out as
+       "member".

-       When set to be a partition root, the current cgroup is the
-       root of a new partition or scheduling domain that comprises
-       itself and all its descendants except those that are separate
-       partition roots themselves and their descendants.  The root
-       cgroup is always a partition root.
-
-       There are constraints on where a partition root can be set.
-       It can only be set in a cgroup if all the following conditions
-       are true.
-
-       1) The "cpuset.cpus" is not empty and the list of CPUs are
-          exclusive, i.e. they are not shared by any of its siblings.
-       2) The parent cgroup is a partition root.
-       3) The "cpuset.cpus" is also a proper subset of the parent's
-          "cpuset.cpus.effective".
-       4) There is no child cgroups with cpuset enabled.  This is for
-          eliminating corner cases that have to be handled if such a
-          condition is allowed.
-
-       Setting it to partition root will take the CPUs away from the
-       effective CPUs of the parent cgroup.  Once it is set, this
-       file cannot be reverted back to "member" if there are any child
-       cgroups with cpuset enabled.
-
-       A parent partition cannot distribute all its CPUs to its
-       child partitions.  There must be at least one cpu left in the
-       parent partition.
-
-       Once becoming a partition root, changes to "cpuset.cpus" is
-       generally allowed as long as the first condition above is true,
-       the change will not take away all the CPUs from the parent
-       partition and the new "cpuset.cpus" value is a superset of its
-       children's "cpuset.cpus" values.
-
-       Sometimes, external factors like changes to ancestors'
-       "cpuset.cpus" or cpu hotplug can cause the state of the partition
-       root to change.  On read, the "cpuset.sched.partition" file
-       can show the following values.
-
-         ==============        ==============================
-         "member"              Non-root member of a partition
-         "root"                Partition root
-         "root invalid"        Invalid partition root
-         ==============        ==============================
-
-       It is a partition root if the first 2 partition root conditions
-       above are true and at least one CPU from "cpuset.cpus" is
-       granted by the parent cgroup.
-
-       A partition root can become invalid if none of CPUs requested
-       in "cpuset.cpus" can be granted by the parent cgroup or the
-       parent cgroup is no longer a partition root itself.  In this
-       case, it is not a real partition even though the restriction
-       of the first partition root condition above will still apply.
-       The cpu affinity of all the tasks in the cgroup will then be
-       associated with CPUs in the nearest ancestor partition.
-
-       An invalid partition root can be transitioned back to a
-       real partition root if at least one of the requested CPUs
-       can now be granted by its parent.  In this case, the cpu
-       affinity of all the tasks in the formerly invalid partition
-       will be associated to the CPUs of the newly formed partition.
-       Changing the partition state of an invalid partition root to
-       "member" is always allowed even if child cpusets are present.
+       When set to "root", the current cgroup is the root of a new
+       partition or scheduling domain that comprises itself and all
+       its descendants except those that are separate partition roots
+       themselves and their descendants.
+
+       When set to "isolated", the CPUs in that partition root will
+       be in an isolated state without any load balancing from the
+       scheduler.  Tasks placed in such a partition with multiple
+       CPUs should be carefully distributed and bound to each of the
+       individual CPUs for optimal performance.
+
+       The value shown in "cpuset.cpus.effective" of a partition root
+       is the CPUs that the partition root can dedicate to a potential
+       new child partition root.  The new child subtracts available
+       CPUs from its parent "cpuset.cpus.effective".
+
+       A partition root ("root" or "isolated") can be in one of the
+       two possible states - valid or invalid.  An invalid partition
+       root is in a degraded state where some state information may
+       be retained, but behaves more like a "member".
+
+       All possible state transitions among "member", "root" and
+       "isolated" are allowed.
+
+       On read, the "cpuset.cpus.partition" file can show the following
+       values.
+
+         =============================  =====================================
+         "member"                       Non-root member of a partition
+         "root"                         Partition root
+         "isolated"                     Partition root without load balancing
+         "root invalid (<reason>)"      Invalid partition root
+         "isolated invalid (<reason>)"  Invalid isolated partition root
+         =============================  =====================================
+
+       In the case of an invalid partition root, a descriptive string on
+       why the partition is invalid is included within parentheses.
+
+       For a partition root to become valid, the following conditions
+       must be met.
+
+       1) The "cpuset.cpus" is exclusive with its siblings, i.e. they
+          are not shared by any of its siblings (exclusivity rule).
+       2) The parent cgroup is a valid partition root.
+       3) The "cpuset.cpus" is not empty and must contain at least
+          one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
+       4) The "cpuset.cpus.effective" cannot be empty unless there is
+          no task associated with this partition.
+
+       External events like hotplug or changes to "cpuset.cpus" can
+       cause a valid partition root to become invalid and vice versa.
+       Note that a task cannot be moved to a cgroup with empty
+       "cpuset.cpus.effective".
+
+       For a valid partition root with the sibling cpu exclusivity
+       rule enabled, changes made to "cpuset.cpus" that violate the
+       exclusivity rule will invalidate the partition as well as its
+       sibling partitions with conflicting cpuset.cpus values.  So
+       care must be taken in changing "cpuset.cpus".
+
+       A valid non-root parent partition may distribute out all its CPUs
+       to its child partitions when there is no task associated with it.
+
+       Care must be taken to change a valid partition root to
+       "member" as all its child partitions, if present, will become
+       invalid causing disruption to tasks running in those child
+       partitions.  These inactivated partitions could be recovered if
+       their parent is switched back to a partition root with a proper
+       set of "cpuset.cpus".
+
+       Poll and inotify events are triggered whenever the state of
+       "cpuset.cpus.partition" changes.  That includes changes caused
+       by write to "cpuset.cpus.partition", cpu hotplug or other
+       changes that modify the validity status of the partition.
+       This will allow user space agents to monitor unexpected changes
+       to "cpuset.cpus.partition" without the need to do continuous
+       polling.

 Device controller
...
@@ -19,8 +19,8 @@ int blkcg_set_fc_appid(char *app_id, u64 cgrp_id, size_t app_id_len)
                return -EINVAL;

        cgrp = cgroup_get_from_id(cgrp_id);
-       if (!cgrp)
-               return -ENOENT;
+       if (IS_ERR(cgrp))
+               return PTR_ERR(cgrp);
        css = cgroup_get_e_css(cgrp, &io_cgrp_subsys);
        if (!css) {
                ret = -ENOENT;
...
...@@ -126,11 +126,11 @@ enum { ...@@ -126,11 +126,11 @@ enum {
CFTYPE_NO_PREFIX = (1 << 3), /* (DON'T USE FOR NEW FILES) no subsys prefix */ CFTYPE_NO_PREFIX = (1 << 3), /* (DON'T USE FOR NEW FILES) no subsys prefix */
CFTYPE_WORLD_WRITABLE = (1 << 4), /* (DON'T USE FOR NEW FILES) S_IWUGO */ CFTYPE_WORLD_WRITABLE = (1 << 4), /* (DON'T USE FOR NEW FILES) S_IWUGO */
CFTYPE_DEBUG = (1 << 5), /* create when cgroup_debug */ CFTYPE_DEBUG = (1 << 5), /* create when cgroup_debug */
CFTYPE_PRESSURE = (1 << 6), /* only if pressure feature is enabled */
/* internal flags, do not use outside cgroup core proper */ /* internal flags, do not use outside cgroup core proper */
__CFTYPE_ONLY_ON_DFL = (1 << 16), /* only on default hierarchy */ __CFTYPE_ONLY_ON_DFL = (1 << 16), /* only on default hierarchy */
__CFTYPE_NOT_ON_DFL = (1 << 17), /* not on default hierarchy */ __CFTYPE_NOT_ON_DFL = (1 << 17), /* not on default hierarchy */
__CFTYPE_ADDED = (1 << 18),
}; };
/* /*
...@@ -384,7 +384,7 @@ struct cgroup { ...@@ -384,7 +384,7 @@ struct cgroup {
/* /*
* The depth this cgroup is at. The root is at depth zero and each * The depth this cgroup is at. The root is at depth zero and each
* step down the hierarchy increments the level. This along with * step down the hierarchy increments the level. This along with
* ancestor_ids[] can determine whether a given cgroup is a * ancestors[] can determine whether a given cgroup is a
* descendant of another without traversing the hierarchy. * descendant of another without traversing the hierarchy.
*/ */
int level; int level;
...@@ -504,8 +504,8 @@ struct cgroup { ...@@ -504,8 +504,8 @@ struct cgroup {
/* Used to store internal freezer state */ /* Used to store internal freezer state */
struct cgroup_freezer_state freezer; struct cgroup_freezer_state freezer;
/* ids of the ancestors at each level including self */ /* All ancestors including self */
u64 ancestor_ids[]; struct cgroup *ancestors[];
}; };
/* /*
...@@ -522,11 +522,15 @@ struct cgroup_root { ...@@ -522,11 +522,15 @@ struct cgroup_root {
/* Unique id for this hierarchy. */ /* Unique id for this hierarchy. */
int hierarchy_id; int hierarchy_id;
/* The root cgroup. Root is destroyed on its release. */ /*
* The root cgroup. The containing cgroup_root will be destroyed on its
* release. cgrp->ancestors[0] will be used overflowing into the
* following field. cgrp_ancestor_storage must immediately follow.
*/
struct cgroup cgrp; struct cgroup cgrp;
/* for cgrp->ancestor_ids[0] */ /* must follow cgrp for cgrp->ancestors[0], see above */
u64 cgrp_ancestor_id_storage; struct cgroup *cgrp_ancestor_storage;
/* Number of cgroups in the hierarchy, used only for /proc/cgroups */ /* Number of cgroups in the hierarchy, used only for /proc/cgroups */
atomic_t nr_cgrps; atomic_t nr_cgrps;
......
@@ -575,7 +575,7 @@ static inline bool cgroup_is_descendant(struct cgroup *cgrp,
 {
        if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
                return false;
-       return cgrp->ancestor_ids[ancestor->level] == cgroup_id(ancestor);
+       return cgrp->ancestors[ancestor->level] == ancestor;
 }

 /**
@@ -592,11 +592,9 @@ static inline bool cgroup_is_descendant(struct cgroup *cgrp,
 static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
                                             int ancestor_level)
 {
-       if (cgrp->level < ancestor_level)
+       if (ancestor_level < 0 || ancestor_level > cgrp->level)
                return NULL;
-       while (cgrp && cgrp->level > ancestor_level)
-               cgrp = cgroup_parent(cgrp);
-       return cgrp;
+       return cgrp->ancestors[ancestor_level];
 }

 /**
@@ -748,11 +746,6 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 {}

-static inline struct cgroup *cgroup_get_from_id(u64 id)
-{
-       return NULL;
-}
 #endif /* !CONFIG_CGROUPS */

 #ifdef CONFIG_CGROUPS
...
@@ -250,6 +250,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
 int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
                       bool threadgroup);
+void cgroup_attach_lock(bool lock_threadgroup);
+void cgroup_attach_unlock(bool lock_threadgroup);
 struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
                                             bool *locked)
        __acquires(&cgroup_threadgroup_rwsem);
...
...@@ -59,8 +59,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) ...@@ -59,8 +59,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
int retval = 0; int retval = 0;
mutex_lock(&cgroup_mutex); mutex_lock(&cgroup_mutex);
cpus_read_lock(); cgroup_attach_lock(true);
percpu_down_write(&cgroup_threadgroup_rwsem);
for_each_root(root) { for_each_root(root) {
struct cgroup *from_cgrp; struct cgroup *from_cgrp;
...@@ -72,8 +71,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk) ...@@ -72,8 +71,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
if (retval) if (retval)
break; break;
} }
percpu_up_write(&cgroup_threadgroup_rwsem); cgroup_attach_unlock(true);
cpus_read_unlock();
mutex_unlock(&cgroup_mutex); mutex_unlock(&cgroup_mutex);
return retval; return retval;
......
@@ -217,6 +217,7 @@ struct cgroup_namespace init_cgroup_ns = {
 static struct file_system_type cgroup2_fs_type;
 static struct cftype cgroup_base_files[];
+static struct cftype cgroup_psi_files[];

 /* cgroup optional features */
 enum cgroup_opt_features {
@@ -1689,12 +1690,16 @@ static void css_clear_dir(struct cgroup_subsys_state *css)
        css->flags &= ~CSS_VISIBLE;

        if (!css->ss) {
-               if (cgroup_on_dfl(cgrp))
-                       cfts = cgroup_base_files;
-               else
-                       cfts = cgroup1_base_files;
-
-               cgroup_addrm_files(css, cgrp, cfts, false);
+               if (cgroup_on_dfl(cgrp)) {
+                       cgroup_addrm_files(css, cgrp,
+                                          cgroup_base_files, false);
+                       if (cgroup_psi_enabled())
+                               cgroup_addrm_files(css, cgrp,
+                                                  cgroup_psi_files, false);
+               } else {
+                       cgroup_addrm_files(css, cgrp,
+                                          cgroup1_base_files, false);
+               }
        } else {
                list_for_each_entry(cfts, &css->ss->cfts, node)
                        cgroup_addrm_files(css, cgrp, cfts, false);
...@@ -1717,14 +1722,22 @@ static int css_populate_dir(struct cgroup_subsys_state *css) ...@@ -1717,14 +1722,22 @@ static int css_populate_dir(struct cgroup_subsys_state *css)
return 0; return 0;
if (!css->ss) { if (!css->ss) {
if (cgroup_on_dfl(cgrp)) if (cgroup_on_dfl(cgrp)) {
cfts = cgroup_base_files; ret = cgroup_addrm_files(&cgrp->self, cgrp,
else cgroup_base_files, true);
cfts = cgroup1_base_files; if (ret < 0)
return ret;
ret = cgroup_addrm_files(&cgrp->self, cgrp, cfts, true);
if (ret < 0) if (cgroup_psi_enabled()) {
return ret; ret = cgroup_addrm_files(&cgrp->self, cgrp,
cgroup_psi_files, true);
if (ret < 0)
return ret;
}
} else {
cgroup_addrm_files(css, cgrp,
cgroup1_base_files, true);
}
} else { } else {
list_for_each_entry(cfts, &css->ss->cfts, node) { list_for_each_entry(cfts, &css->ss->cfts, node) {
ret = cgroup_addrm_files(css, cgrp, cfts, true); ret = cgroup_addrm_files(css, cgrp, cfts, true);
@@ -2050,7 +2063,7 @@ int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask)
        }
        root_cgrp->kn = kernfs_root_to_node(root->kf_root);
        WARN_ON_ONCE(cgroup_ino(root_cgrp) != 1);
-       root_cgrp->ancestor_ids[0] = cgroup_id(root_cgrp);
+       root_cgrp->ancestors[0] = root_cgrp;

        ret = css_populate_dir(&root_cgrp->self);
        if (ret)
@@ -2173,7 +2186,7 @@ static int cgroup_get_tree(struct fs_context *fc)
        struct cgroup_fs_context *ctx = cgroup_fc2context(fc);
        int ret;

-       cgrp_dfl_visible = true;
+       WRITE_ONCE(cgrp_dfl_visible, true);
        cgroup_get_live(&cgrp_dfl_root.cgrp);
        ctx->root = &cgrp_dfl_root;
@@ -2361,7 +2374,7 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
                ret = cgroup_path_ns_locked(cgrp, buf, buflen, &init_cgroup_ns);
        } else {
                /* if no hierarchy exists, everyone is in "/" */
-               ret = strlcpy(buf, "/", buflen);
+               ret = strscpy(buf, "/", buflen);
        }

        spin_unlock_irq(&css_set_lock);
@@ -2393,7 +2406,7 @@ EXPORT_SYMBOL_GPL(task_cgroup_path);
  * write-locking cgroup_threadgroup_rwsem. This allows ->attach() to assume that
  * CPU hotplug is disabled on entry.
  */
-static void cgroup_attach_lock(bool lock_threadgroup)
+void cgroup_attach_lock(bool lock_threadgroup)
 {
        cpus_read_lock();
        if (lock_threadgroup)
@@ -2404,7 +2417,7 @@ static void cgroup_attach_lock(bool lock_threadgroup)
  * cgroup_attach_unlock - Undo cgroup_attach_lock()
  * @lock_threadgroup: whether to up_write cgroup_threadgroup_rwsem
  */
-static void cgroup_attach_unlock(bool lock_threadgroup)
+void cgroup_attach_unlock(bool lock_threadgroup)
 {
        if (lock_threadgroup)
                percpu_up_write(&cgroup_threadgroup_rwsem);
@@ -3292,11 +3305,7 @@ static int cgroup_apply_control(struct cgroup *cgrp)
         * making the following cgroup_update_dfl_csses() properly update
         * css associations of all tasks in the subtree.
         */
-       ret = cgroup_update_dfl_csses(cgrp);
-       if (ret)
-               return ret;
-
-       return 0;
+       return cgroup_update_dfl_csses(cgrp);
 }

 /*
...@@ -4132,8 +4141,6 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css, ...@@ -4132,8 +4141,6 @@ static int cgroup_addrm_files(struct cgroup_subsys_state *css,
restart: restart:
for (cft = cfts; cft != cft_end && cft->name[0] != '\0'; cft++) { for (cft = cfts; cft != cft_end && cft->name[0] != '\0'; cft++) {
/* does cft->flags tell us to skip this file on @cgrp? */ /* does cft->flags tell us to skip this file on @cgrp? */
if ((cft->flags & CFTYPE_PRESSURE) && !cgroup_psi_enabled())
continue;
if ((cft->flags & __CFTYPE_ONLY_ON_DFL) && !cgroup_on_dfl(cgrp)) if ((cft->flags & __CFTYPE_ONLY_ON_DFL) && !cgroup_on_dfl(cgrp))
continue; continue;
if ((cft->flags & __CFTYPE_NOT_ON_DFL) && cgroup_on_dfl(cgrp)) if ((cft->flags & __CFTYPE_NOT_ON_DFL) && cgroup_on_dfl(cgrp))
@@ -4198,21 +4205,25 @@ static void cgroup_exit_cftypes(struct cftype *cfts)
                cft->ss = NULL;

                /* revert flags set by cgroup core while adding @cfts */
-               cft->flags &= ~(__CFTYPE_ONLY_ON_DFL | __CFTYPE_NOT_ON_DFL);
+               cft->flags &= ~(__CFTYPE_ONLY_ON_DFL | __CFTYPE_NOT_ON_DFL |
+                               __CFTYPE_ADDED);
        }
 }

 static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
 {
        struct cftype *cft;
+       int ret = 0;

        for (cft = cfts; cft->name[0] != '\0'; cft++) {
                struct kernfs_ops *kf_ops;

                WARN_ON(cft->ss || cft->kf_ops);

-               if ((cft->flags & CFTYPE_PRESSURE) && !cgroup_psi_enabled())
-                       continue;
+               if (cft->flags & __CFTYPE_ADDED) {
+                       ret = -EBUSY;
+                       break;
+               }

                if (cft->seq_start)
                        kf_ops = &cgroup_kf_ops;
@@ -4226,26 +4237,26 @@ static int cgroup_init_cftypes(struct cgroup_subsys *ss, struct cftype *cfts)
                if (cft->max_write_len && cft->max_write_len != PAGE_SIZE) {
                        kf_ops = kmemdup(kf_ops, sizeof(*kf_ops), GFP_KERNEL);
                        if (!kf_ops) {
-                               cgroup_exit_cftypes(cfts);
-                               return -ENOMEM;
+                               ret = -ENOMEM;
+                               break;
                        }
                        kf_ops->atomic_write_len = cft->max_write_len;
                }

                cft->kf_ops = kf_ops;
                cft->ss = ss;
+               cft->flags |= __CFTYPE_ADDED;
        }

-       return 0;
+       if (ret)
+               cgroup_exit_cftypes(cfts);
+       return ret;
 }

 static int cgroup_rm_cftypes_locked(struct cftype *cfts)
 {
        lockdep_assert_held(&cgroup_mutex);

-       if (!cfts || !cfts[0].ss)
-               return -ENOENT;
-
        list_del(&cfts->node);
        cgroup_apply_cftypes(cfts, false);
        cgroup_exit_cftypes(cfts);
@@ -4267,6 +4278,12 @@ int cgroup_rm_cftypes(struct cftype *cfts)
 {
        int ret;

+       if (!cfts || cfts[0].name[0] == '\0')
+               return 0;
+
+       if (!(cfts[0].flags & __CFTYPE_ADDED))
+               return -ENOENT;
+
        mutex_lock(&cgroup_mutex);
        ret = cgroup_rm_cftypes_locked(cfts);
        mutex_unlock(&cgroup_mutex);
@@ -5151,10 +5168,13 @@ static struct cftype cgroup_base_files[] = {
                .name = "cpu.stat",
                .seq_show = cpu_stat_show,
        },
+       { }     /* terminate */
+};
+
+static struct cftype cgroup_psi_files[] = {
 #ifdef CONFIG_PSI
        {
                .name = "io.pressure",
-               .flags = CFTYPE_PRESSURE,
                .seq_show = cgroup_io_pressure_show,
                .write = cgroup_io_pressure_write,
                .poll = cgroup_pressure_poll,
@@ -5162,7 +5182,6 @@ static struct cftype cgroup_base_files[] = {
        },
        {
                .name = "memory.pressure",
-               .flags = CFTYPE_PRESSURE,
                .seq_show = cgroup_memory_pressure_show,
                .write = cgroup_memory_pressure_write,
                .poll = cgroup_pressure_poll,
@@ -5170,7 +5189,6 @@ static struct cftype cgroup_base_files[] = {
        },
        {
                .name = "cpu.pressure",
-               .flags = CFTYPE_PRESSURE,
                .seq_show = cgroup_cpu_pressure_show,
                .write = cgroup_cpu_pressure_write,
                .poll = cgroup_pressure_poll,
@@ -5452,8 +5470,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
        int ret;

        /* allocate the cgroup and its ID, 0 is reserved for the root */
-       cgrp = kzalloc(struct_size(cgrp, ancestor_ids, (level + 1)),
-                      GFP_KERNEL);
+       cgrp = kzalloc(struct_size(cgrp, ancestors, (level + 1)), GFP_KERNEL);
        if (!cgrp)
                return ERR_PTR(-ENOMEM);
@@ -5505,7 +5522,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent, const char *name,
        spin_lock_irq(&css_set_lock);
        for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
-               cgrp->ancestor_ids[tcgrp->level] = cgroup_id(tcgrp);
+               cgrp->ancestors[tcgrp->level] = tcgrp;

                if (tcgrp != cgrp) {
                        tcgrp->nr_descendants++;
@@ -5938,6 +5955,7 @@ int __init cgroup_init(void)
        BUILD_BUG_ON(CGROUP_SUBSYS_COUNT > 16);

        BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
+       BUG_ON(cgroup_init_cftypes(NULL, cgroup_psi_files));
        BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));

        cgroup_rstat_boot();
@@ -6058,19 +6076,22 @@ void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 /*
  * cgroup_get_from_id : get the cgroup associated with cgroup id
  * @id: cgroup id
- * On success return the cgrp, on failure return NULL
+ * On success return the cgrp or ERR_PTR on failure
+ * Only cgroups within current task's cgroup NS are valid.
  */
 struct cgroup *cgroup_get_from_id(u64 id)
 {
        struct kernfs_node *kn;
-       struct cgroup *cgrp = NULL;
+       struct cgroup *cgrp, *root_cgrp;

        kn = kernfs_find_and_get_node_by_id(cgrp_dfl_root.kf_root, id);
        if (!kn)
-               goto out;
+               return ERR_PTR(-ENOENT);

-       if (kernfs_type(kn) != KERNFS_DIR)
-               goto put;
+       if (kernfs_type(kn) != KERNFS_DIR) {
+               kernfs_put(kn);
+               return ERR_PTR(-ENOENT);
+       }

        rcu_read_lock();
@@ -6079,9 +6100,19 @@ struct cgroup *cgroup_get_from_id(u64 id)
                cgrp = NULL;

        rcu_read_unlock();
-put:
        kernfs_put(kn);
-out:
+
+       if (!cgrp)
+               return ERR_PTR(-ENOENT);
+
+       spin_lock_irq(&css_set_lock);
+       root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
+       spin_unlock_irq(&css_set_lock);
+
+       if (!cgroup_is_descendant(cgrp, root_cgrp)) {
+               cgroup_put(cgrp);
+               return ERR_PTR(-ENOENT);
+       }
+
        return cgrp;
 }
 EXPORT_SYMBOL_GPL(cgroup_get_from_id);
@@ -6111,7 +6142,7 @@ int proc_cgroup_show(struct seq_file *m, struct pid_namespace *ns,
                struct cgroup *cgrp;
                int ssid, count = 0;

-               if (root == &cgrp_dfl_root && !cgrp_dfl_visible)
+               if (root == &cgrp_dfl_root && !READ_ONCE(cgrp_dfl_visible))
                        continue;

                seq_printf(m, "%d:", root->hierarchy_id);
@@ -6653,8 +6684,12 @@ struct cgroup *cgroup_get_from_path(const char *path)
 {
        struct kernfs_node *kn;
        struct cgroup *cgrp = ERR_PTR(-ENOENT);
+       struct cgroup *root_cgrp;

-       kn = kernfs_walk_and_get(cgrp_dfl_root.cgrp.kn, path);
+       spin_lock_irq(&css_set_lock);
+       root_cgrp = current_cgns_cgroup_from_root(&cgrp_dfl_root);
+       kn = kernfs_walk_and_get(root_cgrp->kn, path);
+       spin_unlock_irq(&css_set_lock);
        if (!kn)
                goto out;
@@ -6812,9 +6847,6 @@ static ssize_t show_delegatable_files(struct cftype *files, char *buf,
                if (!(cft->flags & CFTYPE_NS_DELEGATABLE))
                        continue;

-               if ((cft->flags & CFTYPE_PRESSURE) && !cgroup_psi_enabled())
-                       continue;
-
                if (prefix)
                        ret += snprintf(buf + ret, size - ret, "%s.", prefix);
@@ -6834,8 +6866,11 @@ static ssize_t delegate_show(struct kobject *kobj, struct kobj_attribute *attr,
        int ssid;
        ssize_t ret = 0;

-       ret = show_delegatable_files(cgroup_base_files, buf, PAGE_SIZE - ret,
-                                    NULL);
+       ret = show_delegatable_files(cgroup_base_files, buf + ret,
+                                    PAGE_SIZE - ret, NULL);
+       if (cgroup_psi_enabled())
+               ret += show_delegatable_files(cgroup_psi_files, buf + ret,
+                                             PAGE_SIZE - ret, NULL);

        for_each_subsys(ss, ssid)
                ret += show_delegatable_files(ss->dfl_cftypes, buf + ret,
...
@@ -33,6 +33,7 @@
 #include <linux/interrupt.h>
 #include <linux/kernel.h>
 #include <linux/kmod.h>
+#include <linux/kthread.h>
 #include <linux/list.h>
 #include <linux/mempolicy.h>
 #include <linux/mm.h>
...@@ -85,6 +86,30 @@ struct fmeter { ...@@ -85,6 +86,30 @@ struct fmeter {
spinlock_t lock; /* guards read or write of above */ spinlock_t lock; /* guards read or write of above */
}; };
+/*
+ * Invalid partition error code
+ */
+enum prs_errcode {
+	PERR_NONE = 0,
+	PERR_INVCPUS,
+	PERR_INVPARENT,
+	PERR_NOTPART,
+	PERR_NOTEXCL,
+	PERR_NOCPUS,
+	PERR_HOTPLUG,
+	PERR_CPUSEMPTY,
+};
+
+static const char * const perr_strings[] = {
+	[PERR_INVCPUS]   = "Invalid cpu list in cpuset.cpus",
+	[PERR_INVPARENT] = "Parent is an invalid partition root",
+	[PERR_NOTPART]   = "Parent is not a partition root",
+	[PERR_NOTEXCL]   = "Cpu list in cpuset.cpus not exclusive",
+	[PERR_NOCPUS]    = "Parent unable to distribute cpu downstream",
+	[PERR_HOTPLUG]   = "No cpu available due to hotplug",
+	[PERR_CPUSEMPTY] = "cpuset.cpus is empty",
+};
 struct cpuset {
 	struct cgroup_subsys_state css;
@@ -168,6 +193,9 @@ struct cpuset {
 	int use_parent_ecpus;
 	int child_ecpus_count;
+	/* Invalid partition error code, not lock protected */
+	enum prs_errcode prs_err;
+
 	/* Handle for cpuset.cpus.partition */
 	struct cgroup_file partition_file;
 };
@@ -175,20 +203,22 @@ struct cpuset {
 /*
  * Partition root states:
  *
- *   0 - not a partition root
- *
+ *   0 - member (not a partition root)
  *   1 - partition root
- *
- *  -1 - invalid partition root
- *       None of the cpus in cpus_allowed can be put into the parent's
- *       subparts_cpus. In this case, the cpuset is not a real partition
- *       root anymore.  However, the CPU_EXCLUSIVE bit will still be set
- *       and the cpuset can be restored back to a partition root if the
- *       parent cpuset can give more CPUs back to this child cpuset.
+ *   2 - partition root without load balancing (isolated)
+ *  -1 - invalid partition root
+ *  -2 - invalid isolated partition root
  */
-#define PRS_DISABLED		0
-#define PRS_ENABLED		1
-#define PRS_ERROR		-1
+#define PRS_MEMBER		0
+#define PRS_ROOT		1
+#define PRS_ISOLATED		2
+#define PRS_INVALID_ROOT	-1
+#define PRS_INVALID_ISOLATED	-2
+
+static inline bool is_prs_invalid(int prs_state)
+{
+	return prs_state < 0;
+}
 /*
  * Temporary cpumasks for working with partitions that are passed among
@@ -268,25 +298,43 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
-static inline int is_partition_root(const struct cpuset *cs)
+static inline int is_partition_valid(const struct cpuset *cs)
 {
 	return cs->partition_root_state > 0;
 }
 
+static inline int is_partition_invalid(const struct cpuset *cs)
+{
+	return cs->partition_root_state < 0;
+}
+
+/*
+ * Callers should hold callback_lock to modify partition_root_state.
+ */
+static inline void make_partition_invalid(struct cpuset *cs)
+{
+	if (is_partition_valid(cs))
+		cs->partition_root_state = -cs->partition_root_state;
+}
+
 /*
  * Send notification event of whenever partition_root_state changes.
  */
-static inline void notify_partition_change(struct cpuset *cs,
-					   int old_prs, int new_prs)
+static inline void notify_partition_change(struct cpuset *cs, int old_prs)
 {
-	if (old_prs != new_prs)
-		cgroup_file_notify(&cs->partition_file);
+	if (old_prs == cs->partition_root_state)
+		return;
+	cgroup_file_notify(&cs->partition_file);
+
+	/* Reset prs_err if not invalid */
+	if (is_partition_valid(cs))
+		WRITE_ONCE(cs->prs_err, PERR_NONE);
 }
 
 static struct cpuset top_cpuset = {
 	.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
 		  (1 << CS_MEM_EXCLUSIVE)),
-	.partition_root_state = PRS_ENABLED,
+	.partition_root_state = PRS_ROOT,
 };
 /**
@@ -404,6 +452,41 @@ static inline bool is_in_v2_mode(void)
 	       (cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
 }
+/**
+ * partition_is_populated - check if partition has tasks
+ * @cs: partition root to be checked
+ * @excluded_child: a child cpuset to be excluded in task checking
+ * Return: true if there are tasks, false otherwise
+ *
+ * It is assumed that @cs is a valid partition root. @excluded_child should
+ * be non-NULL when this cpuset is going to become a partition itself.
+ */
+static inline bool partition_is_populated(struct cpuset *cs,
+					  struct cpuset *excluded_child)
+{
+	struct cgroup_subsys_state *css;
+	struct cpuset *child;
+
+	if (cs->css.cgroup->nr_populated_csets)
+		return true;
+	if (!excluded_child && !cs->nr_subparts_cpus)
+		return cgroup_is_populated(cs->css.cgroup);
+
+	rcu_read_lock();
+	cpuset_for_each_child(child, css, cs) {
+		if (child == excluded_child)
+			continue;
+		if (is_partition_valid(child))
+			continue;
+		if (cgroup_is_populated(child->css.cgroup)) {
+			rcu_read_unlock();
+			return true;
+		}
+	}
+	rcu_read_unlock();
+	return false;
+}
 /*
  * Return in pmask the portion of a task's cpusets's cpus_allowed that
  * are online and are capable of running the task.  If none are found,
@@ -658,22 +741,6 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 	par = parent_cs(cur);
 
-	/*
-	 * If either I or some sibling (!= me) is exclusive, we can't
-	 * overlap
-	 */
-	ret = -EINVAL;
-	cpuset_for_each_child(c, css, par) {
-		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
-		    c != cur &&
-		    cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
-			goto out;
-		if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
-		    c != cur &&
-		    nodes_intersects(trial->mems_allowed, c->mems_allowed))
-			goto out;
-	}
-
 	/*
 	 * Cpusets with tasks - existing or newly being attached - can't
 	 * be changed to have empty cpus_allowed or mems_allowed.
@@ -698,6 +765,22 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
 				    trial->cpus_allowed))
 			goto out;
 
+	/*
+	 * If either I or some sibling (!= me) is exclusive, we can't
+	 * overlap
+	 */
+	ret = -EINVAL;
+	cpuset_for_each_child(c, css, par) {
+		if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
+		    c != cur &&
+		    cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
+			goto out;
+		if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
+		    c != cur &&
+		    nodes_intersects(trial->mems_allowed, c->mems_allowed))
+			goto out;
+	}
+
 	ret = 0;
 out:
 	rcu_read_unlock();
@@ -875,7 +958,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 		csa[csn++] = cp;
 
 		/* skip @cp's subtree if not a partition root */
-		if (!is_partition_root(cp))
+		if (!is_partition_valid(cp))
 			pos_css = css_rightmost_descendant(pos_css);
 	}
 	rcu_read_unlock();
@@ -1081,7 +1164,7 @@ static void rebuild_sched_domains_locked(void)
 	if (top_cpuset.nr_subparts_cpus) {
 		rcu_read_lock();
 		cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
-			if (!is_partition_root(cs)) {
+			if (!is_partition_valid(cs)) {
 				pos_css = css_rightmost_descendant(pos_css);
 				continue;
 			}
@@ -1127,10 +1210,18 @@ static void update_tasks_cpumask(struct cpuset *cs)
 {
 	struct css_task_iter it;
 	struct task_struct *task;
+	bool top_cs = cs == &top_cpuset;
 
 	css_task_iter_start(&cs->css, 0, &it);
-	while ((task = css_task_iter_next(&it)))
+	while ((task = css_task_iter_next(&it))) {
+		/*
+		 * Percpu kthreads in top_cpuset are ignored
+		 */
+		if (top_cs && (task->flags & PF_KTHREAD) &&
+		    kthread_is_per_cpu(task))
+			continue;
 		set_cpus_allowed_ptr(task, cs->effective_cpus);
+	}
 	css_task_iter_end(&it);
 }
@@ -1165,15 +1256,18 @@ enum subparts_cmd {
 	partcmd_enable,		/* Enable partition root	  */
 	partcmd_disable,	/* Disable partition root	  */
 	partcmd_update,		/* Update parent's subparts_cpus  */
+	partcmd_invalidate,	/* Make partition invalid	  */
 };
 
+static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
+		       int turning_on);
+
 /**
  * update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
  * @cpuset:  The cpuset that requests change in partition root state
  * @cmd:     Partition root state change command
  * @newmask: Optional new cpumask for partcmd_update
  * @tmp:     Temporary addmask and delmask
- * Return:   0, 1 or an error code
+ * Return:   0 or a partition root state error code
 *
 * For partcmd_enable, the cpuset is being transformed from a non-partition
 * root to a partition root. The cpus_allowed mask of the given cpuset will
@@ -1184,38 +1278,36 @@ enum subparts_cmd {
 * For partcmd_disable, the cpuset is being transformed from a partition
 * root back to a non-partition root. Any CPUs in cpus_allowed that are in
 * parent's subparts_cpus will be taken away from that cpumask and put back
- * into parent's effective_cpus. 0 should always be returned.
+ * into parent's effective_cpus. 0 will always be returned.
 *
- * For partcmd_update, if the optional newmask is specified, the cpu
- * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same. The cpuset should either
- * be a partition root or an invalid partition root. The partition root
- * state may change if newmask is NULL and none of the requested CPUs can
- * be granted by the parent. The function will return 1 if changes to
- * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
- * Error code should only be returned when newmask is non-NULL.
+ * For partcmd_update, if the optional newmask is specified, the cpu list is
+ * to be changed from cpus_allowed to newmask. Otherwise, cpus_allowed is
+ * assumed to remain the same. The cpuset should either be a valid or invalid
+ * partition root. The partition root state may change from valid to invalid
+ * or vice versa. An error code will only be returned if transitioning from
+ * invalid to valid violates the exclusivity rule.
 *
- * The partcmd_enable and partcmd_disable commands are used by
- * update_prstate(). The partcmd_update command is used by
- * update_cpumasks_hier() with newmask NULL and update_cpumask() with
- * newmask set.
+ * For partcmd_invalidate, the current partition will be made invalid.
 *
- * The checking is more strict when enabling partition root than the
- * other two commands.
+ * The partcmd_enable and partcmd_disable commands are used by
+ * update_prstate(). An error code may be returned and the caller will check
+ * for error.
 *
- * Because of the implicit cpu exclusive nature of a partition root,
- * cpumask changes that violates the cpu exclusivity rule will not be
- * permitted when checked by validate_change().
+ * The partcmd_update command is used by update_cpumasks_hier() with newmask
+ * NULL and update_cpumask() with newmask set. The partcmd_invalidate is used
+ * by update_cpumask() with NULL newmask. In both cases, the callers won't
+ * check for error and so partition_root_state and prs_error will be updated
+ * directly.
 */
-static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
 					  struct cpumask *newmask,
 					  struct tmpmasks *tmp)
 {
-	struct cpuset *parent = parent_cs(cpuset);
+	struct cpuset *parent = parent_cs(cs);
 	int adding;	/* Moving cpus from effective_cpus to subparts_cpus */
 	int deleting;	/* Moving cpus from subparts_cpus to effective_cpus */
 	int old_prs, new_prs;
-	bool part_error = false;	/* Partition error? */
+	int part_error = PERR_NONE;	/* Partition error? */
 
 	percpu_rwsem_assert_held(&cpuset_rwsem);
@@ -1224,125 +1316,164 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	 * The new cpumask, if present, or the current cpus_allowed must
 	 * not be empty.
 	 */
-	if (!is_partition_root(parent) ||
-	    (newmask && cpumask_empty(newmask)) ||
-	    (!newmask && cpumask_empty(cpuset->cpus_allowed)))
-		return -EINVAL;
-
-	/*
-	 * Enabling/disabling partition root is not allowed if there are
-	 * online children.
-	 */
-	if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
-		return -EBUSY;
-
-	/*
-	 * Enabling partition root is not allowed if not all the CPUs
-	 * can be granted from parent's effective_cpus or at least one
-	 * CPU will be left after that.
-	 */
-	if ((cmd == partcmd_enable) &&
-	    (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
-	     cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
-		return -EINVAL;
+	if (!is_partition_valid(parent)) {
+		return is_partition_invalid(parent)
+		       ? PERR_INVPARENT : PERR_NOTPART;
+	}
+	if ((newmask && cpumask_empty(newmask)) ||
+	    (!newmask && cpumask_empty(cs->cpus_allowed)))
+		return PERR_CPUSEMPTY;
 
 	/*
-	 * A cpumask update cannot make parent's effective_cpus become empty.
+	 * new_prs will only be changed for the partcmd_update and
+	 * partcmd_invalidate commands.
 	 */
 	adding = deleting = false;
-	old_prs = new_prs = cpuset->partition_root_state;
+	old_prs = new_prs = cs->partition_root_state;
 	if (cmd == partcmd_enable) {
-		cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
+		/*
+		 * Enabling partition root is not allowed if cpus_allowed
+		 * doesn't overlap parent's cpus_allowed.
+		 */
+		if (!cpumask_intersects(cs->cpus_allowed, parent->cpus_allowed))
+			return PERR_INVCPUS;
+
+		/*
+		 * A parent can be left with no CPU as long as there is no
+		 * task directly associated with the parent partition.
+		 */
+		if (!cpumask_intersects(cs->cpus_allowed, parent->effective_cpus) &&
+		    partition_is_populated(parent, cs))
+			return PERR_NOCPUS;
+
+		cpumask_copy(tmp->addmask, cs->cpus_allowed);
 		adding = true;
 	} else if (cmd == partcmd_disable) {
-		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+		/*
+		 * Need to remove cpus from parent's subparts_cpus for valid
+		 * partition root.
+		 */
+		deleting = !is_prs_invalid(old_prs) &&
+			   cpumask_and(tmp->delmask, cs->cpus_allowed,
+				       parent->subparts_cpus);
+	} else if (cmd == partcmd_invalidate) {
+		if (is_prs_invalid(old_prs))
+			return 0;
+
+		/*
+		 * Make the current partition invalid. It is assumed that
+		 * invalidation is caused by violating cpu exclusivity rule.
+		 */
+		deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
 				       parent->subparts_cpus);
+		if (old_prs > 0) {
+			new_prs = -old_prs;
+			part_error = PERR_NOTEXCL;
+		}
 	} else if (newmask) {
 		/*
 		 * partcmd_update with newmask:
 		 *
+		 * Compute add/delete mask to/from subparts_cpus
+		 *
 		 * delmask = cpus_allowed & ~newmask & parent->subparts_cpus
-		 * addmask = newmask & parent->effective_cpus
+		 * addmask = newmask & parent->cpus_allowed
 		 *	     & ~parent->subparts_cpus
 		 */
-		cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
+		cpumask_andnot(tmp->delmask, cs->cpus_allowed, newmask);
 		deleting = cpumask_and(tmp->delmask, tmp->delmask,
 				       parent->subparts_cpus);
 
-		cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
+		cpumask_and(tmp->addmask, newmask, parent->cpus_allowed);
 		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
 					parent->subparts_cpus);
 		/*
-		 * Return error if the new effective_cpus could become empty.
+		 * Make partition invalid if parent's effective_cpus could
+		 * become empty and there are tasks in the parent.
 		 */
 		if (adding &&
-		    cpumask_equal(parent->effective_cpus, tmp->addmask)) {
-			if (!deleting)
-				return -EINVAL;
-			/*
-			 * As some of the CPUs in subparts_cpus might have
-			 * been offlined, we need to compute the real delmask
-			 * to confirm that.
-			 */
-			if (!cpumask_and(tmp->addmask, tmp->delmask,
-					 cpu_active_mask))
-				return -EINVAL;
-			cpumask_copy(tmp->addmask, parent->effective_cpus);
+		    cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+		    !cpumask_intersects(tmp->delmask, cpu_active_mask) &&
+		    partition_is_populated(parent, cs)) {
+			part_error = PERR_NOCPUS;
+			adding = false;
+			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+					       parent->subparts_cpus);
 		}
 	} else {
 		/*
 		 * partcmd_update w/o newmask:
 		 *
-		 * addmask = cpus_allowed & parent->effective_cpus
+		 * delmask = cpus_allowed & parent->subparts_cpus
+		 * addmask = cpus_allowed & parent->cpus_allowed
+		 *	     & ~parent->subparts_cpus
 		 *
-		 * Note that parent's subparts_cpus may have been
-		 * pre-shrunk in case there is a change in the cpu list.
-		 * So no deletion is needed.
+		 * This gets invoked either due to a hotplug event or from
+		 * update_cpumasks_hier(). This can cause the state of a
+		 * partition root to transition from valid to invalid or vice
+		 * versa. So we still need to compute the addmask and delmask.
+		 * A partition error happens when:
+		 * 1) Cpuset is valid partition, but parent does not distribute
+		 *    out any CPUs.
+		 * 2) Parent has tasks and all its effective CPUs will have
+		 *    to be distributed out.
 		 */
-		adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
-				     parent->effective_cpus);
-		part_error = cpumask_equal(tmp->addmask,
-					   parent->effective_cpus);
+		cpumask_and(tmp->addmask, cs->cpus_allowed,
+			    parent->cpus_allowed);
+		adding = cpumask_andnot(tmp->addmask, tmp->addmask,
+					parent->subparts_cpus);
+
+		if ((is_partition_valid(cs) && !parent->nr_subparts_cpus) ||
+		    (adding &&
+		     cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+		     partition_is_populated(parent, cs))) {
+			part_error = PERR_NOCPUS;
+			adding = false;
+		}
+
+		if (part_error && is_partition_valid(cs) &&
+		    parent->nr_subparts_cpus)
+			deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+					       parent->subparts_cpus);
 	}
+	if (part_error)
+		WRITE_ONCE(cs->prs_err, part_error);
 	if (cmd == partcmd_update) {
-		int prev_prs = cpuset->partition_root_state;
-
 		/*
-		 * Check for possible transition between PRS_ENABLED
-		 * and PRS_ERROR.
+		 * Check for possible transition between valid and invalid
+		 * partition root.
 		 */
-		switch (cpuset->partition_root_state) {
-		case PRS_ENABLED:
+		switch (cs->partition_root_state) {
+		case PRS_ROOT:
+		case PRS_ISOLATED:
 			if (part_error)
-				new_prs = PRS_ERROR;
+				new_prs = -old_prs;
 			break;
-		case PRS_ERROR:
+		case PRS_INVALID_ROOT:
+		case PRS_INVALID_ISOLATED:
 			if (!part_error)
-				new_prs = PRS_ENABLED;
+				new_prs = -old_prs;
 			break;
 		}
-		/*
-		 * Set part_error if previously in invalid state.
-		 */
-		part_error = (prev_prs == PRS_ERROR);
-	}
-
-	if (!part_error && (new_prs == PRS_ERROR))
-		return 0;	/* Nothing need to be done */
-
-	if (new_prs == PRS_ERROR) {
-		/*
-		 * Remove all its cpus from parent's subparts_cpus.
-		 */
-		adding = false;
-		deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
-				       parent->subparts_cpus);
 	}
 	if (!adding && !deleting && (new_prs == old_prs))
 		return 0;
 
+	/*
+	 * Transitioning between invalid to valid or vice versa may require
+	 * changing CS_CPU_EXCLUSIVE and CS_SCHED_LOAD_BALANCE.
+	 */
+	if (old_prs != new_prs) {
+		if (is_prs_invalid(old_prs) && !is_cpu_exclusive(cs) &&
+		    (update_flag(CS_CPU_EXCLUSIVE, cs, 1) < 0))
+			return PERR_NOTEXCL;
+		if (is_prs_invalid(new_prs) && is_cpu_exclusive(cs))
+			update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+	}
+
 	/*
 	 * Change the parent's subparts_cpus.
 	 * Newly added CPUs will be removed from effective_cpus and
@@ -1369,18 +1500,32 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
 	parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);
 
 	if (old_prs != new_prs)
-		cpuset->partition_root_state = new_prs;
+		cs->partition_root_state = new_prs;
 
 	spin_unlock_irq(&callback_lock);
-	notify_partition_change(cpuset, old_prs, new_prs);
 
-	return cmd == partcmd_update;
+	if (adding || deleting)
+		update_tasks_cpumask(parent);
+
+	/*
+	 * Set or clear CS_SCHED_LOAD_BALANCE when partcmd_update, if necessary.
+	 * rebuild_sched_domains_locked() may be called.
+	 */
+	if (old_prs != new_prs) {
+		if (old_prs == PRS_ISOLATED)
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+		else if (new_prs == PRS_ISOLATED)
+			update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+	}
+
+	notify_partition_change(cs, old_prs);
+	return 0;
 }
 /*
  * update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
  * @cs:  the cpuset to consider
  * @tmp: temp variables for calculating effective_cpus & partition setup
+ * @force: don't skip any descendant cpusets if set
  *
  * When configured cpumask is changed, the effective cpumasks of this cpuset
  * and all its descendants need to be updated.
@@ -1389,7 +1534,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
  *
  * Called with cpuset_rwsem held
  */
-static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
+static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
+				 bool force)
 {
 	struct cpuset *cp;
 	struct cgroup_subsys_state *pos_css;
@@ -1399,14 +1545,21 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 	rcu_read_lock();
 	cpuset_for_each_descendant_pre(cp, pos_css, cs) {
 		struct cpuset *parent = parent_cs(cp);
+		bool update_parent = false;
 
 		compute_effective_cpumask(tmp->new_cpus, cp, parent);
 
 		/*
 		 * If it becomes empty, inherit the effective mask of the
-		 * parent, which is guaranteed to have some CPUs.
+		 * parent, which is guaranteed to have some CPUs unless
+		 * it is a partition root that has explicitly distributed
+		 * out all its CPUs.
 		 */
 		if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+			if (is_partition_valid(cp) &&
+			    cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+				goto update_parent_subparts;
+
 			cpumask_copy(tmp->new_cpus, parent->effective_cpus);
 			if (!cp->use_parent_ecpus) {
 				cp->use_parent_ecpus = true;
@@ -1420,14 +1573,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		/*
 		 * Skip the whole subtree if the cpumask remains the same
-		 * and has no partition root state.
+		 * and has no partition root state and force flag not set.
 		 */
-		if (!cp->partition_root_state &&
+		if (!cp->partition_root_state && !force &&
 		    cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
 			pos_css = css_rightmost_descendant(pos_css);
 			continue;
 		}
 
+update_parent_subparts:
 		/*
 		 * update_parent_subparts_cpumask() should have been called
 		 * for cs already in update_cpumask(). We should also call
@@ -1437,36 +1591,22 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		old_prs = new_prs = cp->partition_root_state;
 		if ((cp != cs) && old_prs) {
 			switch (parent->partition_root_state) {
-			case PRS_DISABLED:
-				/*
-				 * If parent is not a partition root or an
-				 * invalid partition root, clear its state
-				 * and its CS_CPU_EXCLUSIVE flag.
-				 */
-				WARN_ON_ONCE(cp->partition_root_state
-					     != PRS_ERROR);
-				new_prs = PRS_DISABLED;
-
-				/*
-				 * clear_bit() is an atomic operation and
-				 * readers aren't interested in the state
-				 * of CS_CPU_EXCLUSIVE anyway. So we can
-				 * just update the flag without holding
-				 * the callback_lock.
-				 */
-				clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
+			case PRS_ROOT:
+			case PRS_ISOLATED:
+				update_parent = true;
 				break;
 
-			case PRS_ENABLED:
-				if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
-					update_tasks_cpumask(parent);
-				break;
-
-			case PRS_ERROR:
+			default:
 				/*
-				 * When parent is invalid, it has to be too.
+				 * When parent is not a partition root or is
+				 * invalid, child partition roots become
+				 * invalid too.
 				 */
-				new_prs = PRS_ERROR;
+				if (is_partition_valid(cp))
+					new_prs = -cp->partition_root_state;
+				WRITE_ONCE(cp->prs_err,
+					   is_partition_invalid(parent)
+					   ? PERR_INVPARENT : PERR_NOTPART);
 				break;
 			}
 		}
@@ -1475,42 +1615,44 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 			continue;
 		rcu_read_unlock();
 
+		if (update_parent) {
+			update_parent_subparts_cpumask(cp, partcmd_update, NULL,
+						       tmp);
+			/*
+			 * The cpuset partition_root_state may become
+			 * invalid. Capture it.
+			 */
+			new_prs = cp->partition_root_state;
+		}
+
 		spin_lock_irq(&callback_lock);
 
+		if (cp->nr_subparts_cpus && !is_partition_valid(cp)) {
+			/*
+			 * Put all active subparts_cpus back to effective_cpus.
+			 */
+			cpumask_or(tmp->new_cpus, tmp->new_cpus,
+				   cp->subparts_cpus);
+			cpumask_and(tmp->new_cpus, tmp->new_cpus,
+				    cpu_active_mask);
+			cp->nr_subparts_cpus = 0;
+			cpumask_clear(cp->subparts_cpus);
+		}
+
 		cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-		if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
-			cp->nr_subparts_cpus = 0;
-			cpumask_clear(cp->subparts_cpus);
-		} else if (cp->nr_subparts_cpus) {
+		if (cp->nr_subparts_cpus) {
 			/*
 			 * Make sure that effective_cpus & subparts_cpus
 			 * are mutually exclusive.
-			 *
-			 * In the unlikely event that effective_cpus
-			 * becomes empty. we clear cp->nr_subparts_cpus and
-			 * let its child partition roots to compete for
-			 * CPUs again.
 			 */
 			cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
 				       cp->subparts_cpus);
-			if (cpumask_empty(cp->effective_cpus)) {
-				cpumask_copy(cp->effective_cpus, tmp->new_cpus);
-				cpumask_clear(cp->subparts_cpus);
-				cp->nr_subparts_cpus = 0;
-			} else if (!cpumask_subset(cp->subparts_cpus,
-						   tmp->new_cpus)) {
-				cpumask_andnot(cp->subparts_cpus,
-					cp->subparts_cpus, tmp->new_cpus);
-				cp->nr_subparts_cpus
-					= cpumask_weight(cp->subparts_cpus);
-			}
 		}
 
-		if (new_prs != old_prs)
-			cp->partition_root_state = new_prs;
-
+		cp->partition_root_state = new_prs;
 		spin_unlock_irq(&callback_lock);
-		notify_partition_change(cp, old_prs, new_prs);
+
+		notify_partition_change(cp, old_prs);
 		WARN_ON(!is_in_v2_mode() &&
 			!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -1526,7 +1668,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
 		if (!cpumask_empty(cp->cpus_allowed) &&
 		    is_sched_load_balance(cp) &&
 		   (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
-		    is_partition_root(cp)))
+		    is_partition_valid(cp)))
 			need_rebuild_sched_domains = true;
 
 		rcu_read_lock();
@@ -1570,7 +1712,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
 			continue;
 		rcu_read_unlock();
 
-		update_cpumasks_hier(sibling, tmp);
+		update_cpumasks_hier(sibling, tmp, false);
 
 		rcu_read_lock();
 		css_put(&sibling->css);
 	}
@@ -1588,6 +1730,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 {
 	int retval;
 	struct tmpmasks tmp;
+	bool invalidate = false;
 
 	/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
 	if (cs == &top_cpuset)
@@ -1615,10 +1758,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
 	if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
 		return 0;
 
-	retval = validate_change(cs, trialcs);
-	if (retval < 0)
-		return retval;
-
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	/*
 	 * Use the cpumasks in trialcs for tmpmasks when they are pointers
@@ -1629,28 +1768,70 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
tmp.new_cpus = trialcs->cpus_allowed; tmp.new_cpus = trialcs->cpus_allowed;
#endif #endif
retval = validate_change(cs, trialcs);
if ((retval == -EINVAL) && cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
struct cpuset *cp, *parent;
struct cgroup_subsys_state *css;
/*
* The -EINVAL error code indicates that the partition sibling
* CPU exclusivity rule has been violated. We still allow
* the cpumask change to proceed while invalidating the
* partition. However, any conflicting sibling partitions
* have to be marked as invalid too.
*/
invalidate = true;
rcu_read_lock();
parent = parent_cs(cs);
cpuset_for_each_child(cp, css, parent)
if (is_partition_valid(cp) &&
cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
rcu_read_unlock();
update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
rcu_read_lock();
}
rcu_read_unlock();
retval = 0;
}
if (retval < 0)
return retval;
if (cs->partition_root_state) { if (cs->partition_root_state) {
/* Cpumask of a partition root cannot be empty */ if (invalidate)
if (cpumask_empty(trialcs->cpus_allowed)) update_parent_subparts_cpumask(cs, partcmd_invalidate,
return -EINVAL; NULL, &tmp);
if (update_parent_subparts_cpumask(cs, partcmd_update, else
trialcs->cpus_allowed, &tmp) < 0) update_parent_subparts_cpumask(cs, partcmd_update,
return -EINVAL; trialcs->cpus_allowed, &tmp);
} }
compute_effective_cpumask(trialcs->effective_cpus, trialcs,
parent_cs(cs));
spin_lock_irq(&callback_lock); spin_lock_irq(&callback_lock);
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed); cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
/* /*
* Make sure that subparts_cpus is a subset of cpus_allowed. * Make sure that subparts_cpus, if not empty, is a subset of
* cpus_allowed. Clear subparts_cpus if partition not valid or
* empty effective cpus with tasks.
*/ */
if (cs->nr_subparts_cpus) { if (cs->nr_subparts_cpus) {
cpumask_and(cs->subparts_cpus, cs->subparts_cpus, cs->cpus_allowed); if (!is_partition_valid(cs) ||
cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus); (cpumask_subset(trialcs->effective_cpus, cs->subparts_cpus) &&
partition_is_populated(cs, NULL))) {
cs->nr_subparts_cpus = 0;
cpumask_clear(cs->subparts_cpus);
} else {
cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
cs->cpus_allowed);
cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
}
} }
spin_unlock_irq(&callback_lock); spin_unlock_irq(&callback_lock);
update_cpumasks_hier(cs, &tmp); /* effective_cpus will be updated here */
update_cpumasks_hier(cs, &tmp, false);
if (cs->partition_root_state) { if (cs->partition_root_state) {
struct cpuset *parent = parent_cs(cs); struct cpuset *parent = parent_cs(cs);
@@ -2026,16 +2207,18 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
return err; return err;
} }
/* /**
* update_prstate - update partition_root_state * update_prstate - update partition_root_state
* cs: the cpuset to update * @cs: the cpuset to update
* new_prs: new partition root state * @new_prs: new partition root state
* Return: 0 if successful, != 0 if error
* *
* Call with cpuset_rwsem held. * Call with cpuset_rwsem held.
*/ */
static int update_prstate(struct cpuset *cs, int new_prs) static int update_prstate(struct cpuset *cs, int new_prs)
{ {
int err, old_prs = cs->partition_root_state; int err = PERR_NONE, old_prs = cs->partition_root_state;
bool sched_domain_rebuilt = false;
struct cpuset *parent = parent_cs(cs); struct cpuset *parent = parent_cs(cs);
struct tmpmasks tmpmask; struct tmpmasks tmpmask;
@@ -2043,28 +2226,33 @@ static int update_prstate(struct cpuset *cs, int new_prs)
return 0; return 0;
/* /*
* Cannot force a partial or invalid partition root to a full * For a previously invalid partition root, leave it at being
* partition root. * invalid if new_prs is not "member".
*/ */
if (new_prs && (old_prs == PRS_ERROR)) if (new_prs && is_prs_invalid(old_prs)) {
return -EINVAL; cs->partition_root_state = -new_prs;
return 0;
}
if (alloc_cpumasks(NULL, &tmpmask)) if (alloc_cpumasks(NULL, &tmpmask))
return -ENOMEM; return -ENOMEM;
err = -EINVAL;
if (!old_prs) { if (!old_prs) {
/* /*
* Turning on partition root requires setting the * Turning on partition root requires setting the
* CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed * CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
* cannot be NULL. * cannot be empty.
*/ */
if (cpumask_empty(cs->cpus_allowed)) if (cpumask_empty(cs->cpus_allowed)) {
err = PERR_CPUSEMPTY;
goto out; goto out;
}
err = update_flag(CS_CPU_EXCLUSIVE, cs, 1); err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
if (err) if (err) {
err = PERR_NOTEXCL;
goto out; goto out;
}
err = update_parent_subparts_cpumask(cs, partcmd_enable, err = update_parent_subparts_cpumask(cs, partcmd_enable,
NULL, &tmpmask); NULL, &tmpmask);
@@ -2072,47 +2260,77 @@ static int update_prstate(struct cpuset *cs, int new_prs)
update_flag(CS_CPU_EXCLUSIVE, cs, 0); update_flag(CS_CPU_EXCLUSIVE, cs, 0);
goto out; goto out;
} }
if (new_prs == PRS_ISOLATED) {
/*
* Disabling the load balance flag should not return an
* error unless the system is running out of memory.
*/
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
sched_domain_rebuilt = true;
}
} else if (old_prs && new_prs) {
/*
* A change in load balance state only, no change in cpumasks.
*/
update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
sched_domain_rebuilt = true;
goto out; /* Sched domain is rebuilt in update_flag() */
} else { } else {
/* /*
* Turning off partition root will clear the * Switching back to member is always allowed even if it
* CS_CPU_EXCLUSIVE bit. * disables child partitions.
*/ */
if (old_prs == PRS_ERROR) { update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
update_flag(CS_CPU_EXCLUSIVE, cs, 0); &tmpmask);
err = 0;
goto out;
}
err = update_parent_subparts_cpumask(cs, partcmd_disable, /*
NULL, &tmpmask); * If there are child partitions, they will all become invalid.
if (err) */
goto out; if (unlikely(cs->nr_subparts_cpus)) {
spin_lock_irq(&callback_lock);
cs->nr_subparts_cpus = 0;
cpumask_clear(cs->subparts_cpus);
compute_effective_cpumask(cs->effective_cpus, cs, parent);
spin_unlock_irq(&callback_lock);
}
/* Turning off CS_CPU_EXCLUSIVE will not return error */ /* Turning off CS_CPU_EXCLUSIVE will not return error */
update_flag(CS_CPU_EXCLUSIVE, cs, 0); update_flag(CS_CPU_EXCLUSIVE, cs, 0);
if (!is_sched_load_balance(cs)) {
/* Make sure load balance is on */
update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
sched_domain_rebuilt = true;
}
} }
/* update_tasks_cpumask(parent);
* Update cpumask of parent's tasks except when it is the top
* cpuset as some system daemons cannot be mapped to other CPUs.
*/
if (parent != &top_cpuset)
update_tasks_cpumask(parent);
if (parent->child_ecpus_count) if (parent->child_ecpus_count)
update_sibling_cpumasks(parent, cs, &tmpmask); update_sibling_cpumasks(parent, cs, &tmpmask);
rebuild_sched_domains_locked(); if (!sched_domain_rebuilt)
rebuild_sched_domains_locked();
out: out:
if (!err) { /*
spin_lock_irq(&callback_lock); * Make partition invalid if an error happens
cs->partition_root_state = new_prs; */
spin_unlock_irq(&callback_lock); if (err)
notify_partition_change(cs, old_prs, new_prs); new_prs = -new_prs;
} spin_lock_irq(&callback_lock);
cs->partition_root_state = new_prs;
spin_unlock_irq(&callback_lock);
/*
* Update child cpusets, if present.
* Force update if switching back to member.
*/
if (!list_empty(&cs->css.children))
update_cpumasks_hier(cs, &tmpmask, !new_prs);
notify_partition_change(cs, old_prs);
free_cpumasks(NULL, &tmpmask); free_cpumasks(NULL, &tmpmask);
return err; return 0;
} }
/* /*
@@ -2238,6 +2456,12 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
(cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))) (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock; goto out_unlock;
/*
* Task cannot be moved to a cpuset with empty effective cpus.
*/
if (cpumask_empty(cs->effective_cpus))
goto out_unlock;
cgroup_taskset_for_each(task, css, tset) { cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task, cs->effective_cpus); ret = task_can_attach(task, cs->effective_cpus);
if (ret) if (ret)
@@ -2598,16 +2822,29 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
static int sched_partition_show(struct seq_file *seq, void *v) static int sched_partition_show(struct seq_file *seq, void *v)
{ {
struct cpuset *cs = css_cs(seq_css(seq)); struct cpuset *cs = css_cs(seq_css(seq));
const char *err, *type = NULL;
switch (cs->partition_root_state) { switch (cs->partition_root_state) {
case PRS_ENABLED: case PRS_ROOT:
seq_puts(seq, "root\n"); seq_puts(seq, "root\n");
break; break;
case PRS_DISABLED: case PRS_ISOLATED:
seq_puts(seq, "isolated\n");
break;
case PRS_MEMBER:
seq_puts(seq, "member\n"); seq_puts(seq, "member\n");
break; break;
case PRS_ERROR: case PRS_INVALID_ROOT:
seq_puts(seq, "root invalid\n"); type = "root";
fallthrough;
case PRS_INVALID_ISOLATED:
if (!type)
type = "isolated";
err = perr_strings[READ_ONCE(cs->prs_err)];
if (err)
seq_printf(seq, "%s invalid (%s)\n", type, err);
else
seq_printf(seq, "%s invalid\n", type);
break; break;
} }
return 0; return 0;
@@ -2626,9 +2863,11 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
* Convert "root" to ENABLED, and convert "member" to DISABLED. * Convert "root" to ENABLED, and convert "member" to DISABLED.
*/ */
if (!strcmp(buf, "root")) if (!strcmp(buf, "root"))
val = PRS_ENABLED; val = PRS_ROOT;
else if (!strcmp(buf, "member")) else if (!strcmp(buf, "member"))
val = PRS_DISABLED; val = PRS_MEMBER;
else if (!strcmp(buf, "isolated"))
val = PRS_ISOLATED;
else else
return -EINVAL; return -EINVAL;
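For orientation, the write path above maps directly onto the cgroup v2 filesystem. A minimal sketch of exercising the three accepted values, assuming cgroup2 is mounted at /sys/fs/cgroup and a cpuset-enabled child cgroup named `test` already exists (both paths are assumptions, not taken from the diff):

```shell
# Illustrative sketch only: needs root and a live cgroup v2 mount.
CG=/sys/fs/cgroup/test                       # assumed path
echo 2-3      > "$CG/cpuset.cpus"            # give the cgroup some CPUs first
echo root     > "$CG/cpuset.cpus.partition"  # -> PRS_ROOT
echo isolated > "$CG/cpuset.cpus.partition"  # -> PRS_ISOLATED (no load balancing)
echo member   > "$CG/cpuset.cpus.partition"  # -> PRS_MEMBER
# Reading back may also show "root invalid (<reason>)" or
# "isolated invalid (<reason>)" once a partition has been invalidated.
cat "$CG/cpuset.cpus.partition"
```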
@@ -2927,7 +3166,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
cpus_read_lock(); cpus_read_lock();
percpu_down_write(&cpuset_rwsem); percpu_down_write(&cpuset_rwsem);
if (is_partition_root(cs)) if (is_partition_valid(cs))
update_prstate(cs, 0); update_prstate(cs, 0);
if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) && if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
@@ -3103,7 +3342,8 @@ hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems, struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated) bool cpus_updated, bool mems_updated)
{ {
if (cpumask_empty(new_cpus)) /* A partition root is allowed to have empty effective cpus */
if (cpumask_empty(new_cpus) && !is_partition_valid(cs))
cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus); cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
if (nodes_empty(*new_mems)) if (nodes_empty(*new_mems))
*new_mems = parent_cs(cs)->effective_mems; *new_mems = parent_cs(cs)->effective_mems;
@@ -3172,11 +3412,31 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
/* /*
* In the unlikely event that a partition root has empty * In the unlikely event that a partition root has empty
* effective_cpus or its parent becomes erroneous, we have to * effective_cpus with tasks, we will have to invalidate child
* transition it to the erroneous state. * partitions, if present, by setting nr_subparts_cpus to 0 to
* reclaim their cpus.
*/ */
if (is_partition_root(cs) && (cpumask_empty(&new_cpus) || if (cs->nr_subparts_cpus && is_partition_valid(cs) &&
(parent->partition_root_state == PRS_ERROR))) { cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) {
spin_lock_irq(&callback_lock);
cs->nr_subparts_cpus = 0;
cpumask_clear(cs->subparts_cpus);
spin_unlock_irq(&callback_lock);
compute_effective_cpumask(&new_cpus, cs, parent);
}
/*
* Force the partition to become invalid if either one of
* the following conditions hold:
* 1) empty effective cpus but not valid empty partition.
* 2) parent is invalid or doesn't grant any cpus to child
* partitions.
*/
if (is_partition_valid(cs) && (!parent->nr_subparts_cpus ||
(cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)))) {
int old_prs, parent_prs;
update_parent_subparts_cpumask(cs, partcmd_disable, NULL, tmp);
if (cs->nr_subparts_cpus) { if (cs->nr_subparts_cpus) {
spin_lock_irq(&callback_lock); spin_lock_irq(&callback_lock);
cs->nr_subparts_cpus = 0; cs->nr_subparts_cpus = 0;
@@ -3185,39 +3445,32 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
compute_effective_cpumask(&new_cpus, cs, parent); compute_effective_cpumask(&new_cpus, cs, parent);
} }
/* old_prs = cs->partition_root_state;
* If the effective_cpus is empty because the child parent_prs = parent->partition_root_state;
* partitions take away all the CPUs, we can keep if (is_partition_valid(cs)) {
* the current partition and let the child partitions spin_lock_irq(&callback_lock);
* fight for available CPUs. make_partition_invalid(cs);
*/ spin_unlock_irq(&callback_lock);
if ((parent->partition_root_state == PRS_ERROR) || if (is_prs_invalid(parent_prs))
cpumask_empty(&new_cpus)) { WRITE_ONCE(cs->prs_err, PERR_INVPARENT);
int old_prs; else if (!parent_prs)
WRITE_ONCE(cs->prs_err, PERR_NOTPART);
update_parent_subparts_cpumask(cs, partcmd_disable, else
NULL, tmp); WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
old_prs = cs->partition_root_state; notify_partition_change(cs, old_prs);
if (old_prs != PRS_ERROR) {
spin_lock_irq(&callback_lock);
cs->partition_root_state = PRS_ERROR;
spin_unlock_irq(&callback_lock);
notify_partition_change(cs, old_prs, PRS_ERROR);
}
} }
cpuset_force_rebuild(); cpuset_force_rebuild();
} }
/* /*
* On the other hand, an erroneous partition root may be transitioned * On the other hand, an invalid partition root may be transitioned
* back to a regular one or a partition root with no CPU allocated * back to a regular one.
* from the parent may change to erroneous.
*/ */
if (is_partition_root(parent) && else if (is_partition_valid(parent) && is_partition_invalid(cs)) {
((cs->partition_root_state == PRS_ERROR) || update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp);
!cpumask_intersects(&new_cpus, parent->subparts_cpus)) && if (is_partition_valid(cs))
update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp)) cpuset_force_rebuild();
cpuset_force_rebuild(); }
update_tasks: update_tasks:
cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus); cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
...
@@ -47,6 +47,7 @@ struct pids_cgroup {
*/ */
atomic64_t counter; atomic64_t counter;
atomic64_t limit; atomic64_t limit;
int64_t watermark;
/* Handle for "pids.events" */ /* Handle for "pids.events" */
struct cgroup_file events_file; struct cgroup_file events_file;
@@ -85,6 +86,16 @@ static void pids_css_free(struct cgroup_subsys_state *css)
kfree(css_pids(css)); kfree(css_pids(css));
} }
static void pids_update_watermark(struct pids_cgroup *p, int64_t nr_pids)
{
/*
* This is racy, but we don't need perfectly accurate tallying of
* the watermark, and this lets us avoid extra atomic overhead.
*/
if (nr_pids > READ_ONCE(p->watermark))
WRITE_ONCE(p->watermark, nr_pids);
}
/** /**
* pids_cancel - uncharge the local pid count * pids_cancel - uncharge the local pid count
* @pids: the pid cgroup state * @pids: the pid cgroup state
@@ -128,8 +139,11 @@ static void pids_charge(struct pids_cgroup *pids, int num)
{ {
struct pids_cgroup *p; struct pids_cgroup *p;
for (p = pids; parent_pids(p); p = parent_pids(p)) for (p = pids; parent_pids(p); p = parent_pids(p)) {
atomic64_add(num, &p->counter); int64_t new = atomic64_add_return(num, &p->counter);
pids_update_watermark(p, new);
}
} }
/** /**
@@ -156,6 +170,12 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
*/ */
if (new > limit) if (new > limit)
goto revert; goto revert;
/*
* Not technically accurate if we go over limit somewhere up
* the hierarchy, but that's tolerable for the watermark.
*/
pids_update_watermark(p, new);
} }
return 0; return 0;
@@ -311,6 +331,14 @@ static s64 pids_current_read(struct cgroup_subsys_state *css,
return atomic64_read(&pids->counter); return atomic64_read(&pids->counter);
} }
static s64 pids_peak_read(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct pids_cgroup *pids = css_pids(css);
return READ_ONCE(pids->watermark);
}
static int pids_events_show(struct seq_file *sf, void *v) static int pids_events_show(struct seq_file *sf, void *v)
{ {
struct pids_cgroup *pids = css_pids(seq_css(sf)); struct pids_cgroup *pids = css_pids(seq_css(sf));
@@ -331,6 +359,11 @@ static struct cftype pids_files[] = {
.read_s64 = pids_current_read, .read_s64 = pids_current_read,
.flags = CFTYPE_NOT_ON_ROOT, .flags = CFTYPE_NOT_ON_ROOT,
}, },
{
.name = "peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = pids_peak_read,
},
{ {
.name = "events", .name = "events",
.seq_show = pids_events_show, .seq_show = pids_events_show,
...
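The watermark logic backing the new `pids.peak` file can be illustrated with a small userspace sketch (a simulation under assumed semantics, not kernel code): the peak only ever ratchets upward, even when the current count drops back down.

```shell
# Hypothetical simulation of pids.peak semantics: remember the highest
# value the counter ever reached, ignoring later decreases.
watermark=0
update_watermark() {
	nr=$1
	if [ "$nr" -gt "$watermark" ]; then
		watermark=$nr
	fi
}
update_watermark 3
update_watermark 7
update_watermark 2   # count dropped back; the peak is unchanged
echo "$watermark"    # prints 7
```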
@@ -5104,8 +5104,8 @@ struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
struct mem_cgroup *memcg; struct mem_cgroup *memcg;
cgrp = cgroup_get_from_id(ino); cgrp = cgroup_get_from_id(ino);
if (!cgrp) if (IS_ERR(cgrp))
return ERR_PTR(-ENOENT); return ERR_CAST(cgrp);
css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys); css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
if (css) if (css)
...
@@ -40,16 +40,17 @@ static noinline bool
nft_sock_get_eval_cgroupv2(u32 *dest, struct sock *sk, const struct nft_pktinfo *pkt, u32 level) nft_sock_get_eval_cgroupv2(u32 *dest, struct sock *sk, const struct nft_pktinfo *pkt, u32 level)
{ {
struct cgroup *cgrp; struct cgroup *cgrp;
u64 cgid;
if (!sk_fullsock(sk)) if (!sk_fullsock(sk))
return false; return false;
cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data); cgrp = cgroup_ancestor(sock_cgroup_ptr(&sk->sk_cgrp_data), level);
if (level > cgrp->level) if (!cgrp)
return false; return false;
memcpy(dest, &cgrp->ancestor_ids[level], sizeof(u64)); cgid = cgroup_id(cgrp);
memcpy(dest, &cgid, sizeof(u64));
return true; return true;
} }
#endif #endif
...
@@ -61,6 +61,11 @@ autop_names = {
} }
class BlkgIterator: class BlkgIterator:
def __init__(self, root_blkcg, q_id, include_dying=False):
self.include_dying = include_dying
self.blkgs = []
self.walk(root_blkcg, q_id, '')
def blkcg_name(blkcg): def blkcg_name(blkcg):
return blkcg.css.cgroup.kn.name.string_().decode('utf-8') return blkcg.css.cgroup.kn.name.string_().decode('utf-8')
@@ -82,11 +87,6 @@ class BlkgIterator:
blkcg.css.children.address_of_(), 'css.sibling'): blkcg.css.children.address_of_(), 'css.sibling'):
self.walk(c, q_id, path) self.walk(c, q_id, path)
def __init__(self, root_blkcg, q_id, include_dying=False):
self.include_dying = include_dying
self.blkgs = []
self.walk(root_blkcg, q_id, '')
def __iter__(self): def __iter__(self):
return iter(self.blkgs) return iter(self.blkgs)
...
@@ -77,7 +77,7 @@ static inline int get_cgroup_v1_idx(__u32 *cgrps, int size)
break; break;
// convert cgroup-id to a map index // convert cgroup-id to a map index
cgrp_id = BPF_CORE_READ(cgrp, ancestor_ids[i]); cgrp_id = BPF_CORE_READ(cgrp, ancestors[i], kn, id);
elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id); elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id);
if (!elem) if (!elem)
continue; continue;
...
@@ -5,3 +5,4 @@ test_freezer
test_kmem test_kmem
test_kill test_kill
test_cpu test_cpu
wait_inotify
# SPDX-License-Identifier: GPL-2.0 # SPDX-License-Identifier: GPL-2.0
CFLAGS += -Wall -pthread CFLAGS += -Wall -pthread
all: all: ${HELPER_PROGS}
TEST_FILES := with_stress.sh TEST_FILES := with_stress.sh
TEST_PROGS := test_stress.sh TEST_PROGS := test_stress.sh test_cpuset_prs.sh
TEST_GEN_FILES := wait_inotify
TEST_GEN_PROGS = test_memcontrol TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core TEST_GEN_PROGS += test_core
...
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0
#
# Test for cpuset v2 partition root state (PRS)
#
# The sched verbose flag is set, if available, so that the console log
# can be examined for the correct setting of scheduling domain.
#
skip_test() {
echo "$1"
echo "Test SKIPPED"
exit 0
}
[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
# Set sched verbose flag, if available
[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
# Get wait_inotify location
WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
# Find cgroup v2 mount point
CGROUP2=$(mount -t cgroup2 | head -1 | awk '{print $3}')
[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
# Set verbose flag and delay factor
PROG=$1
VERBOSE=
DELAY_FACTOR=1
while [[ "$1" = -* ]]
do
case "$1" in
-v) VERBOSE=1
break
;;
-d) DELAY_FACTOR=$2
shift
break
;;
*) echo "Usage: $PROG [-v] [-d <delay-factor>]"
exit
;;
esac
shift
done
cd $CGROUP2
echo +cpuset > cgroup.subtree_control
[[ -d test ]] || mkdir test
cd test
# Pause in ms
pause()
{
DELAY=$1
LOOP=0
while [[ $LOOP -lt $DELAY_FACTOR ]]
do
sleep $DELAY
((LOOP++))
done
return 0
}
console_msg()
{
MSG=$1
echo "$MSG"
echo "" > /dev/console
echo "$MSG" > /dev/console
pause 0.01
}
test_partition()
{
EXPECTED_VAL=$1
echo $EXPECTED_VAL > cpuset.cpus.partition
[[ $? -eq 0 ]] || exit 1
ACTUAL_VAL=$(cat cpuset.cpus.partition)
[[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $ACTUAL_VAL"
echo "Test FAILED"
exit 1
}
}
test_effective_cpus()
{
EXPECTED_VAL=$1
ACTUAL_VAL=$(cat cpuset.cpus.effective)
[[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$ACTUAL_VAL'"
echo "Test FAILED"
exit 1
}
}
# Adding current process to cgroup.procs as a test
test_add_proc()
{
OUTSTR="$1"
ERRMSG=$((echo $$ > cgroup.procs) |& cat)
echo $ERRMSG | grep -q "$OUTSTR"
[[ $? -ne 0 ]] && {
echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
echo "Test FAILED"
exit 1
}
echo $$ > $CGROUP2/cgroup.procs # Move out the task
}
#
# Testing the new "isolated" partition root type
#
test_isolated()
{
echo 2-3 > cpuset.cpus
TYPE=$(cat cpuset.cpus.partition)
[[ $TYPE = member ]] || echo member > cpuset.cpus.partition
console_msg "Change from member to root"
test_partition root
console_msg "Change from root to isolated"
test_partition isolated
console_msg "Change from isolated to member"
test_partition member
console_msg "Change from member to isolated"
test_partition isolated
console_msg "Change from isolated to root"
test_partition root
console_msg "Change from root to member"
test_partition member
#
# Testing partition root with no cpu
#
console_msg "Distribute all cpus to child partition"
echo +cpuset > cgroup.subtree_control
test_partition root
mkdir A1
cd A1
echo 2-3 > cpuset.cpus
test_partition root
test_effective_cpus 2-3
cd ..
test_effective_cpus ""
console_msg "Moving task to partition test"
test_add_proc "No space left"
cd A1
test_add_proc ""
cd ..
console_msg "Shrink and expand child partition"
cd A1
echo 2 > cpuset.cpus
cd ..
test_effective_cpus 3
cd A1
echo 2-3 > cpuset.cpus
cd ..
test_effective_cpus ""
# Cleaning up
console_msg "Cleaning up"
echo $$ > $CGROUP2/cgroup.procs
[[ -d A1 ]] && rmdir A1
}
#
# Cpuset controller state transition test matrix.
#
# Cgroup test hierarchy
#
# test -- A1 -- A2 -- A3
# \- B1
#
# P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid, -2:isolated invalid)
# C<l> = add cpu-list
# S<p> = use prefix in subtree_control
# T = put a task into cgroup
# O<c>-<v> = Write <v> to CPU online file of <c>
#
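As a worked example of this notation (illustrative only, not an entry used by the matrix), the command string "C1-3:P1:S+" applied to a test cgroup expands to roughly the following writes, executed in order from within that cgroup's directory:

```shell
# Sketch: manual expansion of the matrix command string "C1-3:P1:S+",
# run from inside the target test cgroup's directory.
echo 1-3     > cpuset.cpus              # C1-3: set the cpu list
echo root    > cpuset.cpus.partition    # P1:   make it a partition root
echo +cpuset > cgroup.subtree_control   # S+:   enable cpuset in children
```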
SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
TEST_MATRIX=(
# test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
# ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
" S+ C0-1 . . C2-3 S+ C4-5 . . 0 A2:0-1"
" S+ C0-1 . . C2-3 P1 . . . 0 "
" S+ C0-1 . . C2-3 P1:S+ C0-1:P1 . . 0 "
" S+ C0-1 . . C2-3 P1:S+ C1:P1 . . 0 "
" S+ C0-1:S+ . . C2-3 . . . P1 0 "
" S+ C0-1:P1 . . C2-3 S+ C1 . . 0 "
" S+ C0-1:P1 . . C2-3 S+ C1:P1 . . 0 "
" S+ C0-1:P1 . . C2-3 S+ C1:P1 . P1 0 "
" S+ C0-1:P1 . . C2-3 C4-5 . . . 0 A1:4-5"
" S+ C0-1:P1 . . C2-3 S+:C4-5 . . . 0 A1:4-5"
" S+ C0-1 . . C2-3:P1 . . . C2 0 "
" S+ C0-1 . . C2-3:P1 . . . C4-5 0 B1:4-5"
" S+ C0-3:P1:S+ C2-3:P1 . . . . . . 0 A1:0-1,A2:2-3"
" S+ C0-3:P1:S+ C2-3:P1 . . C1-3 . . . 0 A1:1,A2:2-3"
" S+ C2-3:P1:S+ C3:P1 . . C3 . . . 0 A1:,A2:3 A1:P1,A2:P1"
" S+ C2-3:P1:S+ C3:P1 . . C3 P0 . . 0 A1:3,A2:3 A1:P1,A2:P0"
" S+ C2-3:P1:S+ C2:P1 . . C2-4 . . . 0 A1:3-4,A2:2"
" S+ C2-3:P1:S+ C3:P1 . . C3 . . C0-2 0 A1:,B1:0-2 A1:P1,A2:P1"
" S+ $SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
# CPU offlining cases:
" S+ C0-1 . . C2-3 S+ C4-5 . O2-0 0 A1:0-1,B1:3"
" S+ C0-3:P1:S+ C2-3:P1 . . O2-0 . . . 0 A1:0-1,A2:3"
" S+ C0-3:P1:S+ C2-3:P1 . . O2-0 O2-1 . . 0 A1:0-1,A2:2-3"
" S+ C0-3:P1:S+ C2-3:P1 . . O1-0 . . . 0 A1:0,A2:2-3"
" S+ C0-3:P1:S+ C2-3:P1 . . O1-0 O1-1 . . 0 A1:0-1,A2:2-3"
" S+ C2-3:P1:S+ C3:P1 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
" S+ C2-3:P1:S+ C3:P2 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
" S+ C2-3:P1:S+ C3:P1 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
" S+ C2-3:P1:S+ C3:P2 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
" S+ C2-3:P1:S+ C3:P1 . . O2-0 . . . 0 A1:,A2:3 A1:P1,A2:P1"
" S+ C2-3:P1:S+ C3:P1 . . O3-0 . . . 0 A1:2,A2: A1:P1,A2:P1"
" S+ C2-3:P1:S+ C3:P1 . . T:O2-0 . . . 0 A1:3,A2:3 A1:P1,A2:P-1"
" S+ C2-3:P1:S+ C3:P1 . . . T:O3-0 . . 0 A1:2,A2:2 A1:P1,A2:P-1"
" S+ $SETUP_A123_PARTITIONS . O1-0 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
" S+ $SETUP_A123_PARTITIONS . O2-0 . . . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
" S+ $SETUP_A123_PARTITIONS . O3-0 . . . 0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1"
" S+ $SETUP_A123_PARTITIONS . T:O1-0 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
" S+ $SETUP_A123_PARTITIONS . . T:O2-0 . . 0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
" S+ $SETUP_A123_PARTITIONS . . . T:O3-0 . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
" S+ $SETUP_A123_PARTITIONS . T:O1-0 O1-1 . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
" S+ $SETUP_A123_PARTITIONS . . T:O2-0 O2-1 . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
" S+ $SETUP_A123_PARTITIONS . . . T:O3-0 O3-1 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
" S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O1-1 . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
" S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O2-1 . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
# test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
# ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
#
# Incorrect change to cpuset.cpus invalidates partition root
#
# Adding CPUs to partition root that are not in parent's
# cpuset.cpus is allowed, but those extra CPUs are ignored.
" S+ C2-3:P1:S+ C3:P1 . . . C2-4 . . 0 A1:,A2:2-3 A1:P1,A2:P1"
# Taking away all CPUs from parent or itself if there are tasks
# will make the partition invalid.
" S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
" S+ $SETUP_A123_PARTITIONS . T:C2-3 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
" S+ $SETUP_A123_PARTITIONS . T:C2-3:C1-3 . . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
# Changing a partition root to member makes child partitions invalid
" S+ C2-3:P1:S+ C3:P1 . . P0 . . . 0 A1:2-3,A2:3 A1:P0,A2:P-1"
" S+ $SETUP_A123_PARTITIONS . C2-3 P0 . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P-1"
# cpuset.cpus can contain cpus not in parent's cpuset.cpus as long
# as they overlap.
" S+ C2-3:P1:S+ . . . . C3-4:P1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
# Deletion of CPUs distributed to child cgroup is allowed.
" S+ C0-1:P1:S+ C1 . C2-3 C4-5 . . . 0 A1:4-5,A2:4-5"
# To become a valid partition root, cpuset.cpus must overlap parent's
# cpuset.cpus.
" S+ C0-1:P1 . . C2-3 S+ C4-5:P1 . . 0 A1:0-1,A2:0-1 A1:P1,A2:P-1"
# Enabling partition with child cpusets is allowed
" S+ C0-1:S+ C1 . C2-3 P1 . . . 0 A1:0-1,A2:1 A1:P1"
# A partition root with non-partition root parent is invalid, but it
# can be made valid if its parent becomes a partition root too.
" S+ C0-1:S+ C1 . C2-3 . P2 . . 0 A1:0-1,A2:1 A1:P0,A2:P-2"
" S+ C0-1:S+ C1:P2 . C2-3 P1 . . . 0 A1:0,A2:1 A1:P1,A2:P2"
# A non-exclusive cpuset.cpus change will invalidate partition and its siblings
" S+ C0-1:P1 . . C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P0"
" S+ C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P-1"
" S+ C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P0,B1:P-1"
# test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
# ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
# Failure cases:
# A task cannot be added to a partition with no cpu
" S+ C2-3:P1:S+ C3:P1 . . O2-0:T . . . 1 A1:,A2:3 A1:P1,A2:P1"
)
#
# Write to the cpu online file
# $1 - <c>-<v> where <c> = cpu number, <v> value to be written
#
write_cpu_online()
{
CPU=${1%-*}
VAL=${1#*-}
CPUFILE=/sys/devices/system/cpu/cpu${CPU}/online
if [[ $VAL -eq 0 ]]
then
OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
else
[[ -n "$OFFLINE_CPUS" ]] && {
OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
sort | uniq -u)
}
fi
echo $VAL > $CPUFILE
pause 0.01
}
#
# Set controller state
# $1 - cgroup directory
# $2 - state
# $3 - showerr
#
# The presence of ":" in state means transition from one to the next.
#
set_ctrl_state()
{
TMPMSG=/tmp/.msg_$$
CGRP=$1
STATE=$2
SHOWERR=${3}${VERBOSE}
CTRL=${CTRL:=$CONTROLLER}
HASERR=0
REDIRECT="2> $TMPMSG"
[[ -z "$STATE" || "$STATE" = '.' ]] && return 0
rm -f $TMPMSG
for CMD in $(echo $STATE | sed -e "s/:/ /g")
do
TFILE=$CGRP/cgroup.procs
SFILE=$CGRP/cgroup.subtree_control
PFILE=$CGRP/cpuset.cpus.partition
CFILE=$CGRP/cpuset.cpus
S=$(expr substr $CMD 1 1)
if [[ $S = S ]]
then
PREFIX=${CMD#?}
COMM="echo ${PREFIX}${CTRL} > $SFILE"
eval $COMM $REDIRECT
elif [[ $S = C ]]
then
CPUS=${CMD#?}
COMM="echo $CPUS > $CFILE"
eval $COMM $REDIRECT
elif [[ $S = P ]]
then
VAL=${CMD#?}
case $VAL in
0) VAL=member
;;
1) VAL=root
;;
2) VAL=isolated
;;
*)
echo "Invalid partition state - $VAL"
exit 1
;;
esac
COMM="echo $VAL > $PFILE"
eval $COMM $REDIRECT
elif [[ $S = O ]]
then
VAL=${CMD#?}
write_cpu_online $VAL
elif [[ $S = T ]]
then
COMM="echo 0 > $TFILE"
eval $COMM $REDIRECT
fi
RET=$?
[[ $RET -ne 0 ]] && {
[[ -n "$SHOWERR" ]] && {
echo "$COMM"
cat $TMPMSG
}
HASERR=1
}
pause 0.01
rm -f $TMPMSG
done
return $HASERR
}
set_ctrl_state_noerr()
{
CGRP=$1
STATE=$2
[[ -d $CGRP ]] || mkdir $CGRP
set_ctrl_state $CGRP $STATE 1
[[ $? -ne 0 ]] && {
echo "ERROR: Failed to set $2 to cgroup $1!"
exit 1
}
}
online_cpus()
{
[[ -n "$OFFLINE_CPUS" ]] && {
for C in $OFFLINE_CPUS
do
write_cpu_online ${C}-1
done
}
}
#
# Reset cgroup states: move the task back to the root cgroup, re-online any
# offlined CPUs, remove the test cgroups and disable the cpuset controller.
#
reset_cgroup_states()
{
echo 0 > $CGROUP2/cgroup.procs
online_cpus
rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
set_ctrl_state . S-
pause 0.01
}
dump_states()
{
for DIR in A1 A1/A2 A1/A2/A3 B1
do
ECPUS=$DIR/cpuset.cpus.effective
PRS=$DIR/cpuset.cpus.partition
[[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
[[ -e $PRS ]] && echo "$PRS: $(cat $PRS)"
done
}
#
# Check effective cpus
# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
#
check_effective_cpus()
{
CHK_STR=$1
for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
do
set -- $(echo $CHK | sed -e "s/:/ /g")
CGRP=$1
CPUS=$2
[[ $CGRP = A2 ]] && CGRP=A1/A2
[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
FILE=$CGRP/cpuset.cpus.effective
[[ -e $FILE ]] || return 1
[[ $CPUS = $(cat $FILE) ]] || return 1
done
}
#
# Check cgroup states
# $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
#
check_cgroup_states()
{
CHK_STR=$1
for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
do
set -- $(echo $CHK | sed -e "s/:/ /g")
CGRP=$1
STATE=$2
FILE=
EVAL=$(expr substr $STATE 2 2)
[[ $CGRP = A2 ]] && CGRP=A1/A2
[[ $CGRP = A3 ]] && CGRP=A1/A2/A3
case $STATE in
P*) FILE=$CGRP/cpuset.cpus.partition
;;
*) echo "Unknown state: $STATE!"
exit 1
;;
esac
VAL=$(cat $FILE)
case "$VAL" in
member) VAL=0
;;
root) VAL=1
;;
isolated)
VAL=2
;;
"root invalid"*)
VAL=-1
;;
"isolated invalid"*)
VAL=-2
;;
esac
[[ $EVAL != $VAL ]] && return 1
done
return 0
}
#
# Run cpuset state transition test
# $1 - test matrix name
#
# This test is somewhat fragile as delays (sleep x) are added in various
# places to make sure state changes are fully propagated before the next
# action. These delays may need to be adjusted when running on a slower machine.
#
run_state_test()
{
TEST=$1
CONTROLLER=cpuset
CPULIST=0-6
I=0
eval CNT="\${#$TEST[@]}"
reset_cgroup_states
echo $CPULIST > cpuset.cpus
echo root > cpuset.cpus.partition
console_msg "Running state transition test ..."
while [[ $I -lt $CNT ]]
do
echo "Running test $I ..." > /dev/console
eval set -- "\${$TEST[$I]}"
ROOT=$1
OLD_A1=$2
OLD_A2=$3
OLD_A3=$4
OLD_B1=$5
NEW_A1=$6
NEW_A2=$7
NEW_A3=$8
NEW_B1=$9
RESULT=${10}
ECPUS=${11}
STATES=${12}
set_ctrl_state_noerr . $ROOT
set_ctrl_state_noerr A1 $OLD_A1
set_ctrl_state_noerr A1/A2 $OLD_A2
set_ctrl_state_noerr A1/A2/A3 $OLD_A3
set_ctrl_state_noerr B1 $OLD_B1
RETVAL=0
set_ctrl_state A1 $NEW_A1; ((RETVAL += $?))
set_ctrl_state A1/A2 $NEW_A2; ((RETVAL += $?))
set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
set_ctrl_state B1 $NEW_B1; ((RETVAL += $?))
[[ $RETVAL -ne $RESULT ]] && {
echo "Test $TEST[$I] failed result check!"
eval echo \"\${$TEST[$I]}\"
dump_states
online_cpus
exit 1
}
[[ -n "$ECPUS" && "$ECPUS" != . ]] && {
check_effective_cpus $ECPUS
[[ $? -ne 0 ]] && {
echo "Test $TEST[$I] failed effective CPU check!"
eval echo \"\${$TEST[$I]}\"
echo
dump_states
online_cpus
exit 1
}
}
[[ -n "$STATES" ]] && {
check_cgroup_states $STATES
[[ $? -ne 0 ]] && {
echo "FAILED: Test $TEST[$I] failed states check!"
eval echo \"\${$TEST[$I]}\"
echo
dump_states
online_cpus
exit 1
}
}
reset_cgroup_states
#
# Check to see if effective cpu list changes
#
pause 0.05
NEWLIST=$(cat cpuset.cpus.effective)
[[ $NEWLIST != $CPULIST ]] && {
echo "Effective cpus changed to $NEWLIST after test $I!"
exit 1
}
[[ -n "$VERBOSE" ]] && echo "Test $I done."
((I++))
done
echo "All $I tests of $TEST PASSED."
echo member > cpuset.cpus.partition
}
#
# Wait for inotify event for the given file and read it
# $1: cgroup file to wait for
# $2: file to store the read result
#
wait_inotify()
{
CGROUP_FILE=$1
OUTPUT_FILE=$2
$WAIT_INOTIFY $CGROUP_FILE
cat $CGROUP_FILE > $OUTPUT_FILE
}
#
# Test if inotify events are properly generated when going into and out of
# invalid partition state.
#
test_inotify()
{
ERR=0
PRS=/tmp/.prs_$$
[[ -f $WAIT_INOTIFY ]] || {
echo "wait_inotify not found, inotify test SKIPPED."
return
}
pause 0.01
echo 1 > cpuset.cpus
echo 0 > cgroup.procs
echo root > cpuset.cpus.partition
pause 0.01
rm -f $PRS
wait_inotify $PWD/cpuset.cpus.partition $PRS &
pause 0.01
set_ctrl_state . "O1-0"
pause 0.01
check_cgroup_states ".:P-1"
if [[ $? -ne 0 ]]
then
echo "FAILED: Inotify test - partition not invalid"
ERR=1
elif [[ ! -f $PRS ]]
then
echo "FAILED: Inotify test - event not generated"
ERR=1
kill %1
elif [[ $(cat $PRS) != "root invalid"* ]]
then
echo "FAILED: Inotify test - incorrect state"
cat $PRS
ERR=1
fi
online_cpus
echo member > cpuset.cpus.partition
echo 0 > ../cgroup.procs
if [[ $ERR -ne 0 ]]
then
exit 1
else
echo "Inotify test PASSED"
fi
}
run_state_test TEST_MATRIX
test_isolated
test_inotify
echo "All tests PASSED."
cd ..
rmdir test
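The ':'-separated command strings consumed by set_ctrl_state() above pack several operations into one token per cgroup. A minimal standalone sketch of that decoding follows; it is not part of the selftest, and the function name decode_state and the printed labels are illustrative only.

```shell
#!/bin/sh
# Decode one set_ctrl_state-style command string into the operations it
# represents. Illustrative sketch only; names are not from the selftest.
decode_state()
{
	for CMD in $(echo "$1" | sed -e "s/:/ /g")
	do
		OP=${CMD%"${CMD#?}"}	# first character selects the operation
		ARG=${CMD#?}		# the remainder is its argument
		case $OP in
		S) echo "subtree_control: ${ARG}cpuset" ;;	# S+ enable, S- disable
		C) echo "cpuset.cpus: $ARG" ;;
		P) echo "cpuset.cpus.partition: $ARG" ;;	# 0=member 1=root 2=isolated
		O) echo "cpu online: $ARG" ;;			# <cpu>-<0|1>
		T) echo "cgroup.procs: add task" ;;
		esac
	done
}

decode_state "C0-3:P1:S+"
# Prints:
#   cpuset.cpus: 0-3
#   cpuset.cpus.partition: 1
#   subtree_control: +cpuset
```

The same first-character dispatch is what the `expr substr $CMD 1 1` call in set_ctrl_state() performs.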
// SPDX-License-Identifier: GPL-2.0
/*
* Wait until an inotify event on the given cgroup file.
*/
#include <linux/limits.h>
#include <sys/inotify.h>
#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
static char *file;
static int verbose;
static inline void fail_message(char *msg)
{
fprintf(stderr, msg, file);
exit(1);
}
int main(int argc, char *argv[])
{
char *cmd = argv[0];
int c, fd;
struct pollfd fds = { .events = POLLIN, };
while ((c = getopt(argc, argv, "v")) != -1) {
switch (c) {
case 'v':
verbose++;
break;
}
argv++, argc--;
}
if (argc != 2) {
fprintf(stderr, usage, cmd);
return -1;
}
file = argv[1];
fd = open(file, O_RDONLY);
if (fd < 0)
fail_message("Cgroup file %s not found!\n");
close(fd);
fd = inotify_init();
if (fd < 0)
fail_message("inotify_init() fails on %s!\n");
if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
fail_message("inotify_add_watch() fails on %s!\n");
fds.fd = fd;
/*
* poll waiting loop
*/
for (;;) {
int ret = poll(&fds, 1, 10000);
if (ret < 0) {
if (errno == EINTR)
continue;
perror("poll");
exit(1);
}
if ((ret > 0) && (fds.revents & POLLIN))
break;
}
if (verbose) {
struct inotify_event events[10];
long len;
usleep(1000);
len = read(fd, events, sizeof(events));
printf("Number of events read = %ld\n",
len/sizeof(struct inotify_event));
}
close(fd);
return 0;
}
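When the wait_inotify helper binary above is unavailable, the shell side could fall back to polling the file's contents instead of waiting for an IN_MODIFY event. A self-contained sketch under that assumption; wait_for_change is an illustrative name, not part of the selftest:

```shell
#!/bin/sh
# Poll a file until its contents change, then print the new value.
# A portable stand-in for the inotify-based helper; illustrative only.
wait_for_change()
{
	FILE=$1
	BEFORE=$(cat "$FILE")
	NOW=$BEFORE
	# Loop until the contents differ and are non-empty (a writer may
	# truncate the file momentarily before the new value lands).
	while [ "$NOW" = "$BEFORE" ] || [ -z "$NOW" ]
	do
		sleep 0.1
		NOW=$(cat "$FILE")
	done
	echo "$NOW"
}

TMP=$(mktemp)
echo "root" > "$TMP"
( sleep 0.3; echo "root invalid (No online CPU)" > "$TMP" ) &
NEW=$(wait_for_change "$TMP")
echo "new state: $NEW"
rm -f "$TMP"
```

Unlike inotify, polling burns a read per interval and can miss intermediate states, which is why the selftest prefers the dedicated helper.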