Commit adf4bfc4 authored by Linus Torvalds

Merge tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - cpuset now supports an isolated cpus.partition type, which will
   enable dynamic CPU isolation

 - pids.peak added to remember the max number of pids used

 - holes in cgroup namespace plugged

 - internal cleanups

* tag 'cgroup-for-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (25 commits)
  cgroup: use strscpy() is more robust and safer
  iocost_monitor: reorder BlkgIterator
  cgroup: simplify code in cgroup_apply_control
  cgroup: Make cgroup_get_from_id() prettier
  cgroup/cpuset: remove unreachable code
  cgroup: Remove CFTYPE_PRESSURE
  cgroup: Improve cftype add/rm error handling
  kselftest/cgroup: Add cpuset v2 partition root state test
  cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst
  cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule
  cgroup/cpuset: Relocate a code block in validate_change()
  cgroup/cpuset: Show invalid partition reason string
  cgroup/cpuset: Add a new isolated cpus.partition type
  cgroup/cpuset: Relax constraints to partition & cpus changes
  cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective
  cgroup/cpuset: Miscellaneous cleanups & add helper functions
  cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
  cgroup: add pids.peak interface for pids controller
  cgroup: Remove data-race around cgrp_dfl_visible
  cgroup: Fix build failure when CONFIG_SHRINKER_DEBUG
  ...
parents 8adc0486 8619e94d
......@@ -2190,75 +2190,93 @@ Cpuset Interface Files
It accepts only the following input values when written to.
======== ================================
"root" a partition root
"member" a non-root member of a partition
======== ================================
When set to be a partition root, the current cgroup is the
root of a new partition or scheduling domain that comprises
itself and all its descendants except those that are separate
partition roots themselves and their descendants. The root
cgroup is always a partition root.
There are constraints on where a partition root can be set.
It can only be set in a cgroup if all the following conditions
are true.
1) The "cpuset.cpus" is not empty and the list of CPUs are
exclusive, i.e. they are not shared by any of its siblings.
2) The parent cgroup is a partition root.
3) The "cpuset.cpus" is also a proper subset of the parent's
"cpuset.cpus.effective".
4) There are no child cgroups with cpuset enabled. This
eliminates corner cases that would otherwise have to be
handled if such a condition were allowed.
Setting it to partition root will take the CPUs away from the
effective CPUs of the parent cgroup. Once it is set, this
file cannot be reverted to "member" if there are any child
cgroups with cpuset enabled.
A parent partition cannot distribute all its CPUs to its
child partitions. There must be at least one CPU left in the
parent partition.
Once it becomes a partition root, changes to "cpuset.cpus" are
generally allowed as long as the first condition above holds,
the change does not take away all the CPUs from the parent
partition, and the new "cpuset.cpus" value is a superset of its
children's "cpuset.cpus" values.
Sometimes, external factors like changes to ancestors'
"cpuset.cpus" or CPU hotplug can cause the state of the partition
root to change. On read, the "cpuset.cpus.partition" file
can show the following values.
==============   ==============================
"member"         Non-root member of a partition
"root"           Partition root
"root invalid"   Invalid partition root
==============   ==============================
It is a partition root if the first two partition root conditions
above are true and at least one CPU from "cpuset.cpus" is
granted by the parent cgroup.
A partition root can become invalid if none of the CPUs requested
in "cpuset.cpus" can be granted by the parent cgroup or if the
parent cgroup is no longer a partition root itself. In this
case, it is not a real partition even though the restriction
of the first partition root condition above still applies.
The CPU affinity of all the tasks in the cgroup will then be
associated with CPUs in the nearest ancestor partition.
An invalid partition root can be transitioned back to a
real partition root if at least one of the requested CPUs
can now be granted by its parent. In this case, the CPU
affinity of all the tasks in the formerly invalid partition
will be associated with the CPUs of the newly formed partition.
Changing the partition state of an invalid partition root to
"member" is always allowed even if child cpusets are present.
==========   =====================================
"member"     Non-root member of a partition
"root"       Partition root
"isolated"   Partition root without load balancing
==========   =====================================
The root cgroup is always a partition root and its state
cannot be changed. All other non-root cgroups start out as
"member".
When set to "root", the current cgroup is the root of a new
partition or scheduling domain that comprises itself and all
its descendants except those that are separate partition roots
themselves and their descendants.
When set to "isolated", the CPUs in that partition root will
be in an isolated state without any load balancing from the
scheduler. Tasks placed in such a partition with multiple
CPUs should be carefully distributed and bound to each of the
individual CPUs for optimal performance.
The value shown in "cpuset.cpus.effective" of a partition root
is the CPUs that the partition root can dedicate to a potential
new child partition root. A new child partition takes these
CPUs away from its parent's "cpuset.cpus.effective".
A partition root ("root" or "isolated") can be in one of two
possible states: valid or invalid. An invalid partition
root is in a degraded state where some state information may
be retained, but it behaves more like a "member".
All possible state transitions among "member", "root" and
"isolated" are allowed.
On read, the "cpuset.cpus.partition" file can show the following
values.
============================= =====================================
"member" Non-root member of a partition
"root" Partition root
"isolated" Partition root without load balancing
"root invalid (<reason>)" Invalid partition root
"isolated invalid (<reason>)" Invalid isolated partition root
============================= =====================================
In the case of an invalid partition root, a descriptive string
explaining why the partition is invalid is included within
parentheses.
For a partition root to become valid, the following conditions
must be met.
1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
are not shared by any of its siblings (exclusivity rule).
2) The parent cgroup is a valid partition root.
3) The "cpuset.cpus" is not empty and must contain at least
one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
4) The "cpuset.cpus.effective" cannot be empty unless there is
no task associated with this partition.
External events like hotplug or changes to "cpuset.cpus" can
cause a valid partition root to become invalid and vice versa.
Note that a task cannot be moved to a cgroup with empty
"cpuset.cpus.effective".
For a valid partition root with the sibling CPU exclusivity
rule enabled, changes made to "cpuset.cpus" that violate the
exclusivity rule will invalidate the partition as well as its
sibling partitions with conflicting cpuset.cpus values. So
care must be taken when changing "cpuset.cpus".
A valid non-root parent partition may distribute all its CPUs
to its child partitions when there is no task associated with it.
Care must be taken when changing a valid partition root to
"member", as all its child partitions, if present, will become
invalid, causing disruption to tasks running in those child
partitions. These deactivated partitions can be recovered if
their parent is switched back to a partition root with a proper
set of "cpuset.cpus".
Poll and inotify events are triggered whenever the state of
"cpuset.cpus.partition" changes. That includes changes caused
by writes to "cpuset.cpus.partition", CPU hotplug, or other
changes that modify the validity status of the partition.
This allows user-space agents to monitor unexpected changes
to "cpuset.cpus.partition" without the need for continuous
polling.
Device controller
......
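For readers who want to try the interface described above, here is a minimal userspace sketch that carves out an isolated partition with plain file writes. The mount point, cgroup name, and CPU list are illustrative assumptions, and it presumes the cpuset controller is already enabled in the parent's "cgroup.subtree_control"; the wait_inotify.c helper added later in this series shows the matching monitoring side.

/*
 * Minimal sketch (not part of this series): create an isolated cpuset
 * partition "rt" owning CPUs 2-3. Assumes a cgroup v2 mount at
 * /sys/fs/cgroup with the cpuset controller enabled; all paths and
 * CPU numbers are illustrative.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static int write_file(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0 || write(fd, val, strlen(val)) < 0) {
		perror(path);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	if (mkdir("/sys/fs/cgroup/rt", 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}
	/* Claim exclusive CPUs, then request the isolated partition type. */
	if (write_file("/sys/fs/cgroup/rt/cpuset.cpus", "2-3") ||
	    write_file("/sys/fs/cgroup/rt/cpuset.cpus.partition", "isolated"))
		return 1;
	/* On success, reading cpuset.cpus.partition reports "isolated". */
	return 0;
}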
......@@ -19,8 +19,8 @@ int blkcg_set_fc_appid(char *app_id, u64 cgrp_id, size_t app_id_len)
return -EINVAL;
cgrp = cgroup_get_from_id(cgrp_id);
if (!cgrp)
return -ENOENT;
if (IS_ERR(cgrp))
return PTR_ERR(cgrp);
css = cgroup_get_e_css(cgrp, &io_cgrp_subsys);
if (!css) {
ret = -ENOENT;
......
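This hunk adapts blkcg_set_fc_appid() to cgroup_get_from_id() now returning an ERR_PTR-encoded errno instead of NULL on failure. As a refresher on the idiom, here is a hedged userspace model of the kernel's ERR_PTR/IS_ERR/PTR_ERR helpers; the real macros live in include/linux/err.h and reserve the top 4095 addresses for error codes.

/* Simplified userspace model of the ERR_PTR idiom; a sketch, not the
 * kernel's actual implementation. */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_ERRNO	4095
#define ERR_PTR(err)	((void *)(intptr_t)(err))
#define PTR_ERR(ptr)	((long)(intptr_t)(ptr))
#define IS_ERR(ptr)	((uintptr_t)(ptr) >= (uintptr_t)-MAX_ERRNO)

static void *lookup(int found)
{
	static int obj;

	/* Failure is encoded in the pointer value itself. */
	return found ? (void *)&obj : ERR_PTR(-ENOENT);
}

int main(void)
{
	void *p = lookup(0);

	if (IS_ERR(p))
		printf("lookup failed: %ld\n", PTR_ERR(p));	/* prints -2 */
	return 0;
}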
......@@ -126,11 +126,11 @@ enum {
CFTYPE_NO_PREFIX = (1 << 3), /* (DON'T USE FOR NEW FILES) no subsys prefix */
CFTYPE_WORLD_WRITABLE = (1 << 4), /* (DON'T USE FOR NEW FILES) S_IWUGO */
CFTYPE_DEBUG = (1 << 5), /* create when cgroup_debug */
CFTYPE_PRESSURE = (1 << 6), /* only if pressure feature is enabled */
/* internal flags, do not use outside cgroup core proper */
__CFTYPE_ONLY_ON_DFL = (1 << 16), /* only on default hierarchy */
__CFTYPE_NOT_ON_DFL = (1 << 17), /* not on default hierarchy */
__CFTYPE_ADDED = (1 << 18),
};
/*
......@@ -384,7 +384,7 @@ struct cgroup {
/*
* The depth this cgroup is at. The root is at depth zero and each
* step down the hierarchy increments the level. This along with
* ancestor_ids[] can determine whether a given cgroup is a
* ancestors[] can determine whether a given cgroup is a
* descendant of another without traversing the hierarchy.
*/
int level;
......@@ -504,8 +504,8 @@ struct cgroup {
/* Used to store internal freezer state */
struct cgroup_freezer_state freezer;
/* ids of the ancestors at each level including self */
u64 ancestor_ids[];
/* All ancestors including self */
struct cgroup *ancestors[];
};
/*
......@@ -522,11 +522,15 @@ struct cgroup_root {
/* Unique id for this hierarchy. */
int hierarchy_id;
/* The root cgroup. Root is destroyed on its release. */
/*
* The root cgroup. The containing cgroup_root will be destroyed on its
* release. cgrp->ancestors[0] will be used overflowing into the
* following field. cgrp_ancestor_storage must immediately follow.
*/
struct cgroup cgrp;
/* for cgrp->ancestor_ids[0] */
u64 cgrp_ancestor_id_storage;
/* must follow cgrp for cgrp->ancestors[0], see above */
struct cgroup *cgrp_ancestor_storage;
/* Number of cgroups in the hierarchy, used only for /proc/cgroups */
atomic_t nr_cgrps;
......
......@@ -575,7 +575,7 @@ static inline bool cgroup_is_descendant(struct cgroup *cgrp,
{
if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
return false;
return cgrp->ancestor_ids[ancestor->level] == cgroup_id(ancestor);
return cgrp->ancestors[ancestor->level] == ancestor;
}
/**
......@@ -592,11 +592,9 @@ static inline bool cgroup_is_descendant(struct cgroup *cgrp,
static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
int ancestor_level)
{
if (cgrp->level < ancestor_level)
if (ancestor_level < 0 || ancestor_level > cgrp->level)
return NULL;
while (cgrp && cgrp->level > ancestor_level)
cgrp = cgroup_parent(cgrp);
return cgrp;
return cgrp->ancestors[ancestor_level];
}
/**
......@@ -748,11 +746,6 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
{}
static inline struct cgroup *cgroup_get_from_id(u64 id)
{
return NULL;
}
#endif /* !CONFIG_CGROUPS */
#ifdef CONFIG_CGROUPS
......
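The replacement of ancestor_ids[] with ancestors[] turns both cgroup_ancestor() and cgroup_is_descendant() into a single array access instead of an upward walk: every cgroup records pointers to all of its ancestors, including itself, at creation time. A self-contained sketch of the underlying idea (names are illustrative, not the kernel's types):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct node {
	int level;			/* depth; the root is at level 0 */
	struct node *ancestors[];	/* [0..level], including self */
};

static struct node *node_create(struct node *parent)
{
	int level = parent ? parent->level + 1 : 0;
	struct node *n = malloc(sizeof(*n) + (level + 1) * sizeof(n));

	if (!n)
		exit(1);
	n->level = level;
	if (parent)	/* inherit the parent's ancestor table */
		memcpy(n->ancestors, parent->ancestors, level * sizeof(n));
	n->ancestors[level] = n;	/* record self at its own depth */
	return n;
}

static int is_descendant(const struct node *n, const struct node *anc)
{
	/* One array access instead of walking parent pointers. */
	return n->level >= anc->level && n->ancestors[anc->level] == anc;
}

int main(void)
{
	struct node *root = node_create(NULL);
	struct node *a = node_create(root);
	struct node *b = node_create(a);

	printf("%d %d\n", is_descendant(b, root), is_descendant(root, b));
	free(b); free(a); free(root);
	return 0;
}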
......@@ -250,6 +250,8 @@ int cgroup_migrate(struct task_struct *leader, bool threadgroup,
int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
bool threadgroup);
void cgroup_attach_lock(bool lock_threadgroup);
void cgroup_attach_unlock(bool lock_threadgroup);
struct task_struct *cgroup_procs_write_start(char *buf, bool threadgroup,
bool *locked)
__acquires(&cgroup_threadgroup_rwsem);
......
......@@ -59,8 +59,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
int retval = 0;
mutex_lock(&cgroup_mutex);
cpus_read_lock();
percpu_down_write(&cgroup_threadgroup_rwsem);
cgroup_attach_lock(true);
for_each_root(root) {
struct cgroup *from_cgrp;
......@@ -72,8 +71,7 @@ int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk)
if (retval)
break;
}
percpu_up_write(&cgroup_threadgroup_rwsem);
cpus_read_unlock();
cgroup_attach_unlock(true);
mutex_unlock(&cgroup_mutex);
return retval;
......
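The open-coded lock sequence in cgroup_attach_task_all() is replaced by the cgroup_attach_lock()/cgroup_attach_unlock() pair declared in the previous hunk. Judging from the sequence they replace, the helpers presumably look something like the sketch below; the authoritative definitions live in kernel/cgroup/cgroup.c, one of the diffs collapsed below.

/* Presumed shape of the new helpers, inferred from the open-coded
 * sequence they replace; a sketch, not the verified source. */
void cgroup_attach_lock(bool lock_threadgroup)
{
	cpus_read_lock();		/* block CPU hotplug first */
	if (lock_threadgroup)
		percpu_down_write(&cgroup_threadgroup_rwsem);
}

void cgroup_attach_unlock(bool lock_threadgroup)
{
	if (lock_threadgroup)
		percpu_up_write(&cgroup_threadgroup_rwsem);
	cpus_read_unlock();		/* release in reverse order */
}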
This diff is collapsed.
This diff is collapsed.
......@@ -47,6 +47,7 @@ struct pids_cgroup {
*/
atomic64_t counter;
atomic64_t limit;
int64_t watermark;
/* Handle for "pids.events" */
struct cgroup_file events_file;
......@@ -85,6 +86,16 @@ static void pids_css_free(struct cgroup_subsys_state *css)
kfree(css_pids(css));
}
static void pids_update_watermark(struct pids_cgroup *p, int64_t nr_pids)
{
/*
* This is racy, but we don't need perfectly accurate tallying of
* the watermark, and this lets us avoid extra atomic overhead.
*/
if (nr_pids > READ_ONCE(p->watermark))
WRITE_ONCE(p->watermark, nr_pids);
}
/**
* pids_cancel - uncharge the local pid count
* @pids: the pid cgroup state
......@@ -128,8 +139,11 @@ static void pids_charge(struct pids_cgroup *pids, int num)
{
struct pids_cgroup *p;
for (p = pids; parent_pids(p); p = parent_pids(p))
atomic64_add(num, &p->counter);
for (p = pids; parent_pids(p); p = parent_pids(p)) {
int64_t new = atomic64_add_return(num, &p->counter);
pids_update_watermark(p, new);
}
}
/**
......@@ -156,6 +170,12 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
*/
if (new > limit)
goto revert;
/*
* Not technically accurate if we go over limit somewhere up
* the hierarchy, but that's tolerable for the watermark.
*/
pids_update_watermark(p, new);
}
return 0;
......@@ -311,6 +331,14 @@ static s64 pids_current_read(struct cgroup_subsys_state *css,
return atomic64_read(&pids->counter);
}
static s64 pids_peak_read(struct cgroup_subsys_state *css,
struct cftype *cft)
{
struct pids_cgroup *pids = css_pids(css);
return READ_ONCE(pids->watermark);
}
static int pids_events_show(struct seq_file *sf, void *v)
{
struct pids_cgroup *pids = css_pids(seq_css(sf));
......@@ -331,6 +359,11 @@ static struct cftype pids_files[] = {
.read_s64 = pids_current_read,
.flags = CFTYPE_NOT_ON_ROOT,
},
{
.name = "peak",
.flags = CFTYPE_NOT_ON_ROOT,
.read_s64 = pids_peak_read,
},
{
.name = "events",
.seq_show = pids_events_show,
......
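With the new cftype entry in place, the watermark is exposed as a read-only pids.peak file alongside pids.current. A hedged userspace sketch of consuming it (the cgroup path is an illustrative assumption):

/* Read the pids.peak watermark of a cgroup; the path is illustrative. */
#include <stdio.h>

int main(void)
{
	long peak;
	FILE *f = fopen("/sys/fs/cgroup/mygroup/pids.peak", "r");

	if (!f || fscanf(f, "%ld", &peak) != 1) {
		perror("pids.peak");
		return 1;
	}
	printf("max concurrent pids seen: %ld\n", peak);
	fclose(f);
	return 0;
}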
......@@ -5104,8 +5104,8 @@ struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
struct mem_cgroup *memcg;
cgrp = cgroup_get_from_id(ino);
if (!cgrp)
return ERR_PTR(-ENOENT);
if (IS_ERR(cgrp))
return ERR_CAST(cgrp);
css = cgroup_get_e_css(cgrp, &memory_cgrp_subsys);
if (css)
......
......@@ -40,16 +40,17 @@ static noinline bool
nft_sock_get_eval_cgroupv2(u32 *dest, struct sock *sk, const struct nft_pktinfo *pkt, u32 level)
{
struct cgroup *cgrp;
u64 cgid;
if (!sk_fullsock(sk))
return false;
cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
if (level > cgrp->level)
cgrp = cgroup_ancestor(sock_cgroup_ptr(&sk->sk_cgrp_data), level);
if (!cgrp)
return false;
memcpy(dest, &cgrp->ancestor_ids[level], sizeof(u64));
cgid = cgroup_id(cgrp);
memcpy(dest, &cgid, sizeof(u64));
return true;
}
#endif
......
......@@ -61,6 +61,11 @@ autop_names = {
}
class BlkgIterator:
def __init__(self, root_blkcg, q_id, include_dying=False):
self.include_dying = include_dying
self.blkgs = []
self.walk(root_blkcg, q_id, '')
def blkcg_name(blkcg):
return blkcg.css.cgroup.kn.name.string_().decode('utf-8')
......@@ -82,11 +87,6 @@ class BlkgIterator:
blkcg.css.children.address_of_(), 'css.sibling'):
self.walk(c, q_id, path)
def __init__(self, root_blkcg, q_id, include_dying=False):
self.include_dying = include_dying
self.blkgs = []
self.walk(root_blkcg, q_id, '')
def __iter__(self):
return iter(self.blkgs)
......
......@@ -77,7 +77,7 @@ static inline int get_cgroup_v1_idx(__u32 *cgrps, int size)
break;
// convert cgroup-id to a map index
cgrp_id = BPF_CORE_READ(cgrp, ancestor_ids[i]);
cgrp_id = BPF_CORE_READ(cgrp, ancestors[i], kn, id);
elem = bpf_map_lookup_elem(&cgrp_idx, &cgrp_id);
if (!elem)
continue;
......
......@@ -5,3 +5,4 @@ test_freezer
test_kmem
test_kill
test_cpu
wait_inotify
# SPDX-License-Identifier: GPL-2.0
CFLAGS += -Wall -pthread
all:
all: ${HELPER_PROGS}
TEST_FILES := with_stress.sh
TEST_PROGS := test_stress.sh
TEST_PROGS := test_stress.sh test_cpuset_prs.sh
TEST_GEN_FILES := wait_inotify
TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core
......
This diff is collapsed.
// SPDX-License-Identifier: GPL-2.0
/*
* Wait until an inotify event on the given cgroup file.
*/
#include <linux/limits.h>
#include <sys/inotify.h>
#include <sys/mman.h>
#include <sys/ptrace.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
static char *file;
static int verbose;
static inline void fail_message(char *msg)
{
fprintf(stderr, msg, file);
exit(1);
}
int main(int argc, char *argv[])
{
char *cmd = argv[0];
int c, fd;
struct pollfd fds = { .events = POLLIN, };
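/* Consume leading options; argv is shifted below so that the cgroup
 * file path always ends up in argv[1]. */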
while ((c = getopt(argc, argv, "v")) != -1) {
switch (c) {
case 'v':
verbose++;
break;
}
argv++, argc--;
}
if (argc != 2) {
fprintf(stderr, usage, cmd);
return -1;
}
file = argv[1];
fd = open(file, O_RDONLY);
if (fd < 0)
fail_message("Cgroup file %s not found!\n");
close(fd);
fd = inotify_init();
if (fd < 0)
fail_message("inotify_init() fails on %s!\n");
if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
fail_message("inotify_add_watch() fails on %s!\n");
fds.fd = fd;
/*
* poll waiting loop
*/
for (;;) {
int ret = poll(&fds, 1, 10000);
if (ret < 0) {
if (errno == EINTR)
continue;
perror("poll");
exit(1);
}
if ((ret > 0) && (fds.revents & POLLIN))
break;
}
if (verbose) {
struct inotify_event events[10];
long len;
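/* Give the queued inotify event(s) a moment to arrive before draining. */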
usleep(1000);
len = read(fd, events, sizeof(events));
printf("Number of events read = %ld\n",
len/sizeof(struct inotify_event));
}
close(fd);
return 0;
}