Commit de6fef50 authored by Linus Torvalds

Merge tag 'cgroup-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - The locking around cpuset hotplug processing has always been a bit of
   a mess, which was worked around by making hotplug processing
   asynchronous. The asynchronicity isn't great and has led to other
   issues.

   We tried to make the behavior synchronous a while ago but that led to
   lockdep splats. Waiman took another stab at cleaning up and making it
   synchronous. The patch has been in -next for well over a month and
   there haven't been any complaints, so fingers crossed.

 - Tracepoints added to help understand rstat lock contention (a brief
   tracefs usage sketch follows the commit metadata below).

 - A bunch of minor changes - doc updates, code cleanups and selftests.

* tag 'cgroup-for-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
  cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
  selftests/cgroup: Drop define _GNU_SOURCE
  docs: cgroup-v1: Update page cache removal functions
  selftests/cgroup: fix uninitialized variables in test_zswap.c
  selftests/cgroup: cpu_hogger init: use {} instead of {NULL}
  selftests/cgroup: fix clang warnings: uninitialized fd variable
  selftests/cgroup: fix clang build failures for abs() calls
  cgroup/cpuset: Remove outdated comment in sched_partition_write()
  cgroup/cpuset: Fix incorrect top_cpuset flags
  cgroup/cpuset: Avoid clearing CS_SCHED_LOAD_BALANCE twice
  cgroup/cpuset: Statically initialize more members of top_cpuset
  cgroup: Avoid unnecessary looping in cgroup_no_v1()
  cgroup, legacy_freezer: update comment for freezer_css_offline()
  docs, cgroup: add entries for pids to cgroup-v2.rst
  cgroup: don't call cgroup1_pidlist_destroy_all() for v2
  cgroup_freezer: update comment for freezer_css_online()
  cgroup/rstat: desc member cgrp in cgroup_rstat_flush_release
  cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
  cgroup/pids: Remove superfluous zeroing
  docs: cgroup-v1: Fix description for css_online
  ...
parents f4b0c4b5 21c38a3b
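
As a usage sketch for the rstat lock tracepoints added by this series (this
block is not part of the commit itself), the events can be watched from
userspace via tracefs. A minimal sketch, assuming tracefs is mounted at
/sys/kernel/tracing and that the new events appear under the existing cgroup
trace group alongside the other cgroup tracepoints:

    # enable the global-lock and per-CPU-lock contention events
    cd /sys/kernel/tracing
    echo 1 > events/cgroup/cgroup_rstat_lock_contended/enable
    echo 1 > events/cgroup/cgroup_rstat_cpu_lock_contended/enable
    # entries printing "lock contended:1" indicate a contended acquisition
    cat trace_pipe

The *_fastpath variants cover the high-frequency update path and are defined
separately so that flush-side contention can be traced without enabling them.
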
@@ -570,7 +570,7 @@ visible to cgroup_for_each_child/descendant_*() iterators. The
 subsystem may choose to fail creation by returning -errno. This
 callback can be used to implement reliable state sharing and
 propagation along the hierarchy. See the comment on
-cgroup_for_each_descendant_pre() for details.
+cgroup_for_each_live_descendant_pre() for details.

 ``void css_offline(struct cgroup *cgrp);``
 (cgroup_mutex held by caller)
...
@@ -102,7 +102,7 @@ Under below explanation, we assume CONFIG_SWAP=y.
 The logic is very clear. (About migration, see below)

 Note:
-  __remove_from_page_cache() is called by remove_from_page_cache()
+  __filemap_remove_folio() is called by filemap_remove_folio()
   and __remove_mapping().

 6. Shmem(tmpfs) Page Cache
...
@@ -1058,12 +1058,15 @@ cpufreq governor about the minimum desired frequency which should always be
 provided by a CPU, as well as the maximum desired frequency, which should not
 be exceeded by a CPU.

-WARNING: cgroup2 doesn't yet support control of realtime processes and
-the cpu controller can only be enabled when all RT processes are in
-the root cgroup. Be aware that system management software may already
-have placed RT processes into nonroot cgroups during the system boot
-process, and these processes may need to be moved to the root cgroup
-before the cpu controller can be enabled.
+WARNING: cgroup2 doesn't yet support control of realtime processes. For
+a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group
+scheduling of realtime processes, the cpu controller can only be enabled
+when all RT processes are in the root cgroup. This limitation does
+not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system
+management software may already have placed RT processes into nonroot
+cgroups during the system boot process, and these processes may need
+to be moved to the root cgroup before the cpu controller can be enabled
+with a CONFIG_RT_GROUP_SCHED enabled kernel.

 CPU Interface Files
@@ -2190,11 +2193,25 @@ PID Interface Files
     Hard limit of number of processes.

   pids.current
-    A read-only single value file which exists on all cgroups.
+    A read-only single value file which exists on non-root cgroups.

     The number of processes currently in the cgroup and its
     descendants.

+  pids.peak
+    A read-only single value file which exists on non-root cgroups.
+
+    The maximum value that the number of processes in the cgroup and its
+    descendants has ever reached.
+
+  pids.events
+    A read-only flat-keyed file which exists on non-root cgroups. The
+    following entries are defined. Unless specified otherwise, a value
+    change in this file generates a file modified event.
+
+      max
+        Number of times fork failed because limit was hit.
+
 Organisational operations are not blocked by cgroup policies, so it is
 possible to have pids.current > pids.max. This can be done by either
 setting the limit to be smaller than pids.current, or attaching enough
...
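
As an illustrative sketch for the two documentation sections updated above
(not part of the patch; the cgroup2 mount point /sys/fs/cgroup and the child
cgroup name "test" are assumptions): on a CONFIG_RT_GROUP_SCHED kernel,
writing "+cpu" to cgroup.subtree_control fails until every realtime task sits
in the root cgroup, which is the situation the reworded WARNING describes.
The pids files can be exercised like this:

    # assumed cgroup2 mount at /sys/fs/cgroup, hypothetical child "test"
    mkdir /sys/fs/cgroup/test
    echo "+pids" > /sys/fs/cgroup/cgroup.subtree_control
    echo 8 > /sys/fs/cgroup/test/pids.max

    cat /sys/fs/cgroup/test/pids.current  # processes in "test" and descendants
    cat /sys/fs/cgroup/test/pids.peak     # highest value pids.current has hit
    cat /sys/fs/cgroup/test/pids.events   # "max <n>": forks failed at the limit
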
@@ -690,7 +690,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(void);
+void cgroup_rstat_flush_release(struct cgroup *cgrp);

 /*
  * Basic resource stats.
...
@@ -70,7 +70,6 @@ extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
 extern void cpuset_update_active_cpus(void);
-extern void cpuset_wait_for_hotplug(void);
 extern void inc_dl_tasks_cs(struct task_struct *task);
 extern void dec_dl_tasks_cs(struct task_struct *task);
 extern void cpuset_lock(void);
@@ -185,8 +184,6 @@ static inline void cpuset_update_active_cpus(void)
     partition_sched_domains(1, NULL, NULL);
 }

-static inline void cpuset_wait_for_hotplug(void) { }
-
 static inline void inc_dl_tasks_cs(struct task_struct *task) { }
 static inline void dec_dl_tasks_cs(struct task_struct *task) { }
 static inline void cpuset_lock(void) { }
...
@@ -204,6 +204,98 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
     TP_ARGS(cgrp, path, val)
 );
DECLARE_EVENT_CLASS(cgroup_rstat,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended),
TP_STRUCT__entry(
__field( int, root )
__field( int, level )
__field( u64, id )
__field( int, cpu )
__field( bool, contended )
),
TP_fast_assign(
__entry->root = cgrp->root->hierarchy_id;
__entry->id = cgroup_id(cgrp);
__entry->level = cgrp->level;
__entry->cpu = cpu;
__entry->contended = contended;
),
TP_printk("root=%d id=%llu level=%d cpu=%d lock contended:%d",
__entry->root, __entry->id, __entry->level,
__entry->cpu, __entry->contended)
);
/* Related to global: cgroup_rstat_lock */
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_locked,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
/* Related to per CPU: cgroup_rstat_cpu_lock */
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_lock_contended_fastpath,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_locked_fastpath,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
TP_ARGS(cgrp, cpu, contended)
);
 #endif /* _TRACE_CGROUP_H */

 /* This part must be outside protection */
...
@@ -1335,6 +1335,7 @@ static int __init cgroup_no_v1(char *str)
             continue;

         cgroup_no_v1_mask |= 1 << i;
+        break;
     }
 }

 return 1;
...
@@ -5368,7 +5368,8 @@ static void css_free_rwork_fn(struct work_struct *work)
     } else {
         /* cgroup free path */
         atomic_dec(&cgrp->root->nr_cgrps);
-        cgroup1_pidlist_destroy_all(cgrp);
+        if (!cgroup_on_dfl(cgrp))
+            cgroup1_pidlist_destroy_all(cgrp);
         cancel_work_sync(&cgrp->release_agent_work);
         bpf_cgrp_storage_free(cgrp);
...
@@ -201,6 +201,14 @@ struct cpuset {
     struct list_head remote_sibling;
 };

+/*
+ * Legacy hierarchy call to cgroup_transfer_tasks() is handled asynchronously
+ */
+struct cpuset_remove_tasks_struct {
+    struct work_struct work;
+    struct cpuset *cs;
+};
+
 /*
  * Exclusive CPUs distributed out to sub-partitions of top_cpuset
  */
@@ -360,9 +368,10 @@ static inline void notify_partition_change(struct cpuset *cs, int old_prs)
 }

 static struct cpuset top_cpuset = {
-    .flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
-          (1 << CS_MEM_EXCLUSIVE)),
+    .flags = BIT(CS_ONLINE) | BIT(CS_CPU_EXCLUSIVE) |
+         BIT(CS_MEM_EXCLUSIVE) | BIT(CS_SCHED_LOAD_BALANCE),
     .partition_root_state = PRS_ROOT,
+    .relax_domain_level = -1,
     .remote_sibling = LIST_HEAD_INIT(top_cpuset.remote_sibling),
 };
@@ -449,12 +458,6 @@ static DEFINE_SPINLOCK(callback_lock);

 static struct workqueue_struct *cpuset_migrate_mm_wq;

-/*
- * CPU / memory hotplug is handled asynchronously.
- */
-static void cpuset_hotplug_workfn(struct work_struct *work);
-static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);
-
 static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);

 static inline void check_insane_mems_config(nodemask_t *nodes)
@@ -540,22 +543,10 @@ static void guarantee_online_cpus(struct task_struct *tsk,
     rcu_read_lock();
     cs = task_cs(tsk);

-    while (!cpumask_intersects(cs->effective_cpus, pmask)) {
+    while (!cpumask_intersects(cs->effective_cpus, pmask))
         cs = parent_cs(cs);
-        if (unlikely(!cs)) {
-            /*
-             * The top cpuset doesn't have any online cpu as a
-             * consequence of a race between cpuset_hotplug_work
-             * and cpu hotplug notifier. But we know the top
-             * cpuset's effective_cpus is on its way to be
-             * identical to cpu_online_mask.
-             */
-            goto out_unlock;
-        }
-    }
-    cpumask_and(pmask, pmask, cs->effective_cpus);
-out_unlock:
+
+    cpumask_and(pmask, pmask, cs->effective_cpus);
     rcu_read_unlock();
 }
@@ -1217,7 +1208,7 @@ static void rebuild_sched_domains_locked(void)
     /*
      * If we have raced with CPU hotplug, return early to avoid
      * passing doms with offlined cpu to partition_sched_domains().
-     * Anyways, cpuset_hotplug_workfn() will rebuild sched domains.
+     * Anyways, cpuset_handle_hotplug() will rebuild sched domains.
      *
      * With no CPUs in any subpartitions, top_cpuset's effective CPUs
      * should be the same as the active CPUs, so checking only top_cpuset
...@@ -1260,12 +1251,17 @@ static void rebuild_sched_domains_locked(void) ...@@ -1260,12 +1251,17 @@ static void rebuild_sched_domains_locked(void)
} }
#endif /* CONFIG_SMP */ #endif /* CONFIG_SMP */
void rebuild_sched_domains(void) static void rebuild_sched_domains_cpuslocked(void)
{ {
cpus_read_lock();
mutex_lock(&cpuset_mutex); mutex_lock(&cpuset_mutex);
rebuild_sched_domains_locked(); rebuild_sched_domains_locked();
mutex_unlock(&cpuset_mutex); mutex_unlock(&cpuset_mutex);
}
void rebuild_sched_domains(void)
{
cpus_read_lock();
rebuild_sched_domains_cpuslocked();
cpus_read_unlock(); cpus_read_unlock();
} }
@@ -2079,14 +2075,11 @@ static int update_parent_effective_cpumask(struct cpuset *cs, int cmd,
     /*
      * For partcmd_update without newmask, it is being called from
-     * cpuset_hotplug_workfn() where cpus_read_lock() wasn't taken.
-     * Update the load balance flag and scheduling domain if
-     * cpus_read_trylock() is successful.
+     * cpuset_handle_hotplug(). Update the load balance flag and
+     * scheduling domain accordingly.
      */
-    if ((cmd == partcmd_update) && !newmask && cpus_read_trylock()) {
+    if ((cmd == partcmd_update) && !newmask)
         update_partition_sd_lb(cs, old_prs);
-        cpus_read_unlock();
-    }

     notify_partition_change(cs, old_prs);
     return 0;
@@ -3599,8 +3592,8 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
      * proceeding, so that we don't end up keep removing tasks added
      * after execution capability is restored.
      *
-     * cpuset_hotplug_work calls back into cgroup core via
-     * cgroup_transfer_tasks() and waiting for it from a cgroupfs
+     * cpuset_handle_hotplug may call back into cgroup core asynchronously
+     * via cgroup_transfer_tasks() and waiting for it from a cgroupfs
      * operation like this one can lead to a deadlock through kernfs
      * active_ref protection. Let's break the protection. Losing the
      * protection is okay as we check whether @cs is online after
@@ -3609,7 +3602,6 @@ static ssize_t cpuset_write_resmask(struct kernfs_open_file *of,
      */
     css_get(&cs->css);
     kernfs_break_active_protection(of->kn);
-    flush_work(&cpuset_hotplug_work);

     cpus_read_lock();
     mutex_lock(&cpuset_mutex);
@@ -3782,9 +3774,6 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
     buf = strstrip(buf);

-    /*
-     * Convert "root" to ENABLED, and convert "member" to DISABLED.
-     */
     if (!strcmp(buf, "root"))
         val = PRS_ROOT;
     else if (!strcmp(buf, "member"))
@@ -4060,11 +4049,6 @@ static int cpuset_css_online(struct cgroup_subsys_state *css)
         cs->effective_mems = parent->effective_mems;
         cs->use_parent_ecpus = true;
         parent->child_ecpus_count++;
-        /*
-         * Clear CS_SCHED_LOAD_BALANCE if parent is isolated
-         */
-        if (!is_sched_load_balance(parent))
-            clear_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
     }

     /*
@@ -4318,8 +4302,6 @@ int __init cpuset_init(void)
     nodes_setall(top_cpuset.effective_mems);
     fmeter_init(&top_cpuset.fmeter);
-    set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
-    top_cpuset.relax_domain_level = -1;
     INIT_LIST_HEAD(&remote_children);

     BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));
@@ -4354,6 +4336,16 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
     }
 }

+static void cpuset_migrate_tasks_workfn(struct work_struct *work)
+{
+    struct cpuset_remove_tasks_struct *s;
+
+    s = container_of(work, struct cpuset_remove_tasks_struct, work);
+    remove_tasks_in_empty_cpuset(s->cs);
+    css_put(&s->cs->css);
+    kfree(s);
+}
+
 static void
 hotplug_update_tasks_legacy(struct cpuset *cs,
                 struct cpumask *new_cpus, nodemask_t *new_mems,
@@ -4383,12 +4375,21 @@ hotplug_update_tasks_legacy(struct cpuset *cs,
     /*
      * Move tasks to the nearest ancestor with execution resources,
      * This is full cgroup operation which will also call back into
-     * cpuset. Should be done outside any lock.
+     * cpuset. Execute it asynchronously using workqueue.
      */
-    if (is_empty) {
-        mutex_unlock(&cpuset_mutex);
-        remove_tasks_in_empty_cpuset(cs);
-        mutex_lock(&cpuset_mutex);
+    if (is_empty && cs->css.cgroup->nr_populated_csets &&
+        css_tryget_online(&cs->css)) {
+        struct cpuset_remove_tasks_struct *s;
+
+        s = kzalloc(sizeof(*s), GFP_KERNEL);
+        if (WARN_ON_ONCE(!s)) {
+            css_put(&cs->css);
+            return;
+        }
+
+        s->cs = cs;
+        INIT_WORK(&s->work, cpuset_migrate_tasks_workfn);
+        schedule_work(&s->work);
     }
 }
@@ -4421,30 +4422,6 @@ void cpuset_force_rebuild(void)
     force_rebuild = true;
 }
/*
* Attempt to acquire a cpus_read_lock while a hotplug operation may be in
* progress.
* Return: true if successful, false otherwise
*
* To avoid circular lock dependency between cpuset_mutex and cpus_read_lock,
* cpus_read_trylock() is used here to acquire the lock.
*/
static bool cpuset_hotplug_cpus_read_trylock(void)
{
int retries = 0;
while (!cpus_read_trylock()) {
/*
* CPU hotplug still in progress. Retry 5 times
* with a 10ms wait before bailing out.
*/
if (++retries > 5)
return false;
msleep(10);
}
return true;
}
 /**
  * cpuset_hotplug_update_tasks - update tasks in a cpuset for hotunplug
  * @cs: cpuset in interest
@@ -4493,13 +4470,11 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
         compute_partition_effective_cpumask(cs, &new_cpus);

     if (remote && cpumask_empty(&new_cpus) &&
-        partition_is_populated(cs, NULL) &&
-        cpuset_hotplug_cpus_read_trylock()) {
+        partition_is_populated(cs, NULL)) {
         remote_partition_disable(cs, tmp);
         compute_effective_cpumask(&new_cpus, cs, parent);
         remote = false;
         cpuset_force_rebuild();
-        cpus_read_unlock();
     }

     /*
@@ -4519,18 +4494,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
     else if (is_partition_valid(parent) && is_partition_invalid(cs))
         partcmd = partcmd_update;

-    /*
-     * cpus_read_lock needs to be held before calling
-     * update_parent_effective_cpumask(). To avoid circular lock
-     * dependency between cpuset_mutex and cpus_read_lock,
-     * cpus_read_trylock() is used here to acquire the lock.
-     */
     if (partcmd >= 0) {
-        if (!cpuset_hotplug_cpus_read_trylock())
-            goto update_tasks;
-
         update_parent_effective_cpumask(cs, partcmd, NULL, tmp);
-        cpus_read_unlock();
         if ((partcmd == partcmd_invalidate) || is_partition_valid(cs)) {
             compute_partition_effective_cpumask(cs, &new_cpus);
             cpuset_force_rebuild();
@@ -4558,8 +4523,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
 }

 /**
- * cpuset_hotplug_workfn - handle CPU/memory hotunplug for a cpuset
- * @work: unused
+ * cpuset_handle_hotplug - handle CPU/memory hot{,un}plug for a cpuset
  *
  * This function is called after either CPU or memory configuration has
  * changed and updates cpuset accordingly. The top_cpuset is always
@@ -4573,8 +4537,10 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
  *
  * Note that CPU offlining during suspend is ignored. We don't modify
  * cpusets across suspend/resume cycles at all.
+ *
+ * CPU / memory hotplug is handled synchronously.
  */
-static void cpuset_hotplug_workfn(struct work_struct *work)
+static void cpuset_handle_hotplug(void)
 {
     static cpumask_t new_cpus;
     static nodemask_t new_mems;
@@ -4585,6 +4551,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
     if (on_dfl && !alloc_cpumasks(NULL, &tmp))
         ptmp = &tmp;

+    lockdep_assert_cpus_held();
     mutex_lock(&cpuset_mutex);

     /* fetch the available cpus/mems and find out which changed how */
@@ -4666,7 +4633,7 @@ static void cpuset_hotplug_workfn(struct work_struct *work)
     /* rebuild sched domains if cpus_allowed has changed */
     if (cpus_updated || force_rebuild) {
         force_rebuild = false;
-        rebuild_sched_domains();
+        rebuild_sched_domains_cpuslocked();
     }

     free_cpumasks(NULL, ptmp);
@@ -4679,12 +4646,7 @@ void cpuset_update_active_cpus(void)
      * inside cgroup synchronization. Bounce actual hotplug processing
      * to a work item to avoid reverse locking order.
      */
-    schedule_work(&cpuset_hotplug_work);
-}
-
-void cpuset_wait_for_hotplug(void)
-{
-    flush_work(&cpuset_hotplug_work);
+    cpuset_handle_hotplug();
 }

 /*
@@ -4695,7 +4657,7 @@ void cpuset_wait_for_hotplug(void)
 static int cpuset_track_online_nodes(struct notifier_block *self,
                 unsigned long action, void *arg)
 {
-    schedule_work(&cpuset_hotplug_work);
+    cpuset_handle_hotplug();
     return NOTIFY_OK;
 }
...
@@ -106,8 +106,7 @@ freezer_css_alloc(struct cgroup_subsys_state *parent_css)
  * @css: css being created
  *
  * We're committing to creation of @css. Mark it online and inherit
- * parent's freezing state while holding both parent's and our
- * freezer->lock.
+ * parent's freezing state while holding cpus read lock and freezer_mutex.
  */
 static int freezer_css_online(struct cgroup_subsys_state *css)
 {
@@ -133,7 +132,7 @@ static int freezer_css_online(struct cgroup_subsys_state *css)
  * freezer_css_offline - initiate destruction of a freezer css
  * @css: css being destroyed
  *
- * @css is going away. Mark it dead and decrement system_freezing_count if
+ * @css is going away. Mark it dead and decrement freezer_active if
  * it was holding one.
  */
 static void freezer_css_offline(struct cgroup_subsys_state *css)
...
@@ -75,9 +75,7 @@ pids_css_alloc(struct cgroup_subsys_state *parent)
     if (!pids)
         return ERR_PTR(-ENOMEM);

-    atomic64_set(&pids->counter, 0);
     atomic64_set(&pids->limit, PIDS_MAX);
-    atomic64_set(&pids->events_limit, 0);

     return &pids->css;
 }
...
@@ -7,6 +7,8 @@
 #include <linux/btf.h>
 #include <linux/btf_ids.h>

+#include <trace/events/cgroup.h>
+
 static DEFINE_SPINLOCK(cgroup_rstat_lock);
 static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
@@ -17,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
     return per_cpu_ptr(cgrp->rstat_cpu, cpu);
 }

/*
* Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
*
* This makes it easier to diagnose locking issues and contention in
* production environments. The parameter @fast_path determine the
* tracepoints being added, allowing us to diagnose "flush" related
* operations without handling high-frequency fast-path "update" events.
*/
static __always_inline
unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
struct cgroup *cgrp, const bool fast_path)
{
unsigned long flags;
bool contended;
/*
* The _irqsave() is needed because cgroup_rstat_lock is
* spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
* this lock with the _irq() suffix only disables interrupts on
* a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
* interrupts on both configurations. The _irqsave() ensures
* that interrupts are always disabled and later restored.
*/
contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
if (contended) {
if (fast_path)
trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
else
trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
raw_spin_lock_irqsave(cpu_lock, flags);
}
if (fast_path)
trace_cgroup_rstat_cpu_locked_fastpath(cgrp, cpu, contended);
else
trace_cgroup_rstat_cpu_locked(cgrp, cpu, contended);
return flags;
}
static __always_inline
void _cgroup_rstat_cpu_unlock(raw_spinlock_t *cpu_lock, int cpu,
struct cgroup *cgrp, unsigned long flags,
const bool fast_path)
{
if (fast_path)
trace_cgroup_rstat_cpu_unlock_fastpath(cgrp, cpu, false);
else
trace_cgroup_rstat_cpu_unlock(cgrp, cpu, false);
raw_spin_unlock_irqrestore(cpu_lock, flags);
}
 /**
  * cgroup_rstat_updated - keep track of updated rstat_cpu
  * @cgrp: target cgroup
@@ -42,7 +98,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
     if (data_race(cgroup_rstat_cpu(cgrp, cpu)->updated_next))
         return;

-    raw_spin_lock_irqsave(cpu_lock, flags);
+    flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);

     /* put @cgrp and all ancestors on the corresponding updated lists */
     while (true) {
@@ -70,7 +126,7 @@ __bpf_kfunc void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
         cgrp = parent;
     }

-    raw_spin_unlock_irqrestore(cpu_lock, flags);
+    _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
 }

 /**
@@ -151,15 +207,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
     struct cgroup *head = NULL, *parent, *child;
     unsigned long flags;

-    /*
-     * The _irqsave() is needed because cgroup_rstat_lock is
-     * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
-     * this lock with the _irq() suffix only disables interrupts on
-     * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
-     * interrupts on both configurations. The _irqsave() ensures
-     * that interrupts are always disabled and later restored.
-     */
-    raw_spin_lock_irqsave(cpu_lock, flags);
+    flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);

     /* Return NULL if this subtree is not on-list */
     if (!rstatc->updated_next)
@@ -196,7 +244,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
     if (child != root)
         head = cgroup_rstat_push_children(head, child, cpu);
 unlock_ret:
-    raw_spin_unlock_irqrestore(cpu_lock, flags);
+    _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);

     return head;
 }
@@ -222,6 +270,35 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 __bpf_hook_end();

/*
* Helper functions for locking cgroup_rstat_lock.
*
* This makes it easier to diagnose locking issues and contention in
* production environments. The parameter @cpu_in_loop indicate lock
* was released and re-taken when collection data from the CPUs. The
* value -1 is used when obtaining the main lock else this is the CPU
* number processed last.
*/
static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
__acquires(&cgroup_rstat_lock)
{
bool contended;
contended = !spin_trylock_irq(&cgroup_rstat_lock);
if (contended) {
trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
spin_lock_irq(&cgroup_rstat_lock);
}
trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
}
static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
__releases(&cgroup_rstat_lock)
{
trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
spin_unlock_irq(&cgroup_rstat_lock);
}
 /* see cgroup_rstat_flush() */
 static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
     __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -248,10 +325,10 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
         /* play nice and yield if necessary */
         if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
-            spin_unlock_irq(&cgroup_rstat_lock);
+            __cgroup_rstat_unlock(cgrp, cpu);
             if (!cond_resched())
                 cpu_relax();
-            spin_lock_irq(&cgroup_rstat_lock);
+            __cgroup_rstat_lock(cgrp, cpu);
         }
     }
 }
@@ -273,9 +350,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 {
     might_sleep();

-    spin_lock_irq(&cgroup_rstat_lock);
+    __cgroup_rstat_lock(cgrp, -1);
     cgroup_rstat_flush_locked(cgrp);
-    spin_unlock_irq(&cgroup_rstat_lock);
+    __cgroup_rstat_unlock(cgrp, -1);
 }

 /**
@@ -291,17 +368,18 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
     __acquires(&cgroup_rstat_lock)
 {
     might_sleep();
-    spin_lock_irq(&cgroup_rstat_lock);
+    __cgroup_rstat_lock(cgrp, -1);
     cgroup_rstat_flush_locked(cgrp);
 }

 /**
  * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
+ * @cgrp: cgroup used by tracepoint
  */
-void cgroup_rstat_flush_release(void)
+void cgroup_rstat_flush_release(struct cgroup *cgrp)
     __releases(&cgroup_rstat_lock)
 {
-    spin_unlock_irq(&cgroup_rstat_lock);
+    __cgroup_rstat_unlock(cgrp, -1);
 }

 int cgroup_rstat_init(struct cgroup *cgrp)
@@ -533,7 +611,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
 #ifdef CONFIG_SCHED_CORE
         forceidle_time = cgrp->bstat.forceidle_sum;
 #endif
-        cgroup_rstat_flush_release();
+        cgroup_rstat_flush_release(cgrp);
     } else {
         root_cgroup_cputime(&bstat);
         usage = bstat.cputime.sum_exec_runtime;
...
@@ -1208,52 +1208,6 @@ void __init cpuhp_threads_init(void)
     kthread_unpark(this_cpu_read(cpuhp_state.thread));
 }

/*
*
* Serialize hotplug trainwrecks outside of the cpu_hotplug_lock
* protected region.
*
* The operation is still serialized against concurrent CPU hotplug via
* cpu_add_remove_lock, i.e. CPU map protection. But it is _not_
* serialized against other hotplug related activity like adding or
* removing of state callbacks and state instances, which invoke either the
* startup or the teardown callback of the affected state.
*
* This is required for subsystems which are unfixable vs. CPU hotplug and
* evade lock inversion problems by scheduling work which has to be
* completed _before_ cpu_up()/_cpu_down() returns.
*
* Don't even think about adding anything to this for any new code or even
* drivers. It's only purpose is to keep existing lock order trainwrecks
* working.
*
* For cpu_down() there might be valid reasons to finish cleanups which are
* not required to be done under cpu_hotplug_lock, but that's a different
* story and would be not invoked via this.
*/
static void cpu_up_down_serialize_trainwrecks(bool tasks_frozen)
{
/*
* cpusets delegate hotplug operations to a worker to "solve" the
* lock order problems. Wait for the worker, but only if tasks are
* _not_ frozen (suspend, hibernate) as that would wait forever.
*
* The wait is required because otherwise the hotplug operation
* returns with inconsistent state, which could even be observed in
* user space when a new CPU is brought up. The CPU plug uevent
* would be delivered and user space reacting on it would fail to
* move tasks to the newly plugged CPU up to the point where the
* work has finished because up to that point the newly plugged CPU
* is not assignable in cpusets/cgroups. On unplug that's not
* necessarily a visible issue, but it is still inconsistent state,
* which is the real problem which needs to be "fixed". This can't
* prevent the transient state between scheduling the work and
* returning from waiting for it.
*/
if (!tasks_frozen)
cpuset_wait_for_hotplug();
}
 #ifdef CONFIG_HOTPLUG_CPU
 #ifndef arch_clear_mm_cpumask_cpu
 #define arch_clear_mm_cpumask_cpu(cpu, mm) cpumask_clear_cpu(cpu, mm_cpumask(mm))
@@ -1494,7 +1448,6 @@ static int __ref _cpu_down(unsigned int cpu, int tasks_frozen,
      */
     lockup_detector_cleanup();
     arch_smt_update();
-    cpu_up_down_serialize_trainwrecks(tasks_frozen);
     return ret;
 }
@@ -1728,7 +1681,6 @@ static int _cpu_up(unsigned int cpu, int tasks_frozen, enum cpuhp_state target)
 out:
     cpus_write_unlock();
     arch_smt_update();
-    cpu_up_down_serialize_trainwrecks(tasks_frozen);
     return ret;
 }
...
@@ -194,8 +194,6 @@ void thaw_processes(void)
     __usermodehelper_set_disable_depth(UMH_FREEZING);
     thaw_workqueues();

-    cpuset_wait_for_hotplug();
-
     read_lock(&tasklist_lock);
     for_each_process_thread(g, p) {
         /* No other threads should have PF_SUSPEND_TASK set */
...
@@ -4,7 +4,7 @@ CFLAGS += -Wall -pthread
 all: ${HELPER_PROGS}

 TEST_FILES := with_stress.sh
-TEST_PROGS := test_stress.sh test_cpuset_prs.sh
+TEST_PROGS := test_stress.sh test_cpuset_prs.sh test_cpuset_v1_hp.sh
 TEST_GEN_FILES := wait_inotify
 TEST_GEN_PROGS = test_memcontrol
 TEST_GEN_PROGS += test_kmem
...
 /* SPDX-License-Identifier: GPL-2.0 */
-#define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <linux/limits.h>
@@ -195,10 +192,10 @@ int cg_write_numeric(const char *cgroup, const char *control, long value)
     return cg_write(cgroup, control, buf);
 }

-int cg_find_unified_root(char *root, size_t len)
+int cg_find_unified_root(char *root, size_t len, bool *nsdelegate)
 {
     char buf[10 * PAGE_SIZE];
-    char *fs, *mount, *type;
+    char *fs, *mount, *type, *options;
     const char delim[] = "\n\t ";

     if (read_text("/proc/self/mounts", buf, sizeof(buf)) <= 0)
@@ -211,12 +208,14 @@ int cg_find_unified_root(char *root, size_t len)
     for (fs = strtok(buf, delim); fs; fs = strtok(NULL, delim)) {
         mount = strtok(NULL, delim);
         type = strtok(NULL, delim);
-        strtok(NULL, delim);
+        options = strtok(NULL, delim);
         strtok(NULL, delim);
         strtok(NULL, delim);

         if (strcmp(type, "cgroup2") == 0) {
             strncpy(root, mount, len);
+            if (nsdelegate)
+                *nsdelegate = !!strstr(options, "nsdelegate");
             return 0;
         }
     }
...
@@ -18,10 +18,10 @@
  */
 static inline int values_close(long a, long b, int err)
 {
-    return abs(a - b) <= (a + b) / 100 * err;
+    return labs(a - b) <= (a + b) / 100 * err;
 }

-extern int cg_find_unified_root(char *root, size_t len);
+extern int cg_find_unified_root(char *root, size_t len, bool *nsdelegate);
 extern char *cg_name(const char *root, const char *name);
 extern char *cg_name_indexed(const char *root, const char *name, int index);
 extern char *cg_control(const char *cgroup, const char *control);
...
 /* SPDX-License-Identifier: GPL-2.0 */
-#define _GNU_SOURCE
 #include <errno.h>
 #include <fcntl.h>
 #include <linux/limits.h>
@@ -18,6 +16,8 @@
 #include "../kselftest.h"
 #include "cgroup_util.h"

+static bool nsdelegate;
+
 static int touch_anon(char *buf, size_t size)
 {
     int fd;
@@ -775,6 +775,9 @@ static int test_cgcore_lesser_ns_open(const char *root)
     pid_t pid;
     int status;

+    if (!nsdelegate)
+        return KSFT_SKIP;
+
     cg_test_a = cg_name(root, "cg_test_a");
     cg_test_b = cg_name(root, "cg_test_b");
@@ -862,7 +865,7 @@ int main(int argc, char *argv[])
     char root[PATH_MAX];
     int i, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), &nsdelegate))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     if (cg_read_strstr(root, "cgroup.subtree_control", "memory"))
...
 // SPDX-License-Identifier: GPL-2.0
-#define _GNU_SOURCE
 #include <linux/limits.h>
 #include <sys/sysinfo.h>
 #include <sys/wait.h>
@@ -237,7 +235,7 @@ run_cpucg_weight_test(
 {
     int ret = KSFT_FAIL, i;
     char *parent = NULL;
-    struct cpu_hogger children[3] = {NULL};
+    struct cpu_hogger children[3] = {};

     parent = cg_name(root, "cpucg_test_0");
     if (!parent)
@@ -408,7 +406,7 @@ run_cpucg_nested_weight_test(const char *root, bool overprovisioned)
 {
     int ret = KSFT_FAIL, i;
     char *parent = NULL, *child = NULL;
-    struct cpu_hogger leaf[3] = {NULL};
+    struct cpu_hogger leaf[3] = {};
     long nested_leaf_usage, child_usage;
     int nprocs = get_nprocs();
@@ -700,7 +698,7 @@ int main(int argc, char *argv[])
     char root[PATH_MAX];
     int i, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     if (cg_read_strstr(root, "cgroup.subtree_control", "cpu"))
...
@@ -249,7 +249,7 @@ int main(int argc, char *argv[])
     char root[PATH_MAX];
     int i, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     if (cg_read_strstr(root, "cgroup.subtree_control", "cpuset"))
...
#!/bin/sh
# SPDX-License-Identifier: GPL-2.0
#
# Test the special cpuset v1 hotplug case where a cpuset that becomes empty
# of CPUs will force migration of its tasks out to an ancestor.
#
skip_test() {
echo "$1"
echo "Test SKIPPED"
exit 4 # ksft_skip
}
[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
# Find cpuset v1 mount point
CPUSET=$(mount -t cgroup | grep cpuset | head -1 | awk -e '{print $3}')
[[ -n "$CPUSET" ]] || skip_test "cpuset v1 mount point not found!"
#
# Create a test cpuset, put a CPU and a task there and offline that CPU
#
TDIR=test$$
[[ -d $CPUSET/$TDIR ]] || mkdir $CPUSET/$TDIR
echo 1 > $CPUSET/$TDIR/cpuset.cpus
echo 0 > $CPUSET/$TDIR/cpuset.mems
sleep 10&
TASK=$!
echo $TASK > $CPUSET/$TDIR/tasks
NEWCS=$(cat /proc/$TASK/cpuset)
[[ $NEWCS != "/$TDIR" ]] && {
echo "Unexpected cpuset $NEWCS, test FAILED!"
exit 1
}
echo 0 > /sys/devices/system/cpu/cpu1/online
sleep 0.5
echo 1 > /sys/devices/system/cpu/cpu1/online
NEWCS=$(cat /proc/$TASK/cpuset)
rmdir $CPUSET/$TDIR
[[ $NEWCS != "/" ]] && {
echo "cpuset $NEWCS, test FAILED!"
exit 1
}
echo "Test PASSED"
exit 0
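
As a usage note (not part of the patch): the script above is listed in
TEST_PROGS in the selftests Makefile earlier in the diff, so besides running
it directly as root on a machine with a cpuset v1 mount and at least two
CPUs, it can be driven through the standard kselftest invocation:

    # run the cgroup selftests, including test_cpuset_v1_hp.sh
    make -C tools/testing/selftests TARGETS=cgroup run_tests
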
@@ -827,7 +827,7 @@ int main(int argc, char *argv[])
     char root[PATH_MAX];
     int i, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     for (i = 0; i < ARRAY_SIZE(tests); i++) {
         switch (tests[i].fn(root)) {
...
 // SPDX-License-Identifier: GPL-2.0
-#define _GNU_SOURCE
 #include <linux/limits.h>
 #include <sys/mman.h>
 #include <stdio.h>
@@ -214,7 +212,7 @@ int main(int argc, char **argv)
         return ret;
     }

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     switch (test_hugetlb_memcg(root)) {
...
@@ -276,7 +276,7 @@ int main(int argc, char *argv[])
     char root[PATH_MAX];
     int i, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     for (i = 0; i < ARRAY_SIZE(tests); i++) {
         switch (tests[i].fn(root)) {
...
 // SPDX-License-Identifier: GPL-2.0
-#define _GNU_SOURCE
 #include <linux/limits.h>
 #include <fcntl.h>
 #include <stdio.h>
@@ -192,7 +190,7 @@ static int test_kmem_memcg_deletion(const char *root)
         goto cleanup;

     sum = anon + file + kernel + sock;
-    if (abs(sum - current) < MAX_VMSTAT_ERROR) {
+    if (labs(sum - current) < MAX_VMSTAT_ERROR) {
         ret = KSFT_PASS;
     } else {
         printf("memory.current = %ld\n", current);
@@ -380,7 +378,7 @@ static int test_percpu_basic(const char *root)
     current = cg_read_long(parent, "memory.current");
     percpu = cg_read_key_long(parent, "memory.stat", "percpu ");

-    if (current > 0 && percpu > 0 && abs(current - percpu) <
+    if (current > 0 && percpu > 0 && labs(current - percpu) <
         MAX_VMSTAT_ERROR)
         ret = KSFT_PASS;
     else
@@ -420,7 +418,7 @@ int main(int argc, char **argv)
     char root[PATH_MAX];
     int i, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     /*
...
 /* SPDX-License-Identifier: GPL-2.0 */
-#define _GNU_SOURCE
 #include <linux/limits.h>
 #include <linux/oom.h>
 #include <fcntl.h>
@@ -716,7 +714,9 @@ static bool reclaim_until(const char *memcg, long goal)
  */
 static int test_memcg_reclaim(const char *root)
 {
-    int ret = KSFT_FAIL, fd, retries;
+    int ret = KSFT_FAIL;
+    int fd = -1;
+    int retries;
     char *memcg;
     long current, expected_usage;
@@ -1314,7 +1314,7 @@ int main(int argc, char **argv)
     char root[PATH_MAX];
     int i, proc_status, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     /*
...
 // SPDX-License-Identifier: GPL-2.0
-#define _GNU_SOURCE
 #include <linux/limits.h>
 #include <unistd.h>
 #include <stdio.h>
@@ -257,7 +255,7 @@ static int test_no_invasive_cgroup_shrink(const char *root)
 {
     int ret = KSFT_FAIL;
     size_t control_allocation_size = MB(10);
-    char *control_allocation, *wb_group = NULL, *control_group = NULL;
+    char *control_allocation = NULL, *wb_group = NULL, *control_group = NULL;

     wb_group = setup_test_group_1M(root, "per_memcg_wb_test1");
     if (!wb_group)
@@ -342,7 +340,7 @@ static int test_no_kmem_bypass(const char *root)
     struct sysinfo sys_info;
     int ret = KSFT_FAIL;
     int child_status;
-    char *test_group;
+    char *test_group = NULL;
     pid_t child_pid;

     /* Read sys info and compute test values accordingly */
@@ -440,7 +438,7 @@ int main(int argc, char **argv)
     char root[PATH_MAX];
     int i, ret = EXIT_SUCCESS;

-    if (cg_find_unified_root(root, sizeof(root)))
+    if (cg_find_unified_root(root, sizeof(root), NULL))
         ksft_exit_skip("cgroup v2 isn't mounted\n");

     if (!zswap_configured())
...