Commit bbe179f8 authored by Linus Torvalds

Merge branch 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:

 - threadgroup_lock got reorganized so that its users can pick the
   actual locking mechanism to use.  Its only user - cgroups - is
   updated to use a percpu_rwsem instead of per-process rwsem.

   This makes things a bit lighter on hot paths and allows cgroups to
   perform and fail multi-task (a process) migrations atomically.
   Multi-task migrations are used in several places including the
   unified hierarchy.

 - Delegation rule and documentation added to unified hierarchy.  This
   will likely be the last interface update from the cgroup core side
   for unified hierarchy before lifting the devel mask.

 - Some groundwork for the pids controller which is scheduled to be
   merged in the coming devel cycle.

* 'for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: add delegation section to unified hierarchy documentation
  cgroup: require write perm on common ancestor when moving processes on the default hierarchy
  cgroup: separate out cgroup_procs_write_permission() from __cgroup_procs_write()
  kernfs: make kernfs_get_inode() public
  MAINTAINERS: add a cgroup core co-maintainer
  cgroup: fix uninitialised iterator in for_each_subsys_which
  cgroup: replace explicit ss_mask checking with for_each_subsys_which
  cgroup: use bitmask to filter for_each_subsys
  cgroup: add seq_file forward declaration for struct cftype
  cgroup: simplify threadgroup locking
  sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem
  sched, cgroup: reorganize threadgroup locking
  cgroup: switch to unsigned long for bitmasks
  cgroup: reorganize include/linux/cgroup.h
  cgroup: separate out include/linux/cgroup-defs.h
  cgroup: fix some comment typos
parents 4b703b1d 8a0792ef
@@ -17,15 +17,18 @@ CONTENTS
 3. Structural Constraints
   3-1. Top-down
   3-2. No internal tasks
-4. Other Changes
-  4-1. [Un]populated Notification
-  4-2. Other Core Changes
-  4-3. Per-Controller Changes
-    4-3-1. blkio
-    4-3-2. cpuset
-    4-3-3. memory
-5. Planned Changes
-  5-1. CAP for resource control
+4. Delegation
+  4-1. Model of delegation
+  4-2. Common ancestor rule
+5. Other Changes
+  5-1. [Un]populated Notification
+  5-2. Other Core Changes
+  5-3. Per-Controller Changes
+    5-3-1. blkio
+    5-3-2. cpuset
+    5-3-3. memory
+6. Planned Changes
+  6-1. CAP for resource control

 1. Background
@@ -245,9 +248,72 @@ cgroup must create children and transfer all its tasks to the children
 before enabling controllers in its "cgroup.subtree_control" file.

-4. Other Changes
+4. Delegation

-4-1. [Un]populated Notification
+4-1. Model of delegation
+
+A cgroup can be delegated to a less privileged user by granting write
+access of the directory and its "cgroup.procs" file to the user. Note
+that the resource control knobs in a given directory concern the
+resources of the parent and thus must not be delegated along with the
+directory.
+
+Once delegated, the user can build sub-hierarchy under the directory,
+organize processes as it sees fit and further distribute the resources
+it got from the parent. The limits and other settings of all resource
+controllers are hierarchical and regardless of what happens in the
+delegated sub-hierarchy, nothing can escape the resource restrictions
+imposed by the parent.
+
+Currently, cgroup doesn't impose any restrictions on the number of
+cgroups in or nesting depth of a delegated sub-hierarchy; however,
+this may in the future be limited explicitly.
+
+4-2. Common ancestor rule
+
+On the unified hierarchy, to write to a "cgroup.procs" file, in
+addition to the usual write permission to the file and uid match, the
+writer must also have write access to the "cgroup.procs" file of the
+common ancestor of the source and destination cgroups. This prevents
+delegatees from smuggling processes across disjoint sub-hierarchies.
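The rule above can be modeled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel implementation: `struct cg_node`, `cg_common_ancestor()` and `cg_may_migrate()` are hypothetical names, write access is reduced to one boolean per node, and the uid match is assumed to have been checked elsewhere.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of a cgroup; not the kernel's struct cgroup. */
struct cg_node {
	const struct cg_node *parent;	/* NULL for the root */
	bool procs_writable;		/* may the writer open "cgroup.procs" for writing? */
};

/* Distance from the root, which has depth 0. */
static int cg_depth(const struct cg_node *n)
{
	int d = 0;

	assert(n != NULL);
	while (n->parent) {
		n = n->parent;
		d++;
	}
	return d;
}

/*
 * Nearest common ancestor of two nodes of the same tree.  A node counts
 * as its own ancestor, so moving from a cgroup to its parent yields the
 * parent.
 */
static const struct cg_node *cg_common_ancestor(const struct cg_node *a,
						const struct cg_node *b)
{
	int da = cg_depth(a), db = cg_depth(b);

	/* Bring both nodes to the same depth, then climb in lockstep. */
	for (; da > db; da--)
		a = a->parent;
	for (; db > da; db--)
		b = b->parent;
	while (a != b) {
		a = a->parent;
		b = b->parent;
	}
	return a;
}

/* 0 if the move from src to dst passes both write checks, -EACCES otherwise. */
static int cg_may_migrate(const struct cg_node *src, const struct cg_node *dst)
{
	if (!dst->procs_writable)
		return -EACCES;
	if (!cg_common_ancestor(src, dst)->procs_writable)
		return -EACCES;
	return 0;
}
```

For instance, with a root the delegatee cannot write and two delegated subtrees beneath it, a move between two nodes of one subtree passes, while a move from one subtree into the other returns -EACCES because the common ancestor is the root.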
+
+Let's say cgroups C0 and C1 have been delegated to user U0 who created
+C00, C01 under C0 and C10 under C1 as follows.
+
+ ~~~~~~~~~~~~~ - C0 - C00
+ ~ cgroup    ~      \ C01
+ ~ hierarchy ~
+ ~~~~~~~~~~~~~ - C1 - C10
+
+C0 and C1 are separate entities in terms of resource distribution
+regardless of their relative positions in the hierarchy. The
+resources the processes under C0 are entitled to are controlled by
+C0's ancestors and may be completely different from C1. It's clear
+that the intention of delegating C0 to U0 is allowing U0 to organize
+the processes under C0 and further control the distribution of C0's
+resources.
+
+On traditional hierarchies, if a task has write access to "tasks" or
+"cgroup.procs" file of a cgroup and its uid agrees with the target, it
+can move the target to the cgroup. In the above example, U0 will not
+only be able to move processes in each sub-hierarchy but also across
+the two sub-hierarchies, effectively allowing it to violate the
+organizational and resource restrictions implied by the hierarchical
+structure above C0 and C1.
+
+On the unified hierarchy, let's say U0 wants to write the pid of a
+process which has a matching uid and is currently in C10 into
+"C00/cgroup.procs". U0 obviously has write access to the file and
+migration permission on the process; however, the common ancestor of
+the source cgroup C10 and the destination cgroup C00 is above the
+points of delegation and U0 would not have write access to its
+"cgroup.procs" and thus be denied with -EACCES.
+
+5. Other Changes
+
+5-1. [Un]populated Notification

 cgroup users often need a way to determine when a cgroup's
 subhierarchy becomes empty so that it can be cleaned up. cgroup
@@ -289,7 +355,7 @@ supported and the interface files "release_agent" and
 "notify_on_release" do not exist.

-4-2. Other Core Changes
+5-2. Other Core Changes

 - None of the mount options is allowed.
@@ -306,14 +372,14 @@ supported and the interface files "release_agent" and
 - The "cgroup.clone_children" file is removed.

-4-3. Per-Controller Changes
+5-3. Per-Controller Changes

-4-3-1. blkio
+5-3-1. blkio

 - blk-throttle becomes properly hierarchical.

-4-3-2. cpuset
+5-3-2. cpuset

 - Tasks are kept in empty cpusets after hotplug and take on the masks
   of the nearest non-empty ancestor, instead of being moved to it.
@@ -322,7 +388,7 @@ supported and the interface files "release_agent" and
   masks of the nearest non-empty ancestor.

-4-3-3. memory
+5-3-3. memory

 - use_hierarchy is on by default and the cgroup file for the flag is
   not created.
@@ -407,9 +473,9 @@ supported and the interface files "release_agent" and
   memory.low, memory.high, and memory.max will use the string "max" to
   indicate and set the highest possible value.

-5. Planned Changes
+6. Planned Changes

-5-1. CAP for resource control
+6-1. CAP for resource control

 Unified hierarchy will require one of the capabilities(7), which is
 yet to be decided, for all resource control related knobs. Process
......
@@ -2816,6 +2816,7 @@ F: drivers/connector/
 CONTROL GROUP (CGROUP)
 M: Tejun Heo <tj@kernel.org>
 M: Li Zefan <lizefan@huawei.com>
+M: Johannes Weiner <hannes@cmpxchg.org>
 L: cgroups@vger.kernel.org
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git
 S: Maintained
......
@@ -76,7 +76,6 @@ extern struct kmem_cache *kernfs_node_cache;
 /*
  * inode.c
  */
-struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);
 void kernfs_evict_inode(struct inode *inode);
 int kernfs_iop_permission(struct inode *inode, int mask);
 int kernfs_iop_setattr(struct dentry *dentry, struct iattr *iattr);
......
@@ -25,13 +25,6 @@
 extern struct files_struct init_files;
 extern struct fs_struct init_fs;

-#ifdef CONFIG_CGROUPS
-#define INIT_GROUP_RWSEM(sig) \
-	.group_rwsem = __RWSEM_INITIALIZER(sig.group_rwsem),
-#else
-#define INIT_GROUP_RWSEM(sig)
-#endif
-
 #ifdef CONFIG_CPUSETS
 #define INIT_CPUSET_SEQ(tsk) \
 	.mems_allowed_seq = SEQCNT_ZERO(tsk.mems_allowed_seq),
@@ -55,7 +48,6 @@ extern struct fs_struct init_fs;
 	}, \
 	.cred_guard_mutex = \
 		__MUTEX_INITIALIZER(sig.cred_guard_mutex), \
-	INIT_GROUP_RWSEM(sig) \
 }

 extern struct nsproxy init_nsproxy;
......
@@ -277,6 +277,7 @@ void kernfs_put(struct kernfs_node *kn);

 struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry);
 struct kernfs_root *kernfs_root_from_sb(struct super_block *sb);
+struct inode *kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn);

 struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
 				       unsigned int flags, void *priv);
@@ -352,6 +353,10 @@ static inline struct kernfs_node *kernfs_node_from_dentry(struct dentry *dentry)
 static inline struct kernfs_root *kernfs_root_from_sb(struct super_block *sb)
 { return NULL; }

+static inline struct inode *
+kernfs_get_inode(struct super_block *sb, struct kernfs_node *kn)
+{ return NULL; }
+
 static inline struct kernfs_root *
 kernfs_create_root(struct kernfs_syscall_ops *scops, unsigned int flags,
 		   void *priv)
......
@@ -58,6 +58,7 @@ struct sched_param {
 #include <linux/uidgid.h>
 #include <linux/gfp.h>
 #include <linux/magic.h>
+#include <linux/cgroup-defs.h>

 #include <asm/processor.h>
@@ -755,18 +756,6 @@ struct signal_struct {
 	unsigned audit_tty_log_passwd;
 	struct tty_audit_buf *tty_audit_buf;
 #endif
-#ifdef CONFIG_CGROUPS
-	/*
-	 * group_rwsem prevents new tasks from entering the threadgroup and
-	 * member tasks from exiting,a more specifically, setting of
-	 * PF_EXITING. fork and exit paths are protected with this rwsem
-	 * using threadgroup_change_begin/end(). Users which require
-	 * threadgroup to remain stable should use threadgroup_[un]lock()
-	 * which also takes care of exec path. Currently, cgroup is the
-	 * only user.
-	 */
-	struct rw_semaphore group_rwsem;
-#endif

 	oom_flags_t oom_flags;
 	short oom_score_adj; /* OOM kill score adjustment */
@@ -2725,53 +2714,33 @@ static inline void unlock_task_sighand(struct task_struct *tsk,
 	spin_unlock_irqrestore(&tsk->sighand->siglock, *flags);
 }

-#ifdef CONFIG_CGROUPS
-static inline void threadgroup_change_begin(struct task_struct *tsk)
-{
-	down_read(&tsk->signal->group_rwsem);
-}
-static inline void threadgroup_change_end(struct task_struct *tsk)
-{
-	up_read(&tsk->signal->group_rwsem);
-}
-
 /**
- * threadgroup_lock - lock threadgroup
- * @tsk: member task of the threadgroup to lock
- *
- * Lock the threadgroup @tsk belongs to. No new task is allowed to enter
- * and member tasks aren't allowed to exit (as indicated by PF_EXITING) or
- * change ->group_leader/pid. This is useful for cases where the threadgroup
- * needs to stay stable across blockable operations.
- *
- * fork and exit paths explicitly call threadgroup_change_{begin|end}() for
- * synchronization. While held, no new task will be added to threadgroup
- * and no existing live task will have its PF_EXITING set.
+ * threadgroup_change_begin - mark the beginning of changes to a threadgroup
+ * @tsk: task causing the changes
  *
- * de_thread() does threadgroup_change_{begin|end}() when a non-leader
- * sub-thread becomes a new leader.
+ * All operations which modify a threadgroup - a new thread joining the
+ * group, death of a member thread (the assertion of PF_EXITING) and
+ * exec(2) dethreading the process and replacing the leader - are wrapped
+ * by threadgroup_change_{begin|end}(). This is to provide a place which
+ * subsystems needing threadgroup stability can hook into for
+ * synchronization.
  */
-static inline void threadgroup_lock(struct task_struct *tsk)
+static inline void threadgroup_change_begin(struct task_struct *tsk)
 {
-	down_write(&tsk->signal->group_rwsem);
+	might_sleep();
+	cgroup_threadgroup_change_begin(tsk);
 }

 /**
- * threadgroup_unlock - unlock threadgroup
- * @tsk: member task of the threadgroup to unlock
+ * threadgroup_change_end - mark the end of changes to a threadgroup
+ * @tsk: task causing the changes
  *
- * Reverse threadgroup_lock().
+ * See threadgroup_change_begin().
  */
-static inline void threadgroup_unlock(struct task_struct *tsk)
+static inline void threadgroup_change_end(struct task_struct *tsk)
 {
-	up_write(&tsk->signal->group_rwsem);
+	cgroup_threadgroup_change_end(tsk);
 }
-#else
-static inline void threadgroup_change_begin(struct task_struct *tsk) {}
-static inline void threadgroup_change_end(struct task_struct *tsk) {}
-static inline void threadgroup_lock(struct task_struct *tsk) {}
-static inline void threadgroup_unlock(struct task_struct *tsk) {}
-#endif

 #ifndef __HAVE_THREAD_FUNCTIONS
......
@@ -924,6 +924,7 @@ config NUMA_BALANCING_DEFAULT_ENABLED
 menuconfig CGROUPS
 	bool "Control Group support"
 	select KERNFS
+	select PERCPU_RWSEM
 	help
 	  This option adds support for grouping sets of processes together, for
 	  use with process control subsystems such as Cpusets, CFS, memory
......
@@ -1141,10 +1141,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
 	tty_audit_fork(sig);
 	sched_autogroup_fork(sig);

-#ifdef CONFIG_CGROUPS
-	init_rwsem(&sig->group_rwsem);
-#endif
-
 	sig->oom_score_adj = current->signal->oom_score_adj;
 	sig->oom_score_adj_min = current->signal->oom_score_adj_min;
......