Commit f7878dc3 authored by Linus Torvalds's avatar Linus Torvalds

Merge branch 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup updates from Tejun Heo:
 "Several noteworthy changes.

   - Parav's rdma controller is finally merged. It is very straight
     forward and can limit the abosolute numbers of common rdma
     constructs used by different cgroups.

   - kernel/cgroup.c got too chubby and disorganized. Created
     kernel/cgroup/ subdirectory and moved all cgroup related files
     under kernel/ there and reorganized the core code. This hurts for
     backporting patches but was long overdue.

   - cgroup v2 process listing reimplemented so that it no longer
     depends on allocating a buffer large enough to cache the entire
     result to sort and uniq the output. v2 has always mangled the sort
     order to ensure that users don't depend on the sorted output, so
     this shouldn't surprise anybody. This makes the pid listing
     functions use the same iterators that are used internally, which
     have to have the same iterating capabilities anyway.

   - perf cgroup filtering now works automatically on cgroup v2. This
     patch was posted a long time ago but somehow fell through the
     cracks.

   - misc fixes asnd documentation updates"

* 'for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (27 commits)
  kernfs: fix locking around kernfs_ops->release() callback
  cgroup: drop the matching uid requirement on migration for cgroup v2
  cgroup, perf_event: make perf_event controller work on cgroup2 hierarchy
  cgroup: misc cleanups
  cgroup: call subsys->*attach() only for subsystems which are actually affected by migration
  cgroup: track migration context in cgroup_mgctx
  cgroup: cosmetic update to cgroup_taskset_add()
  rdmacg: Fixed uninitialized current resource usage
  cgroup: Add missing cgroup-v2 PID controller documentation.
  rdmacg: Added documentation for rdmacg
  IB/core: added support to use rdma cgroup controller
  rdmacg: Added rdma cgroup controller
  cgroup: fix a comment typo
  cgroup: fix RCU related sparse warnings
  cgroup: move namespace code to kernel/cgroup/namespace.c
  cgroup: rename functions for consistency
  cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c
  cgroup: separate out cgroup1_kf_syscall_ops
  cgroup: refactor mount path and clearly distinguish v1 and v2 paths
  cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c
  ...
parents fb15a782 f83f3c51
RDMA Controller
----------------
Contents
--------
1. Overview
1-1. What is RDMA controller?
1-2. Why RDMA controller needed?
1-3. How is RDMA controller implemented?
2. Usage Examples
1. Overview
1-1. What is RDMA controller?
-----------------------------
RDMA controller allows user to limit RDMA/IB specific resources that a given
set of processes can use. These processes are grouped using RDMA controller.
RDMA controller defines two resources which can be limited for processes of a
cgroup.
1-2. Why RDMA controller needed?
--------------------------------
Currently user space applications can easily take away all the rdma verb
specific resources such as AH, CQ, QP, MR etc. Due to which other applications
in other cgroup or kernel space ULPs may not even get chance to allocate any
rdma resources. This can leads to service unavailability.
Therefore RDMA controller is needed through which resource consumption
of processes can be limited. Through this controller different rdma
resources can be accounted.
1-3. How is RDMA controller implemented?
----------------------------------------
RDMA cgroup allows limit configuration of resources. Rdma cgroup maintains
resource accounting per cgroup, per device using resource pool structure.
Each such resource pool is limited up to 64 resources in given resource pool
by rdma cgroup, which can be extended later if required.
This resource pool object is linked to the cgroup css. Typically there
are 0 to 4 resource pool instances per cgroup, per device in most use cases.
But nothing limits to have it more. At present hundreds of RDMA devices per
single cgroup may not be handled optimally, however there is no
known use case or requirement for such configuration either.
Since RDMA resources can be allocated from any process and can be freed by any
of the child processes which shares the address space, rdma resources are
always owned by the creator cgroup css. This allows process migration from one
to other cgroup without major complexity of transferring resource ownership;
because such ownership is not really present due to shared nature of
rdma resources. Linking resources around css also ensures that cgroups can be
deleted after processes migrated. This allow progress migration as well with
active resources, even though that is not a primary use case.
Whenever RDMA resource charging occurs, owner rdma cgroup is returned to
the caller. Same rdma cgroup should be passed while uncharging the resource.
This also allows process migrated with active RDMA resource to charge
to new owner cgroup for new resource. It also allows to uncharge resource of
a process from previously charged cgroup which is migrated to new cgroup,
even though that is not a primary use case.
Resource pool object is created in following situations.
(a) User sets the limit and no previous resource pool exist for the device
of interest for the cgroup.
(b) No resource limits were configured, but IB/RDMA stack tries to
charge the resource. So that it correctly uncharge them when applications are
running without limits and later on when limits are enforced during uncharging,
otherwise usage count will drop to negative.
Resource pool is destroyed if all the resource limits are set to max and
it is the last resource getting deallocated.
User should set all the limit to max value if it intents to remove/unconfigure
the resource pool for a particular device.
IB stack honors limits enforced by the rdma controller. When application
query about maximum resource limits of IB device, it returns minimum of
what is configured by user for a given cgroup and what is supported by
IB device.
Following resources can be accounted by rdma controller.
hca_handle Maximum number of HCA Handles
hca_object Maximum number of HCA Objects
2. Usage Examples
-----------------
(a) Configure resource limit:
echo mlx4_0 hca_handle=2 hca_object=2000 > /sys/fs/cgroup/rdma/1/rdma.max
echo ocrdma1 hca_handle=3 > /sys/fs/cgroup/rdma/2/rdma.max
(b) Query resource limit:
cat /sys/fs/cgroup/rdma/2/rdma.max
#Output:
mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max
(c) Query current usage:
cat /sys/fs/cgroup/rdma/2/rdma.current
#Output:
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
(d) Delete resource limit:
echo echo mlx4_0 hca_handle=max hca_object=max > /sys/fs/cgroup/rdma/1/rdma.max
......@@ -47,6 +47,12 @@ CONTENTS
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
5-4. PID
5-4-1. PID Interface Files
5-5. RDMA
5-5-1. RDMA Interface Files
5-6. Misc
5-6-1. perf_event
6. Namespace
6-1. Basics
6-2. The Root and Views
......@@ -328,14 +334,12 @@ a process with a non-root euid to migrate a target process into a
cgroup by writing its PID to the "cgroup.procs" file, the following
conditions must be met.
- The writer's euid must match either uid or suid of the target process.
- The writer must have write access to the "cgroup.procs" file.
- The writer must have write access to the "cgroup.procs" file of the
common ancestor of the source and destination cgroups.
The above three constraints ensure that while a delegatee may migrate
The above two constraints ensure that while a delegatee may migrate
processes around freely in the delegated sub-hierarchy it can't pull
in from or push out to outside the sub-hierarchy.
......@@ -350,10 +354,10 @@ all processes under C0 and C1 belong to U0.
Let's also say U0 wants to write the PID of a process which is
currently in C10 into "C00/cgroup.procs". U0 has write access to the
file and uid match on the process; however, the common ancestor of the
source cgroup C10 and the destination cgroup C00 is above the points
of delegation and U0 would not have write access to its "cgroup.procs"
files and thus the write will be denied with -EACCES.
file; however, the common ancestor of the source cgroup C10 and the
destination cgroup C00 is above the points of delegation and U0 would
not have write access to its "cgroup.procs" files and thus the write
will be denied with -EACCES.
2-6. Guidelines
......@@ -1119,6 +1123,91 @@ writeback as follows.
vm.dirty[_background]_ratio.
5-4. PID
The process number controller is used to allow a cgroup to stop any
new tasks from being fork()'d or clone()'d after a specified limit is
reached.
The number of tasks in a cgroup can be exhausted in ways which other
controllers cannot prevent, thus warranting its own controller. For
example, a fork bomb is likely to exhaust the number of tasks before
hitting memory restrictions.
Note that PIDs used in this controller refer to TIDs, process IDs as
used by the kernel.
5-4-1. PID Interface Files
pids.max
A read-write single value file which exists on non-root cgroups. The
default is "max".
Hard limit of number of processes.
pids.current
A read-only single value file which exists on all cgroups.
The number of processes currently in the cgroup and its descendants.
Organisational operations are not blocked by cgroup policies, so it is
possible to have pids.current > pids.max. This can be done by either
setting the limit to be smaller than pids.current, or attaching enough
processes to the cgroup such that pids.current is larger than
pids.max. However, it is not possible to violate a cgroup PID policy
through fork() or clone(). These will return -EAGAIN if the creation
of a new process would cause a cgroup policy to be violated.
5-5. RDMA
The "rdma" controller regulates the distribution and accounting of
of RDMA resources.
5-5-1. RDMA Interface Files
rdma.max
A readwrite nested-keyed file that exists for all the cgroups
except root that describes current configured resource limit
for a RDMA/IB device.
Lines are keyed by device name and are not ordered.
Each line contains space separated resource name and its configured
limit that can be distributed.
The following nested keys are defined.
hca_handle Maximum number of HCA Handles
hca_object Maximum number of HCA Objects
An example for mlx4 and ocrdma device follows.
mlx4_0 hca_handle=2 hca_object=2000
ocrdma1 hca_handle=3 hca_object=max
rdma.current
A read-only file that describes current resource usage.
It exists for all the cgroup except root.
An example for mlx4 and ocrdma device follows.
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
5-6. Misc
5-6-1. perf_event
perf_event controller, if not mounted on a legacy hierarchy, is
automatically enabled on the v2 hierarchy so that perf events can
always be filtered by cgroup v2 path. The controller can still be
moved to a legacy hierarchy after v2 hierarchy is populated.
6. Namespace
6-1. Basics
......
......@@ -13,6 +13,7 @@ ib_core-y := packer.o ud_header.o verbs.o cq.o rw.o sysfs.o \
multicast.o mad.o smi.o agent.o mad_rmpp.o
ib_core-$(CONFIG_INFINIBAND_USER_MEM) += umem.o
ib_core-$(CONFIG_INFINIBAND_ON_DEMAND_PAGING) += umem_odp.o umem_rbtree.o
ib_core-$(CONFIG_CGROUP_RDMA) += cgroup.o
ib_cm-y := cm.o
......
/*
* Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#include "core_priv.h"
/**
* ib_device_register_rdmacg - register with rdma cgroup.
* @device: device to register to participate in resource
* accounting by rdma cgroup.
*
* Register with the rdma cgroup. Should be called before
* exposing rdma device to user space applications to avoid
* resource accounting leak.
* Returns 0 on success or otherwise failure code.
*/
int ib_device_register_rdmacg(struct ib_device *device)
{
device->cg_device.name = device->name;
return rdmacg_register_device(&device->cg_device);
}
/**
* ib_device_unregister_rdmacg - unregister with rdma cgroup.
* @device: device to unregister.
*
* Unregister with the rdma cgroup. Should be called after
* all the resources are deallocated, and after a stage when any
* other resource allocation by user application cannot be done
* for this device to avoid any leak in accounting.
*/
void ib_device_unregister_rdmacg(struct ib_device *device)
{
rdmacg_unregister_device(&device->cg_device);
}
int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
struct ib_device *device,
enum rdmacg_resource_type resource_index)
{
return rdmacg_try_charge(&cg_obj->cg, &device->cg_device,
resource_index);
}
EXPORT_SYMBOL(ib_rdmacg_try_charge);
void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
struct ib_device *device,
enum rdmacg_resource_type resource_index)
{
rdmacg_uncharge(cg_obj->cg, &device->cg_device,
resource_index);
}
EXPORT_SYMBOL(ib_rdmacg_uncharge);
......@@ -35,6 +35,7 @@
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/cgroup_rdma.h>
#include <rdma/ib_verbs.h>
......@@ -124,6 +125,35 @@ int ib_cache_setup_one(struct ib_device *device);
void ib_cache_cleanup_one(struct ib_device *device);
void ib_cache_release_one(struct ib_device *device);
#ifdef CONFIG_CGROUP_RDMA
int ib_device_register_rdmacg(struct ib_device *device);
void ib_device_unregister_rdmacg(struct ib_device *device);
int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
struct ib_device *device,
enum rdmacg_resource_type resource_index);
void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
struct ib_device *device,
enum rdmacg_resource_type resource_index);
#else
static inline int ib_device_register_rdmacg(struct ib_device *device)
{ return 0; }
static inline void ib_device_unregister_rdmacg(struct ib_device *device)
{ }
static inline int ib_rdmacg_try_charge(struct ib_rdmacg_object *cg_obj,
struct ib_device *device,
enum rdmacg_resource_type resource_index)
{ return 0; }
static inline void ib_rdmacg_uncharge(struct ib_rdmacg_object *cg_obj,
struct ib_device *device,
enum rdmacg_resource_type resource_index)
{ }
#endif
static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
struct net_device *upper)
{
......
......@@ -369,10 +369,18 @@ int ib_register_device(struct ib_device *device,
goto out;
}
ret = ib_device_register_rdmacg(device);
if (ret) {
pr_warn("Couldn't register device with rdma cgroup\n");
ib_cache_cleanup_one(device);
goto out;
}
memset(&device->attrs, 0, sizeof(device->attrs));
ret = device->query_device(device, &device->attrs, &uhw);
if (ret) {
pr_warn("Couldn't query the device attributes\n");
ib_device_unregister_rdmacg(device);
ib_cache_cleanup_one(device);
goto out;
}
......@@ -381,6 +389,7 @@ int ib_register_device(struct ib_device *device,
if (ret) {
pr_warn("Couldn't register device %s with driver model\n",
device->name);
ib_device_unregister_rdmacg(device);
ib_cache_cleanup_one(device);
goto out;
}
......@@ -430,6 +439,7 @@ void ib_unregister_device(struct ib_device *device)
mutex_unlock(&device_mutex);
ib_device_unregister_rdmacg(device);
ib_device_unregister_sysfs(device);
ib_cache_cleanup_one(device);
......
......@@ -316,6 +316,7 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
struct ib_udata udata;
struct ib_ucontext *ucontext;
struct file *filp;
struct ib_rdmacg_object cg_obj;
int ret;
if (out_len < sizeof resp)
......@@ -335,13 +336,18 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
(unsigned long) cmd.response + sizeof resp,
in_len - sizeof cmd, out_len - sizeof resp);
ret = ib_rdmacg_try_charge(&cg_obj, ib_dev, RDMACG_RESOURCE_HCA_HANDLE);
if (ret)
goto err;
ucontext = ib_dev->alloc_ucontext(ib_dev, &udata);
if (IS_ERR(ucontext)) {
ret = PTR_ERR(ucontext);
goto err;
goto err_alloc;
}
ucontext->device = ib_dev;
ucontext->cg_obj = cg_obj;
INIT_LIST_HEAD(&ucontext->pd_list);
INIT_LIST_HEAD(&ucontext->mr_list);
INIT_LIST_HEAD(&ucontext->mw_list);
......@@ -407,6 +413,9 @@ ssize_t ib_uverbs_get_context(struct ib_uverbs_file *file,
put_pid(ucontext->tgid);
ib_dev->dealloc_ucontext(ucontext);
err_alloc:
ib_rdmacg_uncharge(&cg_obj, ib_dev, RDMACG_RESOURCE_HCA_HANDLE);
err:
mutex_unlock(&file->mutex);
return ret;
......@@ -561,6 +570,13 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
return -ENOMEM;
init_uobj(uobj, 0, file->ucontext, &pd_lock_class);
ret = ib_rdmacg_try_charge(&uobj->cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
if (ret) {
kfree(uobj);
return ret;
}
down_write(&uobj->mutex);
pd = ib_dev->alloc_pd(ib_dev, file->ucontext, &udata);
......@@ -605,6 +621,7 @@ ssize_t ib_uverbs_alloc_pd(struct ib_uverbs_file *file,
ib_dealloc_pd(pd);
err:
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
put_uobj_write(uobj);
return ret;
}
......@@ -637,6 +654,8 @@ ssize_t ib_uverbs_dealloc_pd(struct ib_uverbs_file *file,
if (ret)
goto err_put;
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
uobj->live = 0;
put_uobj_write(uobj);
......@@ -1006,6 +1025,10 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
goto err_put;
}
}
ret = ib_rdmacg_try_charge(&uobj->cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
if (ret)
goto err_charge;
mr = pd->device->reg_user_mr(pd, cmd.start, cmd.length, cmd.hca_va,
cmd.access_flags, &udata);
......@@ -1054,6 +1077,9 @@ ssize_t ib_uverbs_reg_mr(struct ib_uverbs_file *file,
ib_dereg_mr(mr);
err_put:
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
err_charge:
put_pd_read(pd);
err_free:
......@@ -1178,6 +1204,8 @@ ssize_t ib_uverbs_dereg_mr(struct ib_uverbs_file *file,
if (ret)
return ret;
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
mutex_lock(&file->mutex);
......@@ -1226,6 +1254,11 @@ ssize_t ib_uverbs_alloc_mw(struct ib_uverbs_file *file,
in_len - sizeof(cmd) - sizeof(struct ib_uverbs_cmd_hdr),
out_len - sizeof(resp));
ret = ib_rdmacg_try_charge(&uobj->cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
if (ret)
goto err_charge;
mw = pd->device->alloc_mw(pd, cmd.mw_type, &udata);
if (IS_ERR(mw)) {
ret = PTR_ERR(mw);
......@@ -1271,6 +1304,9 @@ ssize_t ib_uverbs_alloc_mw(struct ib_uverbs_file *file,
uverbs_dealloc_mw(mw);
err_put:
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
err_charge:
put_pd_read(pd);
err_free:
......@@ -1306,6 +1342,8 @@ ssize_t ib_uverbs_dealloc_mw(struct ib_uverbs_file *file,
if (ret)
return ret;
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
idr_remove_uobj(&ib_uverbs_mw_idr, uobj);
mutex_lock(&file->mutex);
......@@ -1405,6 +1443,11 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
if (cmd_sz > offsetof(typeof(*cmd), flags) + sizeof(cmd->flags))
attr.flags = cmd->flags;
ret = ib_rdmacg_try_charge(&obj->uobject.cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
if (ret)
goto err_charge;
cq = ib_dev->create_cq(ib_dev, &attr,
file->ucontext, uhw);
if (IS_ERR(cq)) {
......@@ -1452,6 +1495,10 @@ static struct ib_ucq_object *create_cq(struct ib_uverbs_file *file,
ib_destroy_cq(cq);
err_file:
ib_rdmacg_uncharge(&obj->uobject.cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
err_charge:
if (ev_file)
ib_uverbs_release_ucq(file, ev_file, obj);
......@@ -1732,6 +1779,8 @@ ssize_t ib_uverbs_destroy_cq(struct ib_uverbs_file *file,
if (ret)
return ret;
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
idr_remove_uobj(&ib_uverbs_cq_idr, uobj);
mutex_lock(&file->mutex);
......@@ -1905,6 +1954,11 @@ static int create_qp(struct ib_uverbs_file *file,
goto err_put;
}
ret = ib_rdmacg_try_charge(&obj->uevent.uobject.cg_obj, device,
RDMACG_RESOURCE_HCA_OBJECT);
if (ret)
goto err_put;
if (cmd->qp_type == IB_QPT_XRC_TGT)
qp = ib_create_qp(pd, &attr);
else
......@@ -1912,7 +1966,7 @@ static int create_qp(struct ib_uverbs_file *file,
if (IS_ERR(qp)) {
ret = PTR_ERR(qp);
goto err_put;
goto err_create;
}
if (cmd->qp_type != IB_QPT_XRC_TGT) {
......@@ -1993,6 +2047,10 @@ static int create_qp(struct ib_uverbs_file *file,
err_destroy:
ib_destroy_qp(qp);
err_create:
ib_rdmacg_uncharge(&obj->uevent.uobject.cg_obj, device,
RDMACG_RESOURCE_HCA_OBJECT);
err_put:
if (xrcd)
put_xrcd_read(xrcd_uobj);
......@@ -2519,6 +2577,8 @@ ssize_t ib_uverbs_destroy_qp(struct ib_uverbs_file *file,
if (ret)
return ret;
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
if (obj->uxrcd)
atomic_dec(&obj->uxrcd->refcnt);
......@@ -2970,11 +3030,16 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
memset(&attr.dmac, 0, sizeof(attr.dmac));
memcpy(attr.grh.dgid.raw, cmd.attr.grh.dgid, 16);
ret = ib_rdmacg_try_charge(&uobj->cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
if (ret)
goto err_charge;
ah = pd->device->create_ah(pd, &attr, &udata);
if (IS_ERR(ah)) {
ret = PTR_ERR(ah);
goto err_put;
goto err_create;
}
ah->device = pd->device;
......@@ -3013,7 +3078,10 @@ ssize_t ib_uverbs_create_ah(struct ib_uverbs_file *file,
err_destroy:
ib_destroy_ah(ah);
err_put:
err_create:
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
err_charge:
put_pd_read(pd);
err:
......@@ -3047,6 +3115,8 @@ ssize_t ib_uverbs_destroy_ah(struct ib_uverbs_file *file,
if (ret)
return ret;
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
mutex_lock(&file->mutex);
......@@ -3861,10 +3931,16 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file,
err = -EINVAL;
goto err_free;
}
err = ib_rdmacg_try_charge(&uobj->cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
if (err)
goto err_free;
flow_id = ib_create_flow(qp, flow_attr, IB_FLOW_DOMAIN_USER);
if (IS_ERR(flow_id)) {
err = PTR_ERR(flow_id);
goto err_free;
goto err_create;
}
flow_id->uobject = uobj;
uobj->object = flow_id;
......@@ -3897,6 +3973,8 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
destroy_flow:
ib_destroy_flow(flow_id);
err_create:
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
err_free:
kfree(flow_attr);
err_put:
......@@ -3936,8 +4014,11 @@ int ib_uverbs_ex_destroy_flow(struct ib_uverbs_file *file,
flow_id = uobj->object;
ret = ib_destroy_flow(flow_id);
if (!ret)
if (!ret) {
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
uobj->live = 0;
}
put_uobj_write(uobj);
......@@ -4005,6 +4086,11 @@ static int __uverbs_create_xsrq(struct ib_uverbs_file *file,
obj->uevent.events_reported = 0;
INIT_LIST_HEAD(&obj->uevent.event_list);
ret = ib_rdmacg_try_charge(&obj->uevent.uobject.cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
if (ret)
goto err_put_cq;
srq = pd->device->create_srq(pd, &attr, udata);
if (IS_ERR(srq)) {
ret = PTR_ERR(srq);
......@@ -4069,6 +4155,8 @@ static int __uverbs_create_xsrq(struct ib_uverbs_file *file,
ib_destroy_srq(srq);
err_put:
ib_rdmacg_uncharge(&obj->uevent.uobject.cg_obj, ib_dev,
RDMACG_RESOURCE_HCA_OBJECT);
put_pd_read(pd);
err_put_cq:
......@@ -4255,6 +4343,8 @@ ssize_t ib_uverbs_destroy_srq(struct ib_uverbs_file *file,
if (ret)
return ret;
ib_rdmacg_uncharge(&uobj->cg_obj, ib_dev, RDMACG_RESOURCE_HCA_OBJECT);
if (srq_type == IB_SRQT_XRC) {
us = container_of(obj, struct ib_usrq_object, uevent);
atomic_dec(&us->uxrcd->refcnt);
......
......@@ -51,6 +51,7 @@
#include <rdma/ib.h>
#include "uverbs.h"
#include "core_priv.h"
MODULE_AUTHOR("Roland Dreier");
MODULE_DESCRIPTION("InfiniBand userspace verbs access");
......@@ -237,6 +238,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_ah_idr, uobj);
ib_destroy_ah(ah);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
kfree(uobj);
}
......@@ -246,6 +249,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_mw_idr, uobj);
uverbs_dealloc_mw(mw);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
kfree(uobj);
}
......@@ -254,6 +259,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_rule_idr, uobj);
ib_destroy_flow(flow_id);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
kfree(uobj);
}
......@@ -266,6 +273,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
if (qp == qp->real_qp)
ib_uverbs_detach_umcast(qp, uqp);
ib_destroy_qp(qp);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
ib_uverbs_release_uevent(file, &uqp->uevent);
kfree(uqp);
}
......@@ -298,6 +307,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_srq_idr, uobj);
ib_destroy_srq(srq);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
ib_uverbs_release_uevent(file, uevent);
kfree(uevent);
}
......@@ -310,6 +321,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_cq_idr, uobj);
ib_destroy_cq(cq);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
ib_uverbs_release_ucq(file, ev_file, ucq);
kfree(ucq);
}
......@@ -319,6 +332,8 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_mr_idr, uobj);
ib_dereg_mr(mr);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
kfree(uobj);
}
......@@ -339,11 +354,16 @@ static int ib_uverbs_cleanup_ucontext(struct ib_uverbs_file *file,
idr_remove_uobj(&ib_uverbs_pd_idr, uobj);
ib_dealloc_pd(pd);
ib_rdmacg_uncharge(&uobj->cg_obj, context->device,
RDMACG_RESOURCE_HCA_OBJECT);
kfree(uobj);
}
put_pid(context->tgid);
ib_rdmacg_uncharge(&context->cg_obj, context->device,
RDMACG_RESOURCE_HCA_HANDLE);
return context->device->dealloc_ucontext(context);
}
......
......@@ -478,7 +478,7 @@ static void kernfs_drain(struct kernfs_node *kn)
rwsem_release(&kn->dep_map, 1, _RET_IP_);
}
kernfs_unmap_bin_file(kn);
kernfs_drain_open_files(kn);
mutex_lock(&kernfs_mutex);
}
......
......@@ -515,7 +515,7 @@ static int kernfs_fop_mmap(struct file *file, struct vm_area_struct *vma)
goto out_put;
rc = 0;
of->mmapped = 1;
of->mmapped = true;
of->vm_ops = vma->vm_ops;
vma->vm_ops = &kernfs_vm_ops;
out_put:
......@@ -707,7 +707,8 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
if (error)
goto err_free;
((struct seq_file *)file->private_data)->private = of;
of->seq_file = file->private_data;
of->seq_file->private = of;
/* seq_file clears PWRITE unconditionally, restore it if WRITE */
if (file->f_mode & FMODE_WRITE)
......@@ -716,13 +717,22 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
/* make sure we have open node struct */
error = kernfs_get_open_node(kn, of);
if (error)
goto err_close;
goto err_seq_release;
if (ops->open) {
/* nobody has access to @of yet, skip @of->mutex */
error = ops->open(of);
if (error)
goto err_put_node;
}
/* open succeeded, put active references */
kernfs_put_active(kn);
return 0;
err_close:
err_put_node:
kernfs_put_open_node(kn, of);
err_seq_release:
seq_release(inode, file);
err_free:
kfree(of->prealloc_buf);
......@@ -732,11 +742,41 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
return error;
}
/* used from release/drain to ensure that ->release() is called exactly once */
static void kernfs_release_file(struct kernfs_node *kn,
struct kernfs_open_file *of)
{
/*
* @of is guaranteed to have no other file operations in flight and
* we just want to synchronize release and drain paths.
* @kernfs_open_file_mutex is enough. @of->mutex can't be used
* here because drain path may be called from places which can
* cause circular dependency.
*/
lockdep_assert_held(&kernfs_open_file_mutex);
if (!of->released) {
/*
* A file is never detached without being released and we
* need to be able to release files which are deactivated
* and being drained. Don't use kernfs_ops().
*/
kn->attr.ops->release(of);
of->released = true;
}
}
static int kernfs_fop_release(struct inode *inode, struct file *filp)
{
struct kernfs_node *kn = filp->f_path.dentry->d_fsdata;
struct kernfs_open_file *of = kernfs_of(filp);
if (kn->flags & KERNFS_HAS_RELEASE) {
mutex_lock(&kernfs_open_file_mutex);
kernfs_release_file(kn, of);
mutex_unlock(&kernfs_open_file_mutex);
}
kernfs_put_open_node(kn, of);
seq_release(inode, filp);
kfree(of->prealloc_buf);
......@@ -745,12 +785,12 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
return 0;
}
void kernfs_unmap_bin_file(struct kernfs_node *kn)
void kernfs_drain_open_files(struct kernfs_node *kn)
{
struct kernfs_open_node *on;
struct kernfs_open_file *of;
if (!(kn->flags & KERNFS_HAS_MMAP))
if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
return;
spin_lock_irq(&kernfs_open_node_lock);
......@@ -762,10 +802,16 @@ void kernfs_unmap_bin_file(struct kernfs_node *kn)
return;
mutex_lock(&kernfs_open_file_mutex);
list_for_each_entry(of, &on->files, list) {
struct inode *inode = file_inode(of->file);
unmap_mapping_range(inode->i_mapping, 0, 0, 1);
if (kn->flags & KERNFS_HAS_MMAP)
unmap_mapping_range(inode->i_mapping, 0, 0, 1);
kernfs_release_file(kn, of);
}
mutex_unlock(&kernfs_open_file_mutex);
kernfs_put_open_node(kn, NULL);
......@@ -964,6 +1010,8 @@ struct kernfs_node *__kernfs_create_file(struct kernfs_node *parent,
kn->flags |= KERNFS_HAS_SEQ_SHOW;
if (ops->mmap)
kn->flags |= KERNFS_HAS_MMAP;
if (ops->release)
kn->flags |= KERNFS_HAS_RELEASE;
rc = kernfs_add_one(kn);
if (rc) {
......
......@@ -104,7 +104,7 @@ struct kernfs_node *kernfs_new_node(struct kernfs_node *parent,
*/
extern const struct file_operations kernfs_file_fops;
void kernfs_unmap_bin_file(struct kernfs_node *kn);
void kernfs_drain_open_files(struct kernfs_node *kn);
/*
* symlink.c
......
......@@ -148,14 +148,18 @@ struct cgroup_subsys_state {
* set for a task.
*/
struct css_set {
/* Reference count */
atomic_t refcount;
/*
* List running through all cgroup groups in the same hash
* slot. Protected by css_set_lock
* Set of subsystem states, one for each subsystem. This array is
* immutable after creation apart from the init_css_set during
* subsystem registration (at boot time).
*/
struct hlist_node hlist;
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
/* reference count */
atomic_t refcount;
/* the default cgroup associated with this css_set */
struct cgroup *dfl_cgrp;
/*
* Lists running through all tasks using this cgroup group.
......@@ -167,21 +171,29 @@ struct css_set {
struct list_head tasks;
struct list_head mg_tasks;
/* all css_task_iters currently walking this cset */
struct list_head task_iters;
/*
* List of cgrp_cset_links pointing at cgroups referenced from this
* css_set. Protected by css_set_lock.
* On the default hierarhcy, ->subsys[ssid] may point to a css
* attached to an ancestor instead of the cgroup this css_set is
* associated with. The following node is anchored at
* ->subsys[ssid]->cgroup->e_csets[ssid] and provides a way to
* iterate through all css's attached to a given cgroup.
*/
struct list_head cgrp_links;
struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
/* the default cgroup associated with this css_set */
struct cgroup *dfl_cgrp;
/*
* List running through all cgroup groups in the same hash
* slot. Protected by css_set_lock
*/
struct hlist_node hlist;
/*
* Set of subsystem states, one for each subsystem. This array is
* immutable after creation apart from the init_css_set during
* subsystem registration (at boot time).
* List of cgrp_cset_links pointing at cgroups referenced from this
* css_set. Protected by css_set_lock.
*/
struct cgroup_subsys_state *subsys[CGROUP_SUBSYS_COUNT];
struct list_head cgrp_links;
/*
* List of csets participating in the on-going migration either as
......@@ -201,18 +213,6 @@ struct css_set {
struct cgroup *mg_dst_cgrp;
struct css_set *mg_dst_cset;
/*
* On the default hierarhcy, ->subsys[ssid] may point to a css
* attached to an ancestor instead of the cgroup this css_set is
* associated with. The following node is anchored at
* ->subsys[ssid]->cgroup->e_csets[ssid] and provides a way to
* iterate through all css's attached to a given cgroup.
*/
struct list_head e_cset_node[CGROUP_SUBSYS_COUNT];
/* all css_task_iters currently walking this cset */
struct list_head task_iters;
/* dead and being drained, ignore for migration */
bool dead;
......@@ -388,6 +388,9 @@ struct cftype {
struct list_head node; /* anchored at ss->cfts */
struct kernfs_ops *kf_ops;
int (*open)(struct kernfs_open_file *of);
void (*release)(struct kernfs_open_file *of);
/*
* read_u64() is a shortcut for the common case of returning a
* single integer. Use it in place of read()
......
......@@ -266,7 +266,7 @@ void css_task_iter_end(struct css_task_iter *it);
* cgroup_taskset_for_each_leader - iterate group leaders in a cgroup_taskset
* @leader: the loop cursor
* @dst_css: the destination css
* @tset: takset to iterate
* @tset: taskset to iterate
*
* Iterate threadgroup leaders of @tset. For single-task migrations, @tset
* may not contain any.
......
/*
* Copyright (C) 2016 Parav Pandit <pandit.parav@gmail.com>
*
* This file is subject to the terms and conditions of version 2 of the GNU
* General Public License. See the file COPYING in the main directory of the
* Linux distribution for more details.
*/
#ifndef _CGROUP_RDMA_H
#define _CGROUP_RDMA_H
#include <linux/cgroup.h>
enum rdmacg_resource_type {
RDMACG_RESOURCE_HCA_HANDLE,
RDMACG_RESOURCE_HCA_OBJECT,
RDMACG_RESOURCE_MAX,
};
#ifdef CONFIG_CGROUP_RDMA
struct rdma_cgroup {
struct cgroup_subsys_state css;
/*
* head to keep track of all resource pools
* that belongs to this cgroup.
*/
struct list_head rpools;
};
struct rdmacg_device {
struct list_head dev_node;
struct list_head rpools;
char *name;
};
/*
* APIs for RDMA/IB stack to publish when a device wants to
* participate in resource accounting
*/
int rdmacg_register_device(struct rdmacg_device *device);
void rdmacg_unregister_device(struct rdmacg_device *device);
/* APIs for RDMA/IB stack to charge/uncharge pool specific resources */
int rdmacg_try_charge(struct rdma_cgroup **rdmacg,
struct rdmacg_device *device,
enum rdmacg_resource_type index);
void rdmacg_uncharge(struct rdma_cgroup *cg,
struct rdmacg_device *device,
enum rdmacg_resource_type index);
#endif /* CONFIG_CGROUP_RDMA */
#endif /* _CGROUP_RDMA_H */
......@@ -56,6 +56,10 @@ SUBSYS(hugetlb)
SUBSYS(pids)
#endif
#if IS_ENABLED(CONFIG_CGROUP_RDMA)
SUBSYS(rdma)
#endif
/*
* The following subsystems are not supported on the default hierarchy.
*/
......
......@@ -46,6 +46,7 @@ enum kernfs_node_flag {
KERNFS_SUICIDAL = 0x0400,
KERNFS_SUICIDED = 0x0800,
KERNFS_EMPTY_DIR = 0x1000,
KERNFS_HAS_RELEASE = 0x2000,
};
/* @flags for kernfs_create_root() */
......@@ -175,6 +176,7 @@ struct kernfs_open_file {
/* published fields */
struct kernfs_node *kn;
struct file *file;
struct seq_file *seq_file;
void *priv;
/* private fields, do not use outside kernfs proper */
......@@ -185,11 +187,19 @@ struct kernfs_open_file {
char *prealloc_buf;
size_t atomic_write_len;
bool mmapped;
bool mmapped:1;
bool released:1;
const struct vm_operations_struct *vm_ops;
};
struct kernfs_ops {
/*
* Optional open/release methods. Both are called with
* @of->seq_file populated.
*/
int (*open)(struct kernfs_open_file *of);
void (*release)(struct kernfs_open_file *of);
/*
* Read is handled by either seq_file or raw_read().
*
......
......@@ -60,6 +60,7 @@
#include <linux/atomic.h>
#include <linux/mmu_notifier.h>
#include <linux/uaccess.h>
#include <linux/cgroup_rdma.h>
extern struct workqueue_struct *ib_wq;
extern struct workqueue_struct *ib_comp_wq;
......@@ -1356,6 +1357,12 @@ struct ib_fmr_attr {
struct ib_umem;
struct ib_rdmacg_object {
#ifdef CONFIG_CGROUP_RDMA
struct rdma_cgroup *cg; /* owner rdma cgroup */
#endif
};
struct ib_ucontext {
struct ib_device *device;
struct list_head pd_list;
......@@ -1388,6 +1395,8 @@ struct ib_ucontext {
struct list_head no_private_counters;
int odp_mrs_count;
#endif
struct ib_rdmacg_object cg_obj;
};
struct ib_uobject {
......@@ -1395,6 +1404,7 @@ struct ib_uobject {
struct ib_ucontext *context; /* associated user context */
void *object; /* containing object */
struct list_head list; /* link to context's list */
struct ib_rdmacg_object cg_obj; /* rdmacg object */
int id; /* index into kernel idr */
struct kref ref;
struct rw_semaphore mutex; /* protects .live */
......@@ -2128,6 +2138,10 @@ struct ib_device {
struct attribute_group *hw_stats_ag;
struct rdma_hw_stats *hw_stats;
#ifdef CONFIG_CGROUP_RDMA
struct rdmacg_device cg_device;
#endif
/**
* The following mandatory functions are used only at device
* registration. Keep functions such as these at the end of this
......
......@@ -1078,6 +1078,16 @@ config CGROUP_PIDS
since the PIDs limit only affects a process's ability to fork, not to
attach to a cgroup.
config CGROUP_RDMA
bool "RDMA controller"
help
Provides enforcement of RDMA resources defined by IB stack.
It is fairly easy for consumers to exhaust RDMA resources, which
can result into resource unavailability to other consumers.
RDMA controller is designed to stop this from happening.
Attaching processes with active RDMA resources to the cgroup
hierarchy is allowed even if can cross the hierarchy's limit.
config CGROUP_FREEZER
bool "Freezer controller"
help
......
......@@ -64,10 +64,7 @@ obj-$(CONFIG_KEXEC) += kexec.o
obj-$(CONFIG_KEXEC_FILE) += kexec_file.o
obj-$(CONFIG_BACKTRACE_SELF_TEST) += backtracetest.o
obj-$(CONFIG_COMPAT) += compat.o
obj-$(CONFIG_CGROUPS) += cgroup.o
obj-$(CONFIG_CGROUP_FREEZER) += cgroup_freezer.o
obj-$(CONFIG_CGROUP_PIDS) += cgroup_pids.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUPS) += cgroup/
obj-$(CONFIG_UTS_NS) += utsname.o
obj-$(CONFIG_USER_NS) += user_namespace.o
obj-$(CONFIG_PID_NS) += pid_namespace.o
......
obj-y := cgroup.o namespace.o cgroup-v1.o
obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
obj-$(CONFIG_CGROUP_PIDS) += pids.o
obj-$(CONFIG_CGROUP_RDMA) += rdma.o
obj-$(CONFIG_CPUSETS) += cpuset.o
#ifndef __CGROUP_INTERNAL_H
#define __CGROUP_INTERNAL_H
#include <linux/cgroup.h>
#include <linux/kernfs.h>
#include <linux/workqueue.h>
#include <linux/list.h>
/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
* direction, a css_set is naturally associated with multiple cgroups.
* This M:N relationship is represented by the following link structure
* which exists for each association and allows traversing the associations
* from both sides.
*/
struct cgrp_cset_link {
/* the cgroup and css_set this link associates */
struct cgroup *cgrp;
struct css_set *cset;
/* list of cgrp_cset_links anchored at cgrp->cset_links */
struct list_head cset_link;
/* list of cgrp_cset_links anchored at css_set->cgrp_links */
struct list_head cgrp_link;
};
/* used to track tasks and csets during migration */
struct cgroup_taskset {
/* the src and dst cset list running through cset->mg_node */
struct list_head src_csets;
struct list_head dst_csets;
/* the subsys currently being processed */
int ssid;
/*
* Fields for cgroup_taskset_*() iteration.
*
* Before migration is committed, the target migration tasks are on
* ->mg_tasks of the csets on ->src_csets. After, on ->mg_tasks of
* the csets on ->dst_csets. ->csets point to either ->src_csets
* or ->dst_csets depending on whether migration is committed.
*
* ->cur_csets and ->cur_task point to the current task position
* during iteration.
*/
struct list_head *csets;
struct css_set *cur_cset;
struct task_struct *cur_task;
};
/* migration context also tracks preloading */
struct cgroup_mgctx {
/*
* Preloaded source and destination csets. Used to guarantee
* atomic success or failure on actual migration.
*/
struct list_head preloaded_src_csets;
struct list_head preloaded_dst_csets;
/* tasks and csets to migrate */
struct cgroup_taskset tset;
/* subsystems affected by migration */
u16 ss_mask;
};
#define CGROUP_TASKSET_INIT(tset) \
{ \
.src_csets = LIST_HEAD_INIT(tset.src_csets), \
.dst_csets = LIST_HEAD_INIT(tset.dst_csets), \
.csets = &tset.src_csets, \
}
#define CGROUP_MGCTX_INIT(name) \
{ \
LIST_HEAD_INIT(name.preloaded_src_csets), \
LIST_HEAD_INIT(name.preloaded_dst_csets), \
CGROUP_TASKSET_INIT(name.tset), \
}
#define DEFINE_CGROUP_MGCTX(name) \
struct cgroup_mgctx name = CGROUP_MGCTX_INIT(name)
struct cgroup_sb_opts {
u16 subsys_mask;
unsigned int flags;
char *release_agent;
bool cpuset_clone_children;
char *name;
/* User explicitly requested empty subsystem */
bool none;
};
extern struct mutex cgroup_mutex;
extern spinlock_t css_set_lock;
extern struct cgroup_subsys *cgroup_subsys[];
extern struct list_head cgroup_roots;
extern struct file_system_type cgroup_fs_type;
/* iterate across the hierarchies */
#define for_each_root(root) \
list_for_each_entry((root), &cgroup_roots, root_list)
/**
* for_each_subsys - iterate all enabled cgroup subsystems
* @ss: the iteration cursor
* @ssid: the index of @ss, CGROUP_SUBSYS_COUNT after reaching the end
*/
#define for_each_subsys(ss, ssid) \
for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT && \
(((ss) = cgroup_subsys[ssid]) || true); (ssid)++)
static inline bool cgroup_is_dead(const struct cgroup *cgrp)
{
return !(cgrp->self.flags & CSS_ONLINE);
}
static inline bool notify_on_release(const struct cgroup *cgrp)
{
return test_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
}
void put_css_set_locked(struct css_set *cset);
static inline void put_css_set(struct css_set *cset)
{
unsigned long flags;
/*
* Ensure that the refcount doesn't hit zero while any readers
* can see it. Similar to atomic_dec_and_lock(), but for an
* rwlock
*/
if (atomic_add_unless(&cset->refcount, -1, 1))
return;
spin_lock_irqsave(&css_set_lock, flags);
put_css_set_locked(cset);
spin_unlock_irqrestore(&css_set_lock, flags);
}
/*
* refcounted get/put for css_set objects
*/
static inline void get_css_set(struct css_set *cset)
{
atomic_inc(&cset->refcount);
}
bool cgroup_ssid_enabled(int ssid);
bool cgroup_on_dfl(const struct cgroup *cgrp);
struct cgroup_root *cgroup_root_from_kf(struct kernfs_root *kf_root);
struct cgroup *task_cgroup_from_root(struct task_struct *task,
struct cgroup_root *root);
struct cgroup *cgroup_kn_lock_live(struct kernfs_node *kn, bool drain_offline);
void cgroup_kn_unlock(struct kernfs_node *kn);
int cgroup_path_ns_locked(struct cgroup *cgrp, char *buf, size_t buflen,
struct cgroup_namespace *ns);
void cgroup_free_root(struct cgroup_root *root);
void init_cgroup_root(struct cgroup_root *root, struct cgroup_sb_opts *opts);
int cgroup_setup_root(struct cgroup_root *root, u16 ss_mask);
int rebind_subsystems(struct cgroup_root *dst_root, u16 ss_mask);
struct dentry *cgroup_do_mount(struct file_system_type *fs_type, int flags,
struct cgroup_root *root, unsigned long magic,
struct cgroup_namespace *ns);
bool cgroup_may_migrate_to(struct cgroup *dst_cgrp);
void cgroup_migrate_finish(struct cgroup_mgctx *mgctx);
void cgroup_migrate_add_src(struct css_set *src_cset, struct cgroup *dst_cgrp,
struct cgroup_mgctx *mgctx);
int cgroup_migrate_prepare_dst(struct cgroup_mgctx *mgctx);
int cgroup_migrate(struct task_struct *leader, bool threadgroup,
struct cgroup_mgctx *mgctx);
int cgroup_attach_task(struct cgroup *dst_cgrp, struct task_struct *leader,
bool threadgroup);
ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off, bool threadgroup);
ssize_t cgroup_procs_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
loff_t off);
void cgroup_lock_and_drain_offline(struct cgroup *cgrp);
int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode);
int cgroup_rmdir(struct kernfs_node *kn);
int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
struct kernfs_root *kf_root);
/*
* namespace.c
*/
extern const struct proc_ns_operations cgroupns_operations;
/*
* cgroup-v1.c
*/
extern struct cftype cgroup1_base_files[];
extern const struct file_operations proc_cgroupstats_operations;
extern struct kernfs_syscall_ops cgroup1_kf_syscall_ops;
bool cgroup1_ssid_disabled(int ssid);
void cgroup1_pidlist_destroy_all(struct cgroup *cgrp);
void cgroup1_release_agent(struct work_struct *work);
void cgroup1_check_for_release(struct cgroup *cgrp);
struct dentry *cgroup1_mount(struct file_system_type *fs_type, int flags,
void *data, unsigned long magic,
struct cgroup_namespace *ns);
#endif /* __CGROUP_INTERNAL_H */
This diff is collapsed.
This diff is collapsed.
#include "cgroup-internal.h"
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/nsproxy.h>
#include <linux/proc_ns.h>
/* cgroup namespaces */
static struct ucounts *inc_cgroup_namespaces(struct user_namespace *ns)
{
return inc_ucount(ns, current_euid(), UCOUNT_CGROUP_NAMESPACES);
}
static void dec_cgroup_namespaces(struct ucounts *ucounts)
{
dec_ucount(ucounts, UCOUNT_CGROUP_NAMESPACES);
}
static struct cgroup_namespace *alloc_cgroup_ns(void)
{
struct cgroup_namespace *new_ns;
int ret;
new_ns = kzalloc(sizeof(struct cgroup_namespace), GFP_KERNEL);
if (!new_ns)
return ERR_PTR(-ENOMEM);
ret = ns_alloc_inum(&new_ns->ns);
if (ret) {
kfree(new_ns);
return ERR_PTR(ret);
}
atomic_set(&new_ns->count, 1);
new_ns->ns.ops = &cgroupns_operations;
return new_ns;
}
void free_cgroup_ns(struct cgroup_namespace *ns)
{
put_css_set(ns->root_cset);
dec_cgroup_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
kfree(ns);
}
EXPORT_SYMBOL(free_cgroup_ns);
struct cgroup_namespace *copy_cgroup_ns(unsigned long flags,
struct user_namespace *user_ns,
struct cgroup_namespace *old_ns)
{
struct cgroup_namespace *new_ns;
struct ucounts *ucounts;
struct css_set *cset;
BUG_ON(!old_ns);
if (!(flags & CLONE_NEWCGROUP)) {
get_cgroup_ns(old_ns);
return old_ns;
}
/* Allow only sysadmin to create cgroup namespace. */
if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
ucounts = inc_cgroup_namespaces(user_ns);
if (!ucounts)
return ERR_PTR(-ENOSPC);
/* It is not safe to take cgroup_mutex here */
spin_lock_irq(&css_set_lock);
cset = task_css_set(current);
get_css_set(cset);
spin_unlock_irq(&css_set_lock);
new_ns = alloc_cgroup_ns();
if (IS_ERR(new_ns)) {
put_css_set(cset);
dec_cgroup_namespaces(ucounts);
return new_ns;
}
new_ns->user_ns = get_user_ns(user_ns);
new_ns->ucounts = ucounts;
new_ns->root_cset = cset;
return new_ns;
}
static inline struct cgroup_namespace *to_cg_ns(struct ns_common *ns)
{
return container_of(ns, struct cgroup_namespace, ns);
}
static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
{
struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
!ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
return -EPERM;
/* Don't need to do anything if we are attaching to our own cgroupns. */
if (cgroup_ns == nsproxy->cgroup_ns)
return 0;
get_cgroup_ns(cgroup_ns);
put_cgroup_ns(nsproxy->cgroup_ns);
nsproxy->cgroup_ns = cgroup_ns;
return 0;
}
static struct ns_common *cgroupns_get(struct task_struct *task)
{
struct cgroup_namespace *ns = NULL;
struct nsproxy *nsproxy;
task_lock(task);
nsproxy = task->nsproxy;
if (nsproxy) {
ns = nsproxy->cgroup_ns;
get_cgroup_ns(ns);
}
task_unlock(task);
return ns ? &ns->ns : NULL;
}
static void cgroupns_put(struct ns_common *ns)
{
put_cgroup_ns(to_cg_ns(ns));
}
static struct user_namespace *cgroupns_owner(struct ns_common *ns)
{
return to_cg_ns(ns)->user_ns;
}
const struct proc_ns_operations cgroupns_operations = {
.name = "cgroup",
.type = CLONE_NEWCGROUP,
.get = cgroupns_get,
.put = cgroupns_put,
.install = cgroupns_install,
.owner = cgroupns_owner,
};
static __init int cgroup_namespaces_init(void)
{
return 0;
}
subsys_initcall(cgroup_namespaces_init);
This diff is collapsed.
......@@ -10959,5 +10959,11 @@ struct cgroup_subsys perf_event_cgrp_subsys = {
.css_alloc = perf_cgroup_css_alloc,
.css_free = perf_cgroup_css_free,
.attach = perf_cgroup_attach,
/*
* Implicitly enable on dfl hierarchy so that perf events can
* always be filtered by cgroup2 path as long as perf_event
* controller is not mounted on a legacy hierarchy.
*/
.implicit_on_dfl = true,
};
#endif /* CONFIG_CGROUP_PERF */
......@@ -12,8 +12,8 @@ cgroupfs_find_mountpoint(char *buf, size_t maxlen)
{
FILE *fp;
char mountpoint[PATH_MAX + 1], tokens[PATH_MAX + 1], type[PATH_MAX + 1];
char path_v1[PATH_MAX + 1], path_v2[PATH_MAX + 2], *path;
char *token, *saved_ptr = NULL;
int found = 0;
fp = fopen("/proc/mounts", "r");
if (!fp)
......@@ -24,31 +24,43 @@ cgroupfs_find_mountpoint(char *buf, size_t maxlen)
* and inspect every cgroupfs mount point to find one that has
* perf_event subsystem
*/
path_v1[0] = '\0';
path_v2[0] = '\0';
while (fscanf(fp, "%*s %"STR(PATH_MAX)"s %"STR(PATH_MAX)"s %"
STR(PATH_MAX)"s %*d %*d\n",
mountpoint, type, tokens) == 3) {
if (!strcmp(type, "cgroup")) {
if (!path_v1[0] && !strcmp(type, "cgroup")) {
token = strtok_r(tokens, ",", &saved_ptr);
while (token != NULL) {
if (!strcmp(token, "perf_event")) {
found = 1;
strcpy(path_v1, mountpoint);
break;
}
token = strtok_r(NULL, ",", &saved_ptr);
}
}
if (found)
if (!path_v2[0] && !strcmp(type, "cgroup2"))
strcpy(path_v2, mountpoint);
if (path_v1[0] && path_v2[0])
break;
}
fclose(fp);
if (!found)
if (path_v1[0])
path = path_v1;
else if (path_v2[0])
path = path_v2;
else
return -1;
if (strlen(mountpoint) < maxlen) {
strcpy(buf, mountpoint);
if (strlen(path) < maxlen) {
strcpy(buf, path);
return 0;
}
return -1;
......
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment