Commit bab8ac3c authored by Alexei Starovoitov's avatar Alexei Starovoitov

Merge branch 'add-open-coded-task-css_task-and-css-iters'

Chuyi Zhou says:

====================
Add Open-coded task, css_task and css iters

This is version 6 of task, css_task and css iters support.

--- Changelog ---

v5 -> v6:

Patch #3:
 * In bpf_iter_task_next, return pos rather than goto out. (Andrii)
Patch #2, #3, #4:
 * Add the missing __diag_ignore_all to avoid kernel build warning
Patch #5, #6, #7:
 * Add Andrii's ack

Patch #8:
 * In BPF prog iter_css_task_for_each, return -EPERM rather than 0, and
   ensure stack_mprotect() in iters.c not success. If not, it would cause
   the subsequent 'test_lsm' fail, since the 'is_stack' check in
   test_int_hook(lsm.c) would not be guaranteed.
   (https://github.com/kernel-patches/bpf/actions/runs/6489662214/job/17624665086?pr=5790)

v4 -> v5:https://lore.kernel.org/lkml/20231007124522.34834-1-zhouchuyi@bytedance.com/

Patch 3~4:
 * Relax the BUILD_BUG_ON check in bpf_iter_task_new and bpf_iter_css_new to avoid
   netdev/build_32bit CI error.
   (https://netdev.bots.linux.dev/static/nipa/790929/13412333/build_32bit/stderr)
Patch 8:
 * Initialize skel pointer to fix the LLVM-16 build CI error
   (https://github.com/kernel-patches/bpf/actions/runs/6462875618/job/17545170863)

v3 -> v4:https://lore.kernel.org/all/20230925105552.817513-1-zhouchuyi@bytedance.com/

* Address all the comments from Andrii in patch-3 ~ patch-6
* Collect Tejun's ack
* Add a extra patch to rename bpf_iter_task.c to bpf_iter_tasks.c
* Seperate three BPF program files for selftests (iters_task.c iters_css_task.c iters_css.c)

v2 -> v3:https://lore.kernel.org/lkml/20230912070149.969939-1-zhouchuyi@bytedance.com/

Patch 1 (cgroup: Prepare for using css_task_iter_*() in BPF)
  * Add tj's ack and Alexei's suggest-by.
Patch 2 (bpf: Introduce css_task open-coded iterator kfuncs)
  * Use bpf_mem_alloc/bpf_mem_free rather than kzalloc()
  * Add KF_TRUSTED_ARGS for bpf_iter_css_task_new (Alexei)
  * Move bpf_iter_css_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 3 (Introduce task open coded iterator kfuncs)
  * Change th API design keep consistent with SEC("iter/task"), support
    iterating all threads(BPF_TASK_ITERATE_ALL) and threads of a
    specific task (BPF_TASK_ITERATE_THREAD).(Andrii)
  * Move bpf_iter_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 4 (Introduce css open-coded iterator kfuncs)
  * Change th API design keep consistent with cgroup_iters, reuse
    BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST
    /BPF_CGROUP_ITER_ANCESTORS_UP(Andrii)
  * Add KF_TRUSTED_ARGS for bpf_iter_css_new
  * Move bpf_iter_css's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 5 (teach the verifier to enforce css_iter and task_iter in RCU CS)
  * Add KF flag KF_RCU_PROTECTED to maintain kfuncs which need RCU CS.(Andrii)
  * Consider STACK_ITER when using bpf_for_each_spilled_reg.
Patch 6 (Let bpf_iter_task_new accept null task ptr)
  * Add this extra patch to let bpf_iter_task_new accept a 'nullable'
  * task pointer(Andrii)
Patch 7 (selftests/bpf: Add tests for open-coded task and css iter)
  * Add failure testcase(Alexei)

Changes from v1(https://lore.kernel.org/lkml/20230827072057.1591929-1-zhouchuyi@bytedance.com/):
- Add a pre-patch to make some preparations before supporting css_task
  iters.(Alexei)
- Add an allowlist for css_task iters(Alexei)
- Let bpf progs do explicit bpf_rcu_read_lock() when using process
  iters and css_descendant iters.(Alexei)
---------------------

In some BPF usage scenarios, it will be useful to iterate the process and
css directly in the BPF program. One of the expected scenarios is
customizable OOM victim selection via BPF[1].

Inspired by Dave's task_vma iter[2], this patchset adds three types of
open-coded iterator kfuncs:

1. bpf_task_iters. It can be used to
1) iterate all process in the system, like for_each_forcess() in kernel.
2) iterate all threads in the system.
3) iterate all threads of a specific task

2. bpf_css_iters. It works like css_task_iter_{start, next, end} and would
be used to iterating tasks/threads under a css.

3. css_iters. It works like css_next_descendant_{pre, post} to iterating all
descendant css.

BPF programs can use these kfuncs directly or through bpf_for_each macro.

link[1]: https://lore.kernel.org/lkml/20230810081319.65668-1-zhouchuyi@bytedance.com/
link[2]: https://lore.kernel.org/all/20230810183513.684836-1-davemarchevsky@fb.com/
====================

Link: https://lore.kernel.org/r/20231018061746.111364-1-zhouchuyi@bytedance.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
parents 6bd5e167 130e0f7a
......@@ -386,19 +386,18 @@ struct bpf_verifier_state {
u32 jmp_history_cnt;
};
#define bpf_get_spilled_reg(slot, frame) \
#define bpf_get_spilled_reg(slot, frame, mask) \
(((slot < frame->allocated_stack / BPF_REG_SIZE) && \
(frame->stack[slot].slot_type[0] == STACK_SPILL)) \
((1 << frame->stack[slot].slot_type[0]) & (mask))) \
? &frame->stack[slot].spilled_ptr : NULL)
/* Iterate over 'frame', setting 'reg' to either NULL or a spilled register. */
#define bpf_for_each_spilled_reg(iter, frame, reg) \
for (iter = 0, reg = bpf_get_spilled_reg(iter, frame); \
#define bpf_for_each_spilled_reg(iter, frame, reg, mask) \
for (iter = 0, reg = bpf_get_spilled_reg(iter, frame, mask); \
iter < frame->allocated_stack / BPF_REG_SIZE; \
iter++, reg = bpf_get_spilled_reg(iter, frame))
iter++, reg = bpf_get_spilled_reg(iter, frame, mask))
/* Invoke __expr over regsiters in __vst, setting __state and __reg */
#define bpf_for_each_reg_in_vstate(__vst, __state, __reg, __expr) \
#define bpf_for_each_reg_in_vstate_mask(__vst, __state, __reg, __mask, __expr) \
({ \
struct bpf_verifier_state *___vstate = __vst; \
int ___i, ___j; \
......@@ -410,7 +409,7 @@ struct bpf_verifier_state {
__reg = &___regs[___j]; \
(void)(__expr); \
} \
bpf_for_each_spilled_reg(___j, __state, __reg) { \
bpf_for_each_spilled_reg(___j, __state, __reg, __mask) { \
if (!__reg) \
continue; \
(void)(__expr); \
......@@ -418,6 +417,10 @@ struct bpf_verifier_state {
} \
})
/* Invoke __expr over regsiters in __vst, setting __state and __reg */
#define bpf_for_each_reg_in_vstate(__vst, __state, __reg, __expr) \
bpf_for_each_reg_in_vstate_mask(__vst, __state, __reg, 1 << STACK_SPILL, __expr)
/* linked list of verifier states used to prune search */
struct bpf_verifier_state_list {
struct bpf_verifier_state state;
......
......@@ -74,6 +74,7 @@
#define KF_ITER_NEW (1 << 8) /* kfunc implements BPF iter constructor */
#define KF_ITER_NEXT (1 << 9) /* kfunc implements BPF iter next method */
#define KF_ITER_DESTROY (1 << 10) /* kfunc implements BPF iter destructor */
#define KF_RCU_PROTECTED (1 << 11) /* kfunc should be protected by rcu cs when they are invoked */
/*
* Tag marking a kernel function as a kfunc. This is meant to minimize the
......
......@@ -40,13 +40,11 @@ struct kernel_clone_args;
#define CGROUP_WEIGHT_DFL 100
#define CGROUP_WEIGHT_MAX 10000
/* walk only threadgroup leaders */
#define CSS_TASK_ITER_PROCS (1U << 0)
/* walk all threaded css_sets in the domain */
#define CSS_TASK_ITER_THREADED (1U << 1)
/* internal flags */
#define CSS_TASK_ITER_SKIPPED (1U << 16)
enum {
CSS_TASK_ITER_PROCS = (1U << 0), /* walk only threadgroup leaders */
CSS_TASK_ITER_THREADED = (1U << 1), /* walk all threaded css_sets in the domain */
CSS_TASK_ITER_SKIPPED = (1U << 16), /* internal flags */
};
/* a css_task_iter should be treated as an opaque object */
struct css_task_iter {
......
......@@ -294,3 +294,68 @@ static int __init bpf_cgroup_iter_init(void)
}
late_initcall(bpf_cgroup_iter_init);
struct bpf_iter_css {
__u64 __opaque[3];
} __attribute__((aligned(8)));
struct bpf_iter_css_kern {
struct cgroup_subsys_state *start;
struct cgroup_subsys_state *pos;
unsigned int flags;
} __attribute__((aligned(8)));
__diag_push();
__diag_ignore_all("-Wmissing-prototypes",
"Global functions as their definitions will be in vmlinux BTF");
__bpf_kfunc int bpf_iter_css_new(struct bpf_iter_css *it,
struct cgroup_subsys_state *start, unsigned int flags)
{
struct bpf_iter_css_kern *kit = (void *)it;
BUILD_BUG_ON(sizeof(struct bpf_iter_css_kern) > sizeof(struct bpf_iter_css));
BUILD_BUG_ON(__alignof__(struct bpf_iter_css_kern) != __alignof__(struct bpf_iter_css));
kit->start = NULL;
switch (flags) {
case BPF_CGROUP_ITER_DESCENDANTS_PRE:
case BPF_CGROUP_ITER_DESCENDANTS_POST:
case BPF_CGROUP_ITER_ANCESTORS_UP:
break;
default:
return -EINVAL;
}
kit->start = start;
kit->pos = NULL;
kit->flags = flags;
return 0;
}
__bpf_kfunc struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it)
{
struct bpf_iter_css_kern *kit = (void *)it;
if (!kit->start)
return NULL;
switch (kit->flags) {
case BPF_CGROUP_ITER_DESCENDANTS_PRE:
kit->pos = css_next_descendant_pre(kit->pos, kit->start);
break;
case BPF_CGROUP_ITER_DESCENDANTS_POST:
kit->pos = css_next_descendant_post(kit->pos, kit->start);
break;
case BPF_CGROUP_ITER_ANCESTORS_UP:
kit->pos = kit->pos ? kit->pos->parent : kit->start;
}
return kit->pos;
}
__bpf_kfunc void bpf_iter_css_destroy(struct bpf_iter_css *it)
{
}
__diag_pop();
\ No newline at end of file
......@@ -2560,6 +2560,15 @@ BTF_ID_FLAGS(func, bpf_iter_num_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, bpf_iter_task_vma_new, KF_ITER_NEW | KF_RCU)
BTF_ID_FLAGS(func, bpf_iter_task_vma_next, KF_ITER_NEXT | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_iter_task_vma_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, bpf_iter_css_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS)
BTF_ID_FLAGS(func, bpf_iter_css_task_next, KF_ITER_NEXT | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_iter_css_task_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, bpf_iter_task_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_RCU_PROTECTED)
BTF_ID_FLAGS(func, bpf_iter_task_next, KF_ITER_NEXT | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_iter_task_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, bpf_iter_css_new, KF_ITER_NEW | KF_TRUSTED_ARGS | KF_RCU_PROTECTED)
BTF_ID_FLAGS(func, bpf_iter_css_next, KF_ITER_NEXT | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_iter_css_destroy, KF_ITER_DESTROY)
BTF_ID_FLAGS(func, bpf_dynptr_adjust)
BTF_ID_FLAGS(func, bpf_dynptr_is_null)
BTF_ID_FLAGS(func, bpf_dynptr_is_rdonly)
......
......@@ -894,6 +894,157 @@ __bpf_kfunc void bpf_iter_task_vma_destroy(struct bpf_iter_task_vma *it)
__diag_pop();
struct bpf_iter_css_task {
__u64 __opaque[1];
} __attribute__((aligned(8)));
struct bpf_iter_css_task_kern {
struct css_task_iter *css_it;
} __attribute__((aligned(8)));
__diag_push();
__diag_ignore_all("-Wmissing-prototypes",
"Global functions as their definitions will be in vmlinux BTF");
__bpf_kfunc int bpf_iter_css_task_new(struct bpf_iter_css_task *it,
struct cgroup_subsys_state *css, unsigned int flags)
{
struct bpf_iter_css_task_kern *kit = (void *)it;
BUILD_BUG_ON(sizeof(struct bpf_iter_css_task_kern) != sizeof(struct bpf_iter_css_task));
BUILD_BUG_ON(__alignof__(struct bpf_iter_css_task_kern) !=
__alignof__(struct bpf_iter_css_task));
kit->css_it = NULL;
switch (flags) {
case CSS_TASK_ITER_PROCS | CSS_TASK_ITER_THREADED:
case CSS_TASK_ITER_PROCS:
case 0:
break;
default:
return -EINVAL;
}
kit->css_it = bpf_mem_alloc(&bpf_global_ma, sizeof(struct css_task_iter));
if (!kit->css_it)
return -ENOMEM;
css_task_iter_start(css, flags, kit->css_it);
return 0;
}
__bpf_kfunc struct task_struct *bpf_iter_css_task_next(struct bpf_iter_css_task *it)
{
struct bpf_iter_css_task_kern *kit = (void *)it;
if (!kit->css_it)
return NULL;
return css_task_iter_next(kit->css_it);
}
__bpf_kfunc void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it)
{
struct bpf_iter_css_task_kern *kit = (void *)it;
if (!kit->css_it)
return;
css_task_iter_end(kit->css_it);
bpf_mem_free(&bpf_global_ma, kit->css_it);
}
__diag_pop();
struct bpf_iter_task {
__u64 __opaque[3];
} __attribute__((aligned(8)));
struct bpf_iter_task_kern {
struct task_struct *task;
struct task_struct *pos;
unsigned int flags;
} __attribute__((aligned(8)));
enum {
/* all process in the system */
BPF_TASK_ITER_ALL_PROCS,
/* all threads in the system */
BPF_TASK_ITER_ALL_THREADS,
/* all threads of a specific process */
BPF_TASK_ITER_PROC_THREADS
};
__diag_push();
__diag_ignore_all("-Wmissing-prototypes",
"Global functions as their definitions will be in vmlinux BTF");
__bpf_kfunc int bpf_iter_task_new(struct bpf_iter_task *it,
struct task_struct *task__nullable, unsigned int flags)
{
struct bpf_iter_task_kern *kit = (void *)it;
BUILD_BUG_ON(sizeof(struct bpf_iter_task_kern) > sizeof(struct bpf_iter_task));
BUILD_BUG_ON(__alignof__(struct bpf_iter_task_kern) !=
__alignof__(struct bpf_iter_task));
kit->task = kit->pos = NULL;
switch (flags) {
case BPF_TASK_ITER_ALL_THREADS:
case BPF_TASK_ITER_ALL_PROCS:
break;
case BPF_TASK_ITER_PROC_THREADS:
if (!task__nullable)
return -EINVAL;
break;
default:
return -EINVAL;
}
if (flags == BPF_TASK_ITER_PROC_THREADS)
kit->task = task__nullable;
else
kit->task = &init_task;
kit->pos = kit->task;
kit->flags = flags;
return 0;
}
__bpf_kfunc struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it)
{
struct bpf_iter_task_kern *kit = (void *)it;
struct task_struct *pos;
unsigned int flags;
flags = kit->flags;
pos = kit->pos;
if (!pos)
return pos;
if (flags == BPF_TASK_ITER_ALL_PROCS)
goto get_next_task;
kit->pos = next_thread(kit->pos);
if (kit->pos == kit->task) {
if (flags == BPF_TASK_ITER_PROC_THREADS) {
kit->pos = NULL;
return pos;
}
} else
return pos;
get_next_task:
kit->pos = next_task(kit->pos);
kit->task = kit->pos;
if (kit->pos == &init_task)
kit->pos = NULL;
return pos;
}
__bpf_kfunc void bpf_iter_task_destroy(struct bpf_iter_task *it)
{
}
__diag_pop();
DEFINE_PER_CPU(struct mmap_unlock_irq_work, mmap_unlock_work);
static void do_mmap_read_unlock(struct irq_work *entry)
......
......@@ -1173,7 +1173,12 @@ static bool is_dynptr_type_expected(struct bpf_verifier_env *env, struct bpf_reg
static void __mark_reg_known_zero(struct bpf_reg_state *reg);
static bool in_rcu_cs(struct bpf_verifier_env *env);
static bool is_kfunc_rcu_protected(struct bpf_kfunc_call_arg_meta *meta);
static int mark_stack_slots_iter(struct bpf_verifier_env *env,
struct bpf_kfunc_call_arg_meta *meta,
struct bpf_reg_state *reg, int insn_idx,
struct btf *btf, u32 btf_id, int nr_slots)
{
......@@ -1194,6 +1199,12 @@ static int mark_stack_slots_iter(struct bpf_verifier_env *env,
__mark_reg_known_zero(st);
st->type = PTR_TO_STACK; /* we don't have dedicated reg type */
if (is_kfunc_rcu_protected(meta)) {
if (in_rcu_cs(env))
st->type |= MEM_RCU;
else
st->type |= PTR_UNTRUSTED;
}
st->live |= REG_LIVE_WRITTEN;
st->ref_obj_id = i == 0 ? id : 0;
st->iter.btf = btf;
......@@ -1268,7 +1279,7 @@ static bool is_iter_reg_valid_uninit(struct bpf_verifier_env *env,
return true;
}
static bool is_iter_reg_valid_init(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
static int is_iter_reg_valid_init(struct bpf_verifier_env *env, struct bpf_reg_state *reg,
struct btf *btf, u32 btf_id, int nr_slots)
{
struct bpf_func_state *state = func(env, reg);
......@@ -1276,26 +1287,28 @@ static bool is_iter_reg_valid_init(struct bpf_verifier_env *env, struct bpf_reg_
spi = iter_get_spi(env, reg, nr_slots);
if (spi < 0)
return false;
return -EINVAL;
for (i = 0; i < nr_slots; i++) {
struct bpf_stack_state *slot = &state->stack[spi - i];
struct bpf_reg_state *st = &slot->spilled_ptr;
if (st->type & PTR_UNTRUSTED)
return -EPROTO;
/* only main (first) slot has ref_obj_id set */
if (i == 0 && !st->ref_obj_id)
return false;
return -EINVAL;
if (i != 0 && st->ref_obj_id)
return false;
return -EINVAL;
if (st->iter.btf != btf || st->iter.btf_id != btf_id)
return false;
return -EINVAL;
for (j = 0; j < BPF_REG_SIZE; j++)
if (slot->slot_type[j] != STACK_ITER)
return false;
return -EINVAL;
}
return true;
return 0;
}
/* Check if given stack slot is "special":
......@@ -7640,15 +7653,24 @@ static int process_iter_arg(struct bpf_verifier_env *env, int regno, int insn_id
return err;
}
err = mark_stack_slots_iter(env, reg, insn_idx, meta->btf, btf_id, nr_slots);
err = mark_stack_slots_iter(env, meta, reg, insn_idx, meta->btf, btf_id, nr_slots);
if (err)
return err;
} else {
/* iter_next() or iter_destroy() expect initialized iter state*/
if (!is_iter_reg_valid_init(env, reg, meta->btf, btf_id, nr_slots)) {
err = is_iter_reg_valid_init(env, reg, meta->btf, btf_id, nr_slots);
switch (err) {
case 0:
break;
case -EINVAL:
verbose(env, "expected an initialized iter_%s as arg #%d\n",
iter_type_str(meta->btf, btf_id), regno);
return -EINVAL;
return err;
case -EPROTO:
verbose(env, "expected an RCU CS when using %s\n", meta->func_name);
return err;
default:
return err;
}
spi = iter_get_spi(env, reg, nr_slots);
......@@ -10231,6 +10253,11 @@ static bool is_kfunc_rcu(struct bpf_kfunc_call_arg_meta *meta)
return meta->kfunc_flags & KF_RCU;
}
static bool is_kfunc_rcu_protected(struct bpf_kfunc_call_arg_meta *meta)
{
return meta->kfunc_flags & KF_RCU_PROTECTED;
}
static bool __kfunc_param_match_suffix(const struct btf *btf,
const struct btf_param *arg,
const char *suffix)
......@@ -10305,6 +10332,11 @@ static bool is_kfunc_arg_refcounted_kptr(const struct btf *btf, const struct btf
return __kfunc_param_match_suffix(btf, arg, "__refcounted_kptr");
}
static bool is_kfunc_arg_nullable(const struct btf *btf, const struct btf_param *arg)
{
return __kfunc_param_match_suffix(btf, arg, "__nullable");
}
static bool is_kfunc_arg_scalar_with_name(const struct btf *btf,
const struct btf_param *arg,
const char *name)
......@@ -10447,6 +10479,7 @@ enum kfunc_ptr_arg_type {
KF_ARG_PTR_TO_CALLBACK,
KF_ARG_PTR_TO_RB_ROOT,
KF_ARG_PTR_TO_RB_NODE,
KF_ARG_PTR_TO_NULL,
};
enum special_kfunc_type {
......@@ -10472,6 +10505,7 @@ enum special_kfunc_type {
KF_bpf_percpu_obj_new_impl,
KF_bpf_percpu_obj_drop_impl,
KF_bpf_throw,
KF_bpf_iter_css_task_new,
};
BTF_SET_START(special_kfunc_set)
......@@ -10495,6 +10529,7 @@ BTF_ID(func, bpf_dynptr_clone)
BTF_ID(func, bpf_percpu_obj_new_impl)
BTF_ID(func, bpf_percpu_obj_drop_impl)
BTF_ID(func, bpf_throw)
BTF_ID(func, bpf_iter_css_task_new)
BTF_SET_END(special_kfunc_set)
BTF_ID_LIST(special_kfunc_list)
......@@ -10520,6 +10555,7 @@ BTF_ID(func, bpf_dynptr_clone)
BTF_ID(func, bpf_percpu_obj_new_impl)
BTF_ID(func, bpf_percpu_obj_drop_impl)
BTF_ID(func, bpf_throw)
BTF_ID(func, bpf_iter_css_task_new)
static bool is_kfunc_ret_null(struct bpf_kfunc_call_arg_meta *meta)
{
......@@ -10600,6 +10636,8 @@ get_kfunc_ptr_arg_type(struct bpf_verifier_env *env,
if (is_kfunc_arg_callback(env, meta->btf, &args[argno]))
return KF_ARG_PTR_TO_CALLBACK;
if (is_kfunc_arg_nullable(meta->btf, &args[argno]) && register_is_null(reg))
return KF_ARG_PTR_TO_NULL;
if (argno + 1 < nargs &&
(is_kfunc_arg_mem_size(meta->btf, &args[argno + 1], &regs[regno + 1]) ||
......@@ -11050,6 +11088,20 @@ static int process_kf_arg_ptr_to_rbtree_node(struct bpf_verifier_env *env,
&meta->arg_rbtree_root.field);
}
static bool check_css_task_iter_allowlist(struct bpf_verifier_env *env)
{
enum bpf_prog_type prog_type = resolve_prog_type(env->prog);
switch (prog_type) {
case BPF_PROG_TYPE_LSM:
return true;
case BPF_TRACE_ITER:
return env->prog->aux->sleepable;
default:
return false;
}
}
static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_arg_meta *meta,
int insn_idx)
{
......@@ -11136,7 +11188,8 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
}
if ((is_kfunc_trusted_args(meta) || is_kfunc_rcu(meta)) &&
(register_is_null(reg) || type_may_be_null(reg->type))) {
(register_is_null(reg) || type_may_be_null(reg->type)) &&
!is_kfunc_arg_nullable(meta->btf, &args[i])) {
verbose(env, "Possibly NULL pointer passed to trusted arg%d\n", i);
return -EACCES;
}
......@@ -11161,6 +11214,8 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
return kf_arg_type;
switch (kf_arg_type) {
case KF_ARG_PTR_TO_NULL:
continue;
case KF_ARG_PTR_TO_ALLOC_BTF_ID:
case KF_ARG_PTR_TO_BTF_ID:
if (!is_kfunc_trusted_args(meta) && !is_kfunc_rcu(meta))
......@@ -11300,6 +11355,12 @@ static int check_kfunc_args(struct bpf_verifier_env *env, struct bpf_kfunc_call_
break;
}
case KF_ARG_PTR_TO_ITER:
if (meta->func_id == special_kfunc_list[KF_bpf_iter_css_task_new]) {
if (!check_css_task_iter_allowlist(env)) {
verbose(env, "css_task_iter is only allowed in bpf_lsm and bpf iter-s\n");
return -EINVAL;
}
}
ret = process_iter_arg(env, regno, insn_idx, meta);
if (ret < 0)
return ret;
......@@ -11559,6 +11620,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
if (env->cur_state->active_rcu_lock) {
struct bpf_func_state *state;
struct bpf_reg_state *reg;
u32 clear_mask = (1 << STACK_SPILL) | (1 << STACK_ITER);
if (in_rbtree_lock_required_cb(env) && (rcu_lock || rcu_unlock)) {
verbose(env, "Calling bpf_rcu_read_{lock,unlock} in unnecessary rbtree callback\n");
......@@ -11569,7 +11631,7 @@ static int check_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
verbose(env, "nested rcu read lock (kernel function %s)\n", func_name);
return -EINVAL;
} else if (rcu_unlock) {
bpf_for_each_reg_in_vstate(env->cur_state, state, reg, ({
bpf_for_each_reg_in_vstate_mask(env->cur_state, state, reg, clear_mask, ({
if (reg->type & MEM_RCU) {
reg->type &= ~(MEM_RCU | PTR_MAYBE_NULL);
reg->type |= PTR_UNTRUSTED;
......
......@@ -4917,9 +4917,11 @@ static void css_task_iter_advance(struct css_task_iter *it)
void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
struct css_task_iter *it)
{
unsigned long irqflags;
memset(it, 0, sizeof(*it));
spin_lock_irq(&css_set_lock);
spin_lock_irqsave(&css_set_lock, irqflags);
it->ss = css->ss;
it->flags = flags;
......@@ -4933,7 +4935,7 @@ void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
css_task_iter_advance(it);
spin_unlock_irq(&css_set_lock);
spin_unlock_irqrestore(&css_set_lock, irqflags);
}
/**
......@@ -4946,12 +4948,14 @@ void css_task_iter_start(struct cgroup_subsys_state *css, unsigned int flags,
*/
struct task_struct *css_task_iter_next(struct css_task_iter *it)
{
unsigned long irqflags;
if (it->cur_task) {
put_task_struct(it->cur_task);
it->cur_task = NULL;
}
spin_lock_irq(&css_set_lock);
spin_lock_irqsave(&css_set_lock, irqflags);
/* @it may be half-advanced by skips, finish advancing */
if (it->flags & CSS_TASK_ITER_SKIPPED)
......@@ -4964,7 +4968,7 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
css_task_iter_advance(it);
}
spin_unlock_irq(&css_set_lock);
spin_unlock_irqrestore(&css_set_lock, irqflags);
return it->cur_task;
}
......@@ -4977,11 +4981,13 @@ struct task_struct *css_task_iter_next(struct css_task_iter *it)
*/
void css_task_iter_end(struct css_task_iter *it)
{
unsigned long irqflags;
if (it->cur_cset) {
spin_lock_irq(&css_set_lock);
spin_lock_irqsave(&css_set_lock, irqflags);
list_del(&it->iters_node);
put_css_set_locked(it->cur_cset);
spin_unlock_irq(&css_set_lock);
spin_unlock_irqrestore(&css_set_lock, irqflags);
}
if (it->cur_dcset)
......
......@@ -458,4 +458,23 @@ extern void bpf_throw(u64 cookie) __ksym;
__bpf_assert_op(LHS, <=, END, value, false); \
})
struct bpf_iter_css_task;
struct cgroup_subsys_state;
extern int bpf_iter_css_task_new(struct bpf_iter_css_task *it,
struct cgroup_subsys_state *css, unsigned int flags) __weak __ksym;
extern struct task_struct *bpf_iter_css_task_next(struct bpf_iter_css_task *it) __weak __ksym;
extern void bpf_iter_css_task_destroy(struct bpf_iter_css_task *it) __weak __ksym;
struct bpf_iter_task;
extern int bpf_iter_task_new(struct bpf_iter_task *it,
struct task_struct *task, unsigned int flags) __weak __ksym;
extern struct task_struct *bpf_iter_task_next(struct bpf_iter_task *it) __weak __ksym;
extern void bpf_iter_task_destroy(struct bpf_iter_task *it) __weak __ksym;
struct bpf_iter_css;
extern int bpf_iter_css_new(struct bpf_iter_css *it,
struct cgroup_subsys_state *start, unsigned int flags) __weak __ksym;
extern struct cgroup_subsys_state *bpf_iter_css_next(struct bpf_iter_css *it) __weak __ksym;
extern void bpf_iter_css_destroy(struct bpf_iter_css *it) __weak __ksym;
#endif
......@@ -7,7 +7,7 @@
#include "bpf_iter_ipv6_route.skel.h"
#include "bpf_iter_netlink.skel.h"
#include "bpf_iter_bpf_map.skel.h"
#include "bpf_iter_task.skel.h"
#include "bpf_iter_tasks.skel.h"
#include "bpf_iter_task_stack.skel.h"
#include "bpf_iter_task_file.skel.h"
#include "bpf_iter_task_vmas.skel.h"
......@@ -215,12 +215,12 @@ static void *do_nothing_wait(void *arg)
static void test_task_common_nocheck(struct bpf_iter_attach_opts *opts,
int *num_unknown, int *num_known)
{
struct bpf_iter_task *skel;
struct bpf_iter_tasks *skel;
pthread_t thread_id;
void *ret;
skel = bpf_iter_task__open_and_load();
if (!ASSERT_OK_PTR(skel, "bpf_iter_task__open_and_load"))
skel = bpf_iter_tasks__open_and_load();
if (!ASSERT_OK_PTR(skel, "bpf_iter_tasks__open_and_load"))
return;
ASSERT_OK(pthread_mutex_lock(&do_nothing_mutex), "pthread_mutex_lock");
......@@ -239,7 +239,7 @@ static void test_task_common_nocheck(struct bpf_iter_attach_opts *opts,
ASSERT_FALSE(pthread_join(thread_id, &ret) || ret != NULL,
"pthread_join");
bpf_iter_task__destroy(skel);
bpf_iter_tasks__destroy(skel);
}
static void test_task_common(struct bpf_iter_attach_opts *opts, int num_unknown, int num_known)
......@@ -307,10 +307,10 @@ static void test_task_pidfd(void)
static void test_task_sleepable(void)
{
struct bpf_iter_task *skel;
struct bpf_iter_tasks *skel;
skel = bpf_iter_task__open_and_load();
if (!ASSERT_OK_PTR(skel, "bpf_iter_task__open_and_load"))
skel = bpf_iter_tasks__open_and_load();
if (!ASSERT_OK_PTR(skel, "bpf_iter_tasks__open_and_load"))
return;
do_dummy_read(skel->progs.dump_task_sleepable);
......@@ -320,7 +320,7 @@ static void test_task_sleepable(void)
ASSERT_GT(skel->bss->num_success_copy_from_user_task, 0,
"num_success_copy_from_user_task");
bpf_iter_task__destroy(skel);
bpf_iter_tasks__destroy(skel);
}
static void test_task_stack(void)
......
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2023 Meta Platforms, Inc. and affiliates. */
#include <sys/syscall.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>
#include <malloc.h>
#include <stdlib.h>
#include <test_progs.h>
#include "cgroup_helpers.h"
#include "iters.skel.h"
#include "iters_state_safety.skel.h"
......@@ -9,6 +16,10 @@
#include "iters_num.skel.h"
#include "iters_testmod_seq.skel.h"
#include "iters_task_vma.skel.h"
#include "iters_task.skel.h"
#include "iters_css_task.skel.h"
#include "iters_css.skel.h"
#include "iters_task_failure.skel.h"
static void subtest_num_iters(void)
{
......@@ -146,6 +157,138 @@ static void subtest_task_vma_iters(void)
iters_task_vma__destroy(skel);
}
static pthread_mutex_t do_nothing_mutex;
static void *do_nothing_wait(void *arg)
{
pthread_mutex_lock(&do_nothing_mutex);
pthread_mutex_unlock(&do_nothing_mutex);
pthread_exit(arg);
}
#define thread_num 2
static void subtest_task_iters(void)
{
struct iters_task *skel = NULL;
pthread_t thread_ids[thread_num];
void *ret;
int err;
skel = iters_task__open_and_load();
if (!ASSERT_OK_PTR(skel, "open_and_load"))
goto cleanup;
skel->bss->target_pid = getpid();
err = iters_task__attach(skel);
if (!ASSERT_OK(err, "iters_task__attach"))
goto cleanup;
pthread_mutex_lock(&do_nothing_mutex);
for (int i = 0; i < thread_num; i++)
ASSERT_OK(pthread_create(&thread_ids[i], NULL, &do_nothing_wait, NULL),
"pthread_create");
syscall(SYS_getpgid);
iters_task__detach(skel);
ASSERT_EQ(skel->bss->procs_cnt, 1, "procs_cnt");
ASSERT_EQ(skel->bss->threads_cnt, thread_num + 1, "threads_cnt");
ASSERT_EQ(skel->bss->proc_threads_cnt, thread_num + 1, "proc_threads_cnt");
pthread_mutex_unlock(&do_nothing_mutex);
for (int i = 0; i < thread_num; i++)
ASSERT_OK(pthread_join(thread_ids[i], &ret), "pthread_join");
cleanup:
iters_task__destroy(skel);
}
extern int stack_mprotect(void);
static void subtest_css_task_iters(void)
{
struct iters_css_task *skel = NULL;
int err, cg_fd, cg_id;
const char *cgrp_path = "/cg1";
err = setup_cgroup_environment();
if (!ASSERT_OK(err, "setup_cgroup_environment"))
goto cleanup;
cg_fd = create_and_get_cgroup(cgrp_path);
if (!ASSERT_GE(cg_fd, 0, "create_and_get_cgroup"))
goto cleanup;
cg_id = get_cgroup_id(cgrp_path);
err = join_cgroup(cgrp_path);
if (!ASSERT_OK(err, "join_cgroup"))
goto cleanup;
skel = iters_css_task__open_and_load();
if (!ASSERT_OK_PTR(skel, "open_and_load"))
goto cleanup;
skel->bss->target_pid = getpid();
skel->bss->cg_id = cg_id;
err = iters_css_task__attach(skel);
if (!ASSERT_OK(err, "iters_task__attach"))
goto cleanup;
err = stack_mprotect();
if (!ASSERT_EQ(err, -1, "stack_mprotect") ||
!ASSERT_EQ(errno, EPERM, "stack_mprotect"))
goto cleanup;
iters_css_task__detach(skel);
ASSERT_EQ(skel->bss->css_task_cnt, 1, "css_task_cnt");
cleanup:
cleanup_cgroup_environment();
iters_css_task__destroy(skel);
}
static void subtest_css_iters(void)
{
struct iters_css *skel = NULL;
struct {
const char *path;
int fd;
} cgs[] = {
{ "/cg1" },
{ "/cg1/cg2" },
{ "/cg1/cg2/cg3" },
{ "/cg1/cg2/cg3/cg4" },
};
int err, cg_nr = ARRAY_SIZE(cgs);
int i;
err = setup_cgroup_environment();
if (!ASSERT_OK(err, "setup_cgroup_environment"))
goto cleanup;
for (i = 0; i < cg_nr; i++) {
cgs[i].fd = create_and_get_cgroup(cgs[i].path);
if (!ASSERT_GE(cgs[i].fd, 0, "create_and_get_cgroup"))
goto cleanup;
}
skel = iters_css__open_and_load();
if (!ASSERT_OK_PTR(skel, "open_and_load"))
goto cleanup;
skel->bss->target_pid = getpid();
skel->bss->root_cg_id = get_cgroup_id(cgs[0].path);
skel->bss->leaf_cg_id = get_cgroup_id(cgs[cg_nr - 1].path);
err = iters_css__attach(skel);
if (!ASSERT_OK(err, "iters_task__attach"))
goto cleanup;
syscall(SYS_getpgid);
ASSERT_EQ(skel->bss->pre_order_cnt, cg_nr, "pre_order_cnt");
ASSERT_EQ(skel->bss->first_cg_id, get_cgroup_id(cgs[0].path), "first_cg_id");
ASSERT_EQ(skel->bss->post_order_cnt, cg_nr, "post_order_cnt");
ASSERT_EQ(skel->bss->last_cg_id, get_cgroup_id(cgs[0].path), "last_cg_id");
ASSERT_EQ(skel->bss->tree_high, cg_nr - 1, "tree_high");
iters_css__detach(skel);
cleanup:
cleanup_cgroup_environment();
iters_css__destroy(skel);
}
void test_iters(void)
{
RUN_TESTS(iters_state_safety);
......@@ -161,4 +304,11 @@ void test_iters(void)
subtest_testmod_seq_iters();
if (test__start_subtest("task_vma"))
subtest_task_vma_iters();
if (test__start_subtest("task"))
subtest_task_iters();
if (test__start_subtest("css_task"))
subtest_css_task_iters();
if (test__start_subtest("css"))
subtest_css_iters();
RUN_TESTS(iters_task_failure);
}
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_misc.h"
#include "bpf_experimental.h"
char _license[] SEC("license") = "GPL";
pid_t target_pid;
u64 root_cg_id, leaf_cg_id;
u64 first_cg_id, last_cg_id;
int pre_order_cnt, post_order_cnt, tree_high;
struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
void bpf_cgroup_release(struct cgroup *p) __ksym;
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;
SEC("fentry.s/" SYS_PREFIX "sys_getpgid")
int iter_css_for_each(const void *ctx)
{
struct task_struct *cur_task = bpf_get_current_task_btf();
struct cgroup_subsys_state *root_css, *leaf_css, *pos;
struct cgroup *root_cgrp, *leaf_cgrp, *cur_cgrp;
if (cur_task->pid != target_pid)
return 0;
root_cgrp = bpf_cgroup_from_id(root_cg_id);
if (!root_cgrp)
return 0;
leaf_cgrp = bpf_cgroup_from_id(leaf_cg_id);
if (!leaf_cgrp) {
bpf_cgroup_release(root_cgrp);
return 0;
}
root_css = &root_cgrp->self;
leaf_css = &leaf_cgrp->self;
pre_order_cnt = post_order_cnt = tree_high = 0;
first_cg_id = last_cg_id = 0;
bpf_rcu_read_lock();
bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
cur_cgrp = pos->cgroup;
post_order_cnt++;
last_cg_id = cur_cgrp->kn->id;
}
bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_PRE) {
cur_cgrp = pos->cgroup;
pre_order_cnt++;
if (!first_cg_id)
first_cg_id = cur_cgrp->kn->id;
}
bpf_for_each(css, pos, leaf_css, BPF_CGROUP_ITER_ANCESTORS_UP)
tree_high++;
bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_ANCESTORS_UP)
tree_high--;
bpf_rcu_read_unlock();
bpf_cgroup_release(root_cgrp);
bpf_cgroup_release(leaf_cgrp);
return 0;
}
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */
#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_misc.h"
#include "bpf_experimental.h"
char _license[] SEC("license") = "GPL";
struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
void bpf_cgroup_release(struct cgroup *p) __ksym;
pid_t target_pid;
int css_task_cnt;
u64 cg_id;
SEC("lsm/file_mprotect")
int BPF_PROG(iter_css_task_for_each, struct vm_area_struct *vma,
unsigned long reqprot, unsigned long prot, int ret)
{
struct task_struct *cur_task = bpf_get_current_task_btf();
struct cgroup_subsys_state *css;
struct task_struct *task;
struct cgroup *cgrp;
if (cur_task->pid != target_pid)
return ret;
cgrp = bpf_cgroup_from_id(cg_id);
if (!cgrp)
return -EPERM;
css = &cgrp->self;
css_task_cnt = 0;
bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS)
if (task->pid == target_pid)
css_task_cnt++;
bpf_cgroup_release(cgrp);
return -EPERM;
}
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_misc.h"
#include "bpf_experimental.h"
char _license[] SEC("license") = "GPL";
pid_t target_pid;
int procs_cnt, threads_cnt, proc_threads_cnt;
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;
SEC("fentry.s/" SYS_PREFIX "sys_getpgid")
int iter_task_for_each_sleep(void *ctx)
{
struct task_struct *cur_task = bpf_get_current_task_btf();
struct task_struct *pos;
if (cur_task->pid != target_pid)
return 0;
procs_cnt = threads_cnt = proc_threads_cnt = 0;
bpf_rcu_read_lock();
bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_PROCS)
if (pos->pid == target_pid)
procs_cnt++;
bpf_for_each(task, pos, cur_task, BPF_TASK_ITER_PROC_THREADS)
proc_threads_cnt++;
bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_THREADS)
if (pos->tgid == target_pid)
threads_cnt++;
bpf_rcu_read_unlock();
return 0;
}
// SPDX-License-Identifier: GPL-2.0
/* Copyright (C) 2023 Chuyi Zhou <zhouchuyi@bytedance.com> */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include "bpf_misc.h"
#include "bpf_experimental.h"
char _license[] SEC("license") = "GPL";
struct cgroup *bpf_cgroup_from_id(u64 cgid) __ksym;
void bpf_cgroup_release(struct cgroup *p) __ksym;
void bpf_rcu_read_lock(void) __ksym;
void bpf_rcu_read_unlock(void) __ksym;
SEC("?fentry.s/" SYS_PREFIX "sys_getpgid")
__failure __msg("expected an RCU CS when using bpf_iter_task_next")
int BPF_PROG(iter_tasks_without_lock)
{
struct task_struct *pos;
bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_PROCS) {
}
return 0;
}
SEC("?fentry.s/" SYS_PREFIX "sys_getpgid")
__failure __msg("expected an RCU CS when using bpf_iter_css_next")
int BPF_PROG(iter_css_without_lock)
{
u64 cg_id = bpf_get_current_cgroup_id();
struct cgroup *cgrp = bpf_cgroup_from_id(cg_id);
struct cgroup_subsys_state *root_css, *pos;
if (!cgrp)
return 0;
root_css = &cgrp->self;
bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
}
bpf_cgroup_release(cgrp);
return 0;
}
SEC("?fentry.s/" SYS_PREFIX "sys_getpgid")
__failure __msg("expected an RCU CS when using bpf_iter_task_next")
int BPF_PROG(iter_tasks_lock_and_unlock)
{
struct task_struct *pos;
bpf_rcu_read_lock();
bpf_for_each(task, pos, NULL, BPF_TASK_ITER_ALL_PROCS) {
bpf_rcu_read_unlock();
bpf_rcu_read_lock();
}
bpf_rcu_read_unlock();
return 0;
}
SEC("?fentry.s/" SYS_PREFIX "sys_getpgid")
__failure __msg("expected an RCU CS when using bpf_iter_css_next")
int BPF_PROG(iter_css_lock_and_unlock)
{
u64 cg_id = bpf_get_current_cgroup_id();
struct cgroup *cgrp = bpf_cgroup_from_id(cg_id);
struct cgroup_subsys_state *root_css, *pos;
if (!cgrp)
return 0;
root_css = &cgrp->self;
bpf_rcu_read_lock();
bpf_for_each(css, pos, root_css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
bpf_rcu_read_unlock();
bpf_rcu_read_lock();
}
bpf_rcu_read_unlock();
bpf_cgroup_release(cgrp);
return 0;
}
SEC("?fentry.s/" SYS_PREFIX "sys_getpgid")
__failure __msg("css_task_iter is only allowed in bpf_lsm and bpf iter-s")
int BPF_PROG(iter_css_task_for_each)
{
u64 cg_id = bpf_get_current_cgroup_id();
struct cgroup *cgrp = bpf_cgroup_from_id(cg_id);
struct cgroup_subsys_state *css;
struct task_struct *task;
if (cgrp == NULL)
return 0;
css = &cgrp->self;
bpf_for_each(css_task, task, css, CSS_TASK_ITER_PROCS) {
}
bpf_cgroup_release(cgrp);
return 0;
}
Markdown is supported
0%
or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment