1. 09 Dec, 2023 4 commits
  2. 08 Dec, 2023 5 commits
    • Merge branch 'bpf-fix-accesses-to-uninit-stack-slots' · 4af20ab9
      Andrii Nakryiko authored
      Andrei Matei says:
      
      ====================
      bpf: fix accesses to uninit stack slots
      
      Fix two related issues around verifying stack accesses:
      1. accesses to uninitialized stack memory were allowed inconsistently
      2. the maximum stack depth needed for a program was not always
      maintained correctly
      
      The two issues are fixed together in one commit because the code for one
      affects the other.
      
      V4 to V5:
      - target bpf-next (Alexei)
      
      V3 to V4:
      - minor fixup to comment in patch 1 (Eduard)
      - C89-style in patch 3 (Andrii)
      
      V2 to V3:
      - address review comments from Andrii and Eduard
      - drop new verifier tests in favor of editing existing tests to check
        for stack depth
      - append a patch with a bit of cleanup coming out of the previous review
      ====================
      
      Link: https://lore.kernel.org/r/20231208032519.260451-1-andreimatei1@gmail.com
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      4af20ab9
    • bpf: Minor cleanup around stack bounds · 2929bfac
      Andrei Matei authored
      Push the rounding up of stack offsets into the function responsible for
      growing the stack, rather than relying on all the callers to do it.
      Uncertainty about whether the callers did it or not tripped up people in
      a previous review.
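      As a rough illustration of the idea (a sketch, not the patch itself), the
      rounding can live inside the grow function so that callers pass raw sizes.
      The struct and function names below are hypothetical; only the 8-byte slot
      size (BPF_REG_SIZE) mirrors the verifier.

```c
#include <assert.h>

#define BPF_REG_SIZE 8 /* bytes per stack slot */

/* Hypothetical stand-in for the verifier's per-function stack state. */
struct toy_stack_state {
	int allocated_stack; /* bytes allocated, always a multiple of BPF_REG_SIZE */
};

/* Round size up to the next multiple of BPF_REG_SIZE. */
int toy_round_up_slot(int size)
{
	return (size + BPF_REG_SIZE - 1) / BPF_REG_SIZE * BPF_REG_SIZE;
}

/*
 * Grow the stack so that at least `size` bytes are allocated.
 * The rounding happens here, so callers may pass an unrounded size.
 */
void toy_grow_stack(struct toy_stack_state *st, int size)
{
	assert(size >= 0);
	size = toy_round_up_slot(size);
	if (size > st->allocated_stack)
		st->allocated_stack = size;
}
```

      With this shape, a caller asking for 17 bytes ends up with 24 bytes
      allocated without having to round anything itself.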
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20231208032519.260451-4-andreimatei1@gmail.com
      2929bfac
    • bpf: Fix accesses to uninit stack slots · 6b4a64ba
      Andrei Matei authored
      Privileged programs are supposed to be able to read uninitialized stack
      memory (ever since 6715df8d) but, before this patch, these accesses
      were permitted inconsistently. In particular, accesses were permitted
      above state->allocated_stack, but not below it. In other words, if the
      stack was already "large enough", the access was permitted, but
      otherwise the access was rejected instead of being allowed to "grow the
      stack". This undesired rejection was happening in two places:
      - in check_stack_slot_within_bounds()
      - in check_stack_range_initialized()
      This patch arranges for these accesses to be permitted. A bunch of tests
      that were relying on the old rejection had to change; all of them were
      changed to also run unprivileged, in which case the old behavior
      persists. One test couldn't be updated - global_func16 - because it
      can't run unprivileged for other reasons.
      
      This patch also fixes the tracking of the stack size for variable-offset
      reads. This second fix is bundled in the same commit as the first one
      because they're inter-related. Before this patch, writes to the stack
      using registers containing a variable offset (as opposed to registers
      with fixed, known values) were not properly contributing to the
      function's needed stack size. As a result, it was possible for a program
      to verify, but then to attempt to read out-of-bounds data at runtime
      because a too small stack had been allocated for it.
      
      Each function tracks the size of the stack it needs in
      bpf_subprog_info.stack_depth, which is maintained by
      update_stack_depth(). For regular memory accesses, check_mem_access()
      was calling update_stack_depth() but it was passing in only the fixed
      part of the offset register, ignoring the variable offset. This was
      incorrect; the minimum possible value of that register should be used
      instead.
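
      A toy model of the difference, assuming a stack pointer of the form
      r10 + fixed_off + var_off with var_off known to lie in [var_min, var_max].
      The struct and function names are hypothetical and the real verifier
      tracks register bounds in a richer way; this only shows why the minimum
      possible offset is the one that matters for the stack depth.

```c
#include <assert.h>

/* Hypothetical model of a stack pointer: r10 + fixed_off + var_off. */
struct toy_stack_ptr {
	int fixed_off; /* constant part of the offset */
	int var_min;   /* smallest possible value of the variable part */
	int var_max;   /* largest possible value of the variable part */
};

/*
 * Stack depth needed by an access through p: the distance from the frame
 * pointer down to the lowest byte the access can start at.
 */
int toy_needed_stack_depth(const struct toy_stack_ptr *p)
{
	int min_off = p->fixed_off + p->var_min;

	assert(min_off < 0); /* stack accesses lie below the frame pointer */
	return -min_off;
}

/* The computation described above as incorrect: the variable part is ignored. */
int toy_needed_stack_depth_fixed_only(const struct toy_stack_ptr *p)
{
	return -p->fixed_off;
}
```

      For fixed_off = -8 and var_off in [-24, 0], the first function reports 32
      bytes while the fixed-only variant reports 8, which is the kind of
      under-allocation the message describes.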
      
      This tracking is now fixed by centralizing the tracking of stack size in
      grow_stack_state(), and by lifting the calls to grow_stack_state() to
      check_stack_access_within_bounds() as suggested by Andrii. The code is
      now simpler and more convincingly tracks the correct maximum stack size.
      check_stack_range_initialized() can now rely on enough stack having been
      allocated for the access; this helps with the fix for the first issue.
      
      A few tests were changed to also check the stack depth computation. The
      one that fails without this patch is verifier_var_off:stack_write_priv_vs_unpriv.
      
      Fixes: 01f810ac ("bpf: Allow variable-offset stack access")
      Reported-by: Hao Sun <sunhao.th@gmail.com>
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20231208032519.260451-3-andreimatei1@gmail.com
      
      Closes: https://lore.kernel.org/bpf/CABWLsev9g8UP_c3a=1qbuZUi20tGoUXoU07FPf-5FLvhOKOY+Q@mail.gmail.com/
      6b4a64ba
    • bpf: Add some comments to stack representation · 92e1567e
      Andrei Matei authored
      Add comments to the data structure tracking the stack state, as the
      mapping between each stack slot and where its state is stored is not
      entirely obvious.
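
      For orientation, here is a small sketch of one such mapping, assuming the
      convention that slot 0 covers the 8 bytes just below the frame pointer
      (offsets -1 through -8), slot 1 the next 8 bytes, and so on. The helper
      names are hypothetical and the indexing is an assumption about the
      verifier's layout rather than a quote from the added comments.

```c
#include <assert.h>

#define BPF_REG_SIZE 8 /* bytes per stack slot */

/* Index of the slot holding the byte at frame-pointer offset off (off < 0). */
int toy_slot_index(int off)
{
	assert(off < 0);
	return (-off - 1) / BPF_REG_SIZE;
}

/* Index of that byte within the slot's per-byte type array. */
int toy_byte_index(int off)
{
	assert(off < 0);
	return (-off - 1) % BPF_REG_SIZE;
}
```

      Under this assumption, offsets -1 and -8 both land in slot 0, while -9 is
      the first byte tracked by slot 1.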
      Signed-off-by: Andrei Matei <andreimatei1@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/bpf/20231208032519.260451-2-andreimatei1@gmail.com
      92e1567e
    • bpf: Load vmlinux btf for any struct_ops map · 8b7b0e5f
      David Vernet authored
      In libbpf, when determining whether we need to load vmlinux btf, we're
      currently (among other things) checking whether there is any struct_ops
      program present in the object. This works for most realistic struct_ops
      maps, as a struct_ops map is of course typically composed of one or more
      struct_ops programs. However, that technically need not be the case. A
      struct_ops interface could be defined which allows a map to be specified
      with one or more non-prog fields, and which provides default behavior
      if no struct_ops prog is actually provided otherwise. For sched_ext,
      for example, you technically only need to specify the name of the
      scheduler in the struct_ops map, with the core scheduler logic providing
      default behavior if no prog is actually specified.
      
      If we were to define and try to load such a struct_ops map, we would
      crash in libbpf when initializing it as obj->btf_vmlinux will be NULL:
      
      Reading symbols from minimal...
      (gdb) r
      Starting program: minimal_example
      [Thread debugging using libthread_db enabled]
      Using host libthread_db library "/usr/lib/libthread_db.so.1".
      
      Program received signal SIGSEGV, Segmentation fault.
      0x000055555558308c in btf__type_cnt (btf=0x0) at btf.c:612
      612             return btf->start_id + btf->nr_types;
      (gdb) bt
          type_name=0x5555555d99e3 "sched_ext_ops", kind=4) at btf.c:914
          kind=4) at btf.c:942
          type=0x7fffffffe558, type_id=0x7fffffffe548, ...
          data_member=0x7fffffffe568) at libbpf.c:948
          kern_btf=0x0) at libbpf.c:1017
          at libbpf.c:8059
      
      So as to account for such bare-bones struct_ops maps, let's update
      obj_needs_vmlinux_btf() to also iterate over an obj's maps and check
      whether any of them are struct_ops maps.
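
      The shape of the check can be sketched as follows. This is a simplified
      model with hypothetical types, not libbpf's actual structures, and it
      leaves out the other conditions the real function considers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced program and map kinds. */
enum toy_prog_type { TOY_PROG_KPROBE, TOY_PROG_STRUCT_OPS };
enum toy_map_type { TOY_MAP_HASH, TOY_MAP_ARRAY, TOY_MAP_STRUCT_OPS };

/* Hypothetical stand-in for a loaded BPF object. */
struct toy_object {
	const enum toy_prog_type *progs;
	size_t nr_progs;
	const enum toy_map_type *maps;
	size_t nr_maps;
};

/* Returns true if vmlinux BTF is needed because of struct_ops usage. */
bool toy_needs_vmlinux_btf(const struct toy_object *obj)
{
	size_t i;

	assert(obj != NULL);

	/* Existing check: any struct_ops program present. */
	for (i = 0; i < obj->nr_progs; i++) {
		if (obj->progs[i] == TOY_PROG_STRUCT_OPS)
			return true;
	}

	/* Added check: any struct_ops map present, even one without programs. */
	for (i = 0; i < obj->nr_maps; i++) {
		if (obj->maps[i] == TOY_MAP_STRUCT_OPS)
			return true;
	}

	return false;
}
```

      An object with zero programs and a single struct_ops map now answers
      true, which is the bare-bones case that previously led to the NULL
      dereference.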
      Signed-off-by: David Vernet <void@manifault.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
      Link: https://lore.kernel.org/bpf/20231208061704.400463-1-void@manifault.com
      8b7b0e5f
  3. 07 Dec, 2023 12 commits
  4. 06 Dec, 2023 19 commits
    • bpf: rename MAX_BPF_LINK_TYPE into __MAX_BPF_LINK_TYPE for consistency · 7065eefb
      Andrii Nakryiko authored
      To stay consistent with the naming pattern used for similar cases in BPF
      UAPI (__MAX_BPF_ATTACH_TYPE, etc), rename MAX_BPF_LINK_TYPE into
      __MAX_BPF_LINK_TYPE.
      
      Also similar to MAX_BPF_ATTACH_TYPE and MAX_BPF_REG, add:
      
        #define MAX_BPF_LINK_TYPE __MAX_BPF_LINK_TYPE
      
      Not all __MAX_xxx enums have such #define, so I'm not sure if we should
      add it or not, but I figured I'll start with a completely backwards
      compatible way, and we can drop that, if necessary.
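
      The naming pattern can be shown on a made-up enum (the TOY_ names are
      hypothetical and not part of the UAPI): the sentinel gets the
      double-underscore name, and a #define keeps the old spelling usable.

```c
#include <assert.h>

/* Hypothetical enum following the __MAX_xxx sentinel pattern. */
enum toy_link_type {
	TOY_LINK_TYPE_UNSPEC = 0,
	TOY_LINK_TYPE_A = 1,
	TOY_LINK_TYPE_B = 2,

	__MAX_TOY_LINK_TYPE, /* one past the last valid value */
};

/* Backwards-compatible alias for code that used the old name. */
#define MAX_TOY_LINK_TYPE __MAX_TOY_LINK_TYPE

/* Compile-time check that the alias and the sentinel agree. */
_Static_assert(MAX_TOY_LINK_TYPE == __MAX_TOY_LINK_TYPE,
	       "alias must match the sentinel");

/* Returns non-zero if t names a defined, non-UNSPEC link type. */
int toy_link_type_is_valid(int t)
{
	return t > TOY_LINK_TYPE_UNSPEC && t < MAX_TOY_LINK_TYPE;
}
```

      Existing users of the old name keep compiling, and the sentinel moves
      automatically when a new enumerator is added before it.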
      
      Also adjust a selftest that used MAX_BPF_LINK_TYPE enum.
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231206190920.1651226-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      7065eefb
    • Merge branch 'bpf-token-and-bpf-fs-based-delegation' · c35919dc
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      BPF token and BPF FS-based delegation
      
      This patch set introduces an ability to delegate a subset of BPF subsystem
      functionality from privileged system-wide daemon (e.g., systemd or any other
      container manager) through special mount options for userns-bound BPF FS to
      a *trusted* unprivileged application. Trust is the key here. This
      functionality is not about allowing unconditional unprivileged BPF usage.
      Establishing trust, though, is completely up to the discretion of respective
      privileged application that would create and mount a BPF FS instance with
      delegation enabled, as different production setups can and do achieve it
      through a combination of different means (signing, LSM, code reviews, etc),
      and it's undesirable and infeasible for kernel to enforce any particular way
      of validating trustworthiness of particular process.
      
      The main motivation for this work is a desire to enable containerized BPF
      applications to be used together with user namespaces. This is currently
      impossible, as CAP_BPF, required for BPF subsystem usage, cannot be namespaced
      or sandboxed, as a general rule. E.g., tracing BPF programs, thanks to BPF
      helpers like bpf_probe_read_kernel() and bpf_probe_read_user() can safely read
      arbitrary memory, and it's impossible to ensure that they only read memory of
      processes belonging to any given namespace. This means that it's impossible to
      have a mechanically verifiable namespace-aware CAP_BPF capability, and as such
      another mechanism to allow safe usage of BPF functionality is necessary.
      
      BPF FS delegation mount options and BPF token derived from such BPF FS
      instance is such a mechanism. Kernel makes no assumption about what "trusted"
      constitutes in any particular case, and it's up to specific privileged
      applications and their surrounding infrastructure to decide that. What kernel
      provides is a set of APIs to set up and mount special BPF FS instances and
      derive BPF tokens from them. BPF FS and BPF token are both bound to their
      owning userns and in such a way are constrained inside intended container.
      Users can then pass BPF token FD to privileged bpf() syscall commands, like
      BPF map creation and BPF program loading, to perform such operations without
      having init userns privileges.
      
      This version incorporates feedback and suggestions ([3]) received on v3 of
      this patch set, and instead of allowing to create BPF tokens directly assuming
      capable(CAP_SYS_ADMIN), we instead enhance BPF FS to accept a few new
      delegation mount options. If these options are used and BPF FS itself is
      properly created, set up, and mounted inside the user namespaced container,
      user application is able to derive a BPF token object from BPF FS instance,
      and pass that token to bpf() syscall. As explained in patch #3, BPF token
      itself doesn't grant access to BPF functionality, but instead allows kernel to
      do namespaced capabilities checks (ns_capable() vs capable()) for CAP_BPF,
      CAP_PERFMON, CAP_NET_ADMIN, and CAP_SYS_ADMIN, as applicable. So it forms one
      half of a puzzle and allows container managers and sys admins to have safe and
      flexible configuration options: determining which containers get delegation of
      BPF functionality through BPF FS, and then which applications within such
      containers are allowed to perform bpf() commands, based on namespaced
      capabilities.
      
      Previous attempt at addressing this very same problem ([0]) attempted to
      utilize authoritative LSM approach, but was conclusively rejected by upstream
      LSM maintainers. BPF token concept is not changing anything about LSM
      approach, but can be combined with LSM hooks for very fine-grained security
      policy. Some ideas about making BPF token more convenient to use with LSM (in
      particular custom BPF LSM programs) were briefly described in recent LSF/MM/BPF
      2023 presentation ([1]). E.g., an ability to specify user-provided data
      (context), which in combination with BPF LSM would allow implementing very
      dynamic and fine-grained custom security policies on top of BPF token. In the
      interest of minimizing API surface area and discussions this was relegated to
      follow up patches, as it's not essential to the fundamental concept of
      delegatable BPF token.
      
      It should be noted that BPF token is conceptually quite similar to the idea of
      /dev/bpf device file, proposed by Song a while ago ([2]). The biggest
      difference is the idea of using virtual anon_inode file to hold BPF token and
      allowing multiple independent instances of them, each (potentially) with its
      own set of restrictions. And also, crucially, BPF token approach is not using
      any special stateful task-scoped flags. Instead, bpf() syscall accepts
      token_fd parameters explicitly for each relevant BPF command. This addresses
      main concerns brought up during the /dev/bpf discussion, and fits better with
      overall BPF subsystem design.
      
      This patch set adds a basic minimum of functionality to make BPF token idea
      useful and to discuss API and functionality. Currently only low-level libbpf
      APIs support creating and passing BPF token around, allowing to test kernel
      functionality, but for the most part is not sufficient for real-world
      applications, which typically use high-level libbpf APIs based on `struct
      bpf_object` type. This was done with the intent to limit the size of patch set
      and concentrate on mostly kernel-side changes. All the necessary plumbing for
      libbpf will be sent as a separate follow up patch set once kernel support
      makes it upstream.
      
      Another part that should happen once kernel-side BPF token is established, is
      a set of conventions between applications (e.g., systemd), tools (e.g.,
      bpftool), and libraries (e.g., libbpf) on exposing delegatable BPF FS
      instance(s) at well-defined locations to allow applications to take advantage of
      this in automatic fashion without explicit code changes on BPF application's
      side. But I'd like to postpone this discussion to after BPF token concept
      lands.
      
        [0] https://lore.kernel.org/bpf/20230412043300.360803-1-andrii@kernel.org/
        [1] http://vger.kernel.org/bpfconf2023_material/Trusted_unprivileged_BPF_LSFMM2023.pdf
        [2] https://lore.kernel.org/bpf/20190627201923.2589391-2-songliubraving@fb.com/
        [3] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
      
      v11->v12:
        - enforce exact userns match in bpf_token_capable() and
          bpf_token_allow_cmd() checks, for added strictness (Christian);
      v10->v11:
        - fix BPF FS root check to disallow using bind-mounted subdirectory of BPF
          FS instance (Christian);
        - further restrict BPF_TOKEN_CREATE command to be executed from inside
          exactly the same user namespace as the one used to create BPF FS instance
          (Christian);
      v9->v10:
        - slight adjustments in LSM parts (Paul);
        - setting delegate_xxx options requires capable(CAP_SYS_ADMIN) (Christian);
        - simplify BPF_TOKEN_CREATE UAPI by accepting BPF FS FD directly (Christian);
      v8->v9:
        - fix issue in selftests due to sys/mount.h header (Jiri);
        - fix warning in doc comments in LSM hooks (kernel test robot);
      v7->v8:
        - add bpf_token_allow_cmd and bpf_token_capable hooks (Paul);
        - inline bpf_token_alloc() into bpf_token_create() to prevent accidental
          divergence with security_bpf_token_create() hook (Paul);
      v6->v7:
        - separate patches to refactor bpf_prog_alloc/bpf_map_alloc LSM hooks, as
          discussed with Paul, and now they also accept struct bpf_token;
        - added bpf_token_create/bpf_token_free to allow LSMs (SELinux,
          specifically) to set up security LSM blob (Paul);
        - last patch also wires bpf_security_struct setup by SELinux, similar to how
          it's done for BPF map/prog, though I'm not sure if that's enough, so worst
          case it's easy to drop this patch if more full fledged SELinux
          implementation will be done separately;
        - small fixes for issues caught by code reviews (Jiri, Hou);
        - fix for test_maps test that doesn't use LIBBPF_OPTS() macro (CI);
      v5->v6:
        - fix possible use of uninitialized variable in selftests (CI);
        - don't use anon_inode, instead create one from BPF FS instance (Christian);
        - don't store bpf_token inside struct bpf_map, instead pass it explicitly to
          map_check_btf(). We do store bpf_token inside prog->aux, because it's used
          during verification and even can be checked during attach time for some
          program types;
        - LSM hooks are left intact pending the conclusion of discussion with Paul
          Moore; I'd prefer to do LSM-related changes as a follow up patch set
          anyways;
      v4->v5:
        - add pre-patch unifying CAP_NET_ADMIN handling inside kernel/bpf/syscall.c
          (Paul Moore);
        - fix build warnings and errors in selftests and kernel, detected by CI and
          kernel test robot;
      v3->v4:
        - add delegation mount options to BPF FS;
        - BPF token is derived from the instance of BPF FS and associates itself
          with BPF FS' owning userns;
        - BPF token doesn't grant BPF functionality directly, it just turns
          capable() checks into ns_capable() checks within BPF FS' owning userns;
        - BPF token cannot be pinned;
      v2->v3:
        - make BPF_TOKEN_CREATE pin created BPF token in BPF FS, and disallow
          BPF_OBJ_PIN for BPF token;
      v1->v2:
        - fix build failures on Kconfig with CONFIG_BPF_SYSCALL unset;
        - drop BPF_F_TOKEN_UNKNOWN_* flags and simplify UAPI (Stanislav).
      ====================
      
      Link: https://lore.kernel.org/r/20231130185229.2688956-1-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c35919dc
    • bpf,selinux: allocate bpf_security_struct per BPF token · 36fb9494
      Andrii Nakryiko authored
      Utilize newly added bpf_token_create/bpf_token_free LSM hooks to
      allocate struct bpf_security_struct for each BPF token object in
      SELinux. This just follows similar pattern for BPF prog and map.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-18-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      36fb9494
    • selftests/bpf: add BPF token-enabled tests · dc5196fa
      Andrii Nakryiko authored
      Add a selftest that attempts to conceptually replicate intended BPF
      token use cases inside user namespaced container.
      
      Child process is forked. It is then put into its own userns and mountns.
      Child creates BPF FS context object. This ensures child userns is
      captured as the owning userns for this instance of BPF FS. Given setting
      delegation mount options is privileged operation, we ensure that child
      cannot set them.
      
      This context is passed back to privileged parent process through Unix
      socket, where parent sets up delegation options, creates, and mounts it
      as a detached mount. This mount FD is passed back to the child to be
      used for BPF token creation, which allows otherwise privileged BPF
      operations to succeed inside userns.
      
      We validate that all of token-enabled privileged commands (BPF_BTF_LOAD,
      BPF_MAP_CREATE, and BPF_PROG_LOAD) work as intended. They should only
      succeed inside the userns if a) BPF token is provided with proper
      allowed sets of commands and types; and b) namespaced CAP_BPF and other
      privileges are set. Lacking a) or b) should lead to -EPERM failures.
      
      Based on suggested workflow by Christian Brauner ([0]).
      
        [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/
      
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-17-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      dc5196fa
    • 1571740a
    • libbpf: add BPF token support to bpf_btf_load() API · 1a8df7fa
      Andrii Nakryiko authored
      Allow user to specify token_fd for bpf_btf_load() API that wraps
      kernel's BPF_BTF_LOAD command. This allows loading BTF from unprivileged
      process as long as it has BPF token allowing BPF_BTF_LOAD command, which
      can be created and delegated by privileged process.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-15-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      1a8df7fa
    • libbpf: add BPF token support to bpf_map_create() API · 37891cea
      Andrii Nakryiko authored
      Add ability to provide token_fd for BPF_MAP_CREATE command through
      bpf_map_create() API.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-14-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      37891cea
    • libbpf: add bpf_token_create() API · ecd43514
      Andrii Nakryiko authored
      Add low-level wrapper API for BPF_TOKEN_CREATE command in bpf() syscall.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-13-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ecd43514
    • bpf,lsm: add BPF token LSM hooks · d734ca7b
      Andrii Nakryiko authored
      Wire up bpf_token_create and bpf_token_free LSM hooks, which allow to
      allocate LSM security blob (we add `void *security` field to struct
      bpf_token for that), but also control who can instantiate BPF token.
      This follows existing pattern for BPF map and BPF prog.
      
      Also add security_bpf_token_allow_cmd() and security_bpf_token_capable()
      LSM hooks that allow LSM implementation to control and negate (if
      necessary) BPF token's delegation of a specific bpf_cmd and capability,
      respectively.
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-12-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      d734ca7b
    • bpf,lsm: refactor bpf_map_alloc/bpf_map_free LSM hooks · 66d636d7
      Andrii Nakryiko authored
      Similarly to bpf_prog_alloc LSM hook, rename and extend bpf_map_alloc
      hook into bpf_map_create, taking not just struct bpf_map, but also
      bpf_attr and bpf_token, to give a fuller context to LSMs.
      
      Unlike bpf_prog_alloc, there is no need to move the hook around, as it
      currently is firing right before allocating BPF map ID and FD, which
      seems to be a sweet spot.
      
      But like bpf_prog_alloc/bpf_prog_free combo, make sure that bpf_map_free
      LSM hook is called even if bpf_map_create hook returned error, because if
      several LSMs are combined together it could be that one LSM successfully
      allocated security blob for its needs, while subsequent LSM rejected BPF
      map creation. The former LSM would still need to free up LSM blob, so we
      need to ensure security_bpf_map_free() is called regardless of the
      outcome.
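
      The cleanup rule can be modelled in a few lines. Everything here is a
      hypothetical stand-in (two stacked "LSMs", one blob each, a global
      counter to observe leaks), meant only to show why the free hook has to
      run on the error path as well.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_NR_LSMS 2

/* Hypothetical map object carrying one security blob per stacked LSM. */
struct toy_map {
	void *blob[TOY_NR_LSMS];
};

int toy_live_blobs; /* number of blobs currently allocated */

/*
 * Run each LSM's create hook in order. The LSM with index rejecting_lsm
 * refuses the map; pass -1 for "nobody rejects". Blobs allocated by
 * earlier LSMs stay allocated when a later one rejects.
 */
int toy_security_map_create(struct toy_map *map, int rejecting_lsm)
{
	int i;

	for (i = 0; i < TOY_NR_LSMS; i++) {
		if (i == rejecting_lsm)
			return -1;
		map->blob[i] = malloc(16);
		if (!map->blob[i])
			return -1;
		toy_live_blobs++;
	}
	return 0;
}

/* Release whatever blobs exist; safe to call after a partial create. */
void toy_security_map_free(struct toy_map *map)
{
	int i;

	for (i = 0; i < TOY_NR_LSMS; i++) {
		if (map->blob[i]) {
			free(map->blob[i]);
			map->blob[i] = NULL;
			toy_live_blobs--;
		}
	}
}

/* Map creation path: the free hook runs even when the create hook failed. */
int toy_map_create(struct toy_map *map, int rejecting_lsm)
{
	int err;

	assert(map != NULL);
	err = toy_security_map_create(map, rejecting_lsm);
	if (err)
		toy_security_map_free(map);
	return err;
}
```

      If the second LSM rejects, the first one's blob is still released;
      dropping the call on the error path would leave toy_live_blobs at 1.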
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-11-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      66d636d7
    • bpf,lsm: refactor bpf_prog_alloc/bpf_prog_free LSM hooks · c3dd6e94
      Andrii Nakryiko authored
      Based on upstream discussion ([0]), rework existing
      bpf_prog_alloc_security LSM hook. Rename it to bpf_prog_load and instead
      of passing bpf_prog_aux, pass proper bpf_prog pointer for a full BPF
      program struct. Also, we pass bpf_attr union with all the user-provided
      arguments for BPF_PROG_LOAD command.  This will give LSMs as much
      information as we can basically provide.
      
      The hook is also BPF token-aware now, and optional bpf_token struct is
      passed as a third argument. bpf_prog_load LSM hook is called after
      a bunch of sanity checks were performed, bpf_prog and bpf_prog_aux were
      allocated and filled out, but right before performing full-fledged BPF
      verification step.
      
      bpf_prog_free LSM hook is now accepting struct bpf_prog argument, for
      consistency. SELinux code is adjusted to all new names, types, and
      signatures.
      
      Note, given that bpf_prog_load (previously bpf_prog_alloc) hook can be
      used by some LSMs to allocate extra security blob, but also by other
      LSMs to reject BPF program loading, we need to make sure that
      bpf_prog_free LSM hook is called after bpf_prog_load/bpf_prog_alloc one
      *even* if the hook itself returned error. If we don't do that, we run
      the risk of leaking memory. This seems to be possible today when
      combining SELinux and BPF LSM, as one example, depending on their
      relative ordering.
      
      Also, for BPF LSM setup, add bpf_prog_load and bpf_prog_free to
      sleepable LSM hooks list, as they are both executed in sleepable
      context. Also drop bpf_prog_load hook from untrusted, as there is no
      issue with refcount or anything else anymore, that originally forced us
      to add it to untrusted list in c0c852dd ("bpf: Do not mark certain LSM
      hook arguments as trusted"). We now trigger this hook much later and it
      should not be an issue anymore.
      
        [0] https://lore.kernel.org/bpf/9fe88aef7deabbe87d3fc38c4aea3c69.paul@paul-moore.com/
      
      Acked-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-10-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c3dd6e94
    • bpf: consistently use BPF token throughout BPF verifier logic · 8062fb12
      Andrii Nakryiko authored
      Remove remaining direct queries to perfmon_capable() and bpf_capable()
      in BPF verifier logic and instead use BPF token (if available) to make
      decisions about privileges.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-9-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      8062fb12
    • bpf: take into account BPF token when fetching helper protos · 4cbb270e
      Andrii Nakryiko authored
      Instead of performing unconditional system-wide bpf_capable() and
      perfmon_capable() calls inside bpf_base_func_proto() function (and other
      similar ones) to determine eligibility of a given BPF helper for a given
      program, use previously recorded BPF token during BPF_PROG_LOAD command
      handling to inform the decision.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-8-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4cbb270e
    • bpf: add BPF token support to BPF_PROG_LOAD command · e1cef620
      Andrii Nakryiko authored
      Add basic support of BPF token to BPF_PROG_LOAD. Wire through a set of
      allowed BPF program types and attach types, derived from BPF FS at BPF
      token creation time. Then make sure we perform bpf_token_capable()
      checks everywhere where it's relevant.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-7-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e1cef620
    • bpf: add BPF token support to BPF_BTF_LOAD command · ee54b1a9
      Andrii Nakryiko authored
      Accept BPF token FD in BPF_BTF_LOAD command to allow BTF data loading
      through delegated BPF token. BTF loading is a pretty straightforward
      operation, so as long as BPF token is created with allow_cmds granting
      BPF_BTF_LOAD command, kernel proceeds to parsing BTF data and creating
      BTF object.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-6-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ee54b1a9
    • bpf: add BPF token support to BPF_MAP_CREATE command · 688b7270
      Andrii Nakryiko authored
      Allow providing token_fd for BPF_MAP_CREATE command to allow controlled
      BPF map creation from unprivileged process through delegated BPF token.
      
      Wire through a set of allowed BPF map types to BPF token, derived from
      BPF FS at BPF token creation time. This, in combination with allowed_cmds,
      allows creating a narrowly-focused BPF token (controlled by privileged
      agent) with a restrictive set of BPF maps that application can attempt
      to create.
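
      A toy model of how two such bit sets combine, with made-up command and
      map type numbers and a hypothetical token struct. It only models the
      token side of the decision; the capability checks and the no-token path
      are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical command and map type numbers, not the UAPI values. */
enum toy_cmd { TOY_CMD_MAP_CREATE = 0, TOY_CMD_PROG_LOAD = 1 };
enum toy_map_type { TOY_MAP_HASH = 1, TOY_MAP_ARRAY = 2, TOY_MAP_RINGBUF = 3 };

/* Hypothetical token: one bit per delegated command and per map type. */
struct toy_token {
	uint64_t allowed_cmds;
	uint64_t allowed_maps;
};

/*
 * A map of the given type may be created through the token only if both
 * the MAP_CREATE command bit and the map type bit are set.
 */
bool toy_token_allows_map_create(const struct toy_token *tok,
				 enum toy_map_type type)
{
	assert((unsigned int)type < 64);

	if (tok == NULL)
		return false;
	if (!(tok->allowed_cmds & (1ULL << TOY_CMD_MAP_CREATE)))
		return false;
	return (tok->allowed_maps & (1ULL << type)) != 0;
}
```

      A token delegating only MAP_CREATE plus the array type permits array
      maps and nothing else, which is the narrowly-focused setup the message
      has in mind.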
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-5-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      688b7270
    • Andrii Nakryiko's avatar
      bpf: introduce BPF token object · 4527358b
      Andrii Nakryiko authored
      Add new kind of BPF kernel object, BPF token. BPF token is meant to
      allow delegating privileged BPF functionality, like loading a BPF
      program or creating a BPF map, from privileged process to a *trusted*
      unprivileged process, all while having a good amount of control over which
      privileged operations could be performed using provided BPF token.
      
      This is achieved through mounting BPF FS instance with extra delegation
      mount options, which determine what operations are delegatable, and also
      constraining it to the owning user namespace (as mentioned in the
      previous patch).
      
      BPF token itself is just a derivative from BPF FS and can be created
      through a new bpf() syscall command, BPF_TOKEN_CREATE, which accepts BPF
      FS FD, which can be attained through open() API by opening BPF FS mount
      point. Currently, BPF token "inherits" delegated command, map type,
      prog type, and attach type bit sets from BPF FS as is. In the future,
      having a BPF token as a separate object with its own FD, we can allow
      further restricting BPF token's allowable set of things either at
      creation time or after the fact, allowing the process to guard itself
      further from unintentionally trying to load undesired kinds of BPF
      programs. But for now we keep things simple and just copy bit sets as is.
      
      When BPF token is created from BPF FS mount, we take reference to the
      BPF super block's owning user namespace, and then use that namespace for
      checking all the {CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN, CAP_SYS_ADMIN}
      capabilities that are normally only checked against init userns (using
      capable()), but now we check them using ns_capable() instead (if BPF
      token is provided). See bpf_token_capable() for details.
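
      A minimal model of that decision is sketched below, assuming a process
      is described by two capability sets. All names (`struct proc_model`,
      `M_CAP_*`, `model_capable()`) are hypothetical and the bit values are
      not the kernel's CAP_* numbers; see bpf_token_capable() for the real
      logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical capability bits for this model only. */
enum {
    M_CAP_BPF       = 1u << 0,
    M_CAP_PERFMON   = 1u << 1,
    M_CAP_NET_ADMIN = 1u << 2,
    M_CAP_SYS_ADMIN = 1u << 3,
};

struct proc_model {
    uint32_t caps_init_userns;  /* caps held in the init user namespace */
    uint32_t caps_token_userns; /* caps held in the token's owning userns */
};

/* The requested capability, or CAP_SYS_ADMIN as a fallback, must be set. */
static bool caps_grant(uint32_t caps, uint32_t cap)
{
    return (caps & cap) != 0 || (caps & M_CAP_SYS_ADMIN) != 0;
}

/* Without a token, check against init userns (the capable() case);
 * with a token, check against the token's userns (the ns_capable() case). */
static bool model_capable(const struct proc_model *p, bool has_token,
                          uint32_t cap)
{
    assert(p != NULL);
    return caps_grant(has_token ? p->caps_token_userns
                                : p->caps_init_userns, cap);
}
```

      With caps_token_userns = M_CAP_BPF and no capabilities in init userns,
      the model grants M_CAP_BPF only when a token is provided.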
      
      Such a setup means that a BPF token in itself is not sufficient to grant
      BPF functionality. A user-namespaced process has to *also* have the
      necessary combination of capabilities inside that user namespace. So while
      previously CAP_BPF was useless when granted within a user namespace, now
      it gains a meaning and gives container managers and sys admins
      flexible control over which processes can and need to use BPF
      functionality within the user namespace (i.e., container in practice).
      And BPF FS delegation mount options and derived BPF tokens serve as
      a per-container "flag" that grants the overall ability to use bpf() (plus
      further restricts which parts of the bpf() syscall are treated as
      namespaced).
      
      Note also, BPF_TOKEN_CREATE command itself requires ns_capable(CAP_BPF)
      within the BPF FS owning user namespace, rounding out the ns_capable()
      story of BPF token.
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-4-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4527358b
    • Andrii Nakryiko's avatar
      bpf: add BPF token delegation mount options to BPF FS · 40bba140
      Andrii Nakryiko authored
      Add a few new mount options to BPF FS that specify that a given
      BPF FS instance allows creation of BPF token (added in the next patch),
      and what sort of operations are allowed under BPF token. As such, we get
      4 new mount options, each of which is a bit mask:
        - `delegate_cmds` specifies which bpf() syscall commands are
          allowed with BPF token derived from this BPF FS instance;
        - if BPF_MAP_CREATE command is allowed, `delegate_maps` specifies
          a set of allowable BPF map types that could be created with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_progs` specifies
          a set of allowable BPF program types that could be loaded with BPF token;
        - if BPF_PROG_LOAD command is allowed, `delegate_attachs` specifies
          a set of allowable BPF program attach types for programs loaded with
          BPF token; delegate_progs and delegate_attachs are meant to be used
          together, as full BPF program type is, in general, determined
          through both program type and program attach type.
      
      Currently, these mount options accept the following forms of values:
        - a special value "any", which enables all possible values of a given
          bit set;
        - a numeric value (decimal or hexadecimal, determined by the kernel
          automatically) that specifies a bit mask value directly.

      If a given mount option is specified multiple times, all of its values
      are combined. E.g., `mount -t bpf nodev /path/to/mount -o
      delegate_maps=0x1 -o delegate_maps=0x2` will result in a combined 0x3
      mask.
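
      The value handling described above can be sketched in userspace C as
      follows. This is an assumption-laden illustration, not the kernel's
      option parser: `delegate_mask_add()` is a hypothetical helper, and it
      relies on strtoull() with base 0 for the decimal/hexadecimal detection
      (which also accepts octal with a leading 0).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: fold one option value into an accumulated mask.
 * Returns 0 on success, -1 if the value is neither "any" nor a number. */
static int delegate_mask_add(uint64_t *mask, const char *value)
{
    char *end;
    unsigned long long bits;

    assert(mask != NULL && value != NULL);

    if (strcmp(value, "any") == 0) {
        *mask = ~(uint64_t)0; /* "any" enables every bit */
        return 0;
    }
    if (value[0] == '-')
        return -1; /* strtoull() would silently wrap negative input */

    errno = 0;
    bits = strtoull(value, &end, 0); /* base 0: decimal or 0x-prefixed hex */
    if (errno != 0 || end == value || *end != '\0')
        return -1;

    *mask |= (uint64_t)bits; /* repeated options are OR-ed together */
    return 0;
}
```

      Calling the helper with "0x1" and then "0x2" on a zeroed mask leaves
      0x3 in the mask, mirroring the mount example above.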
      
      Ideally, more convenient (for humans) symbolic form derived from
      corresponding UAPI enums would be accepted (e.g., `-o
      delegate_progs=kprobe|tracepoint`) and I intend to implement this, but
      it requires a bunch of UAPI header churn, so I postponed it until this
      feature lands upstream or at least there is a definite consensus that
      this feature is acceptable and is going to make it, just to minimize
      amount of wasted effort and not increase amount of non-essential code to
      be reviewed.
      
      Attentive reader will notice that BPF FS is now marked as
      FS_USERNS_MOUNT, which theoretically makes it mountable inside non-init
      user namespace as long as the process has sufficient *namespaced*
      capabilities within that user namespace. But in reality we still
      restrict BPF FS to be mountable only by processes with CAP_SYS_ADMIN *in
      init userns* (extra check in bpf_fill_super()). FS_USERNS_MOUNT is added
      to allow creating BPF FS context object (i.e., fsopen("bpf")) from
      inside unprivileged process inside non-init userns, to capture that
      userns as the owning userns. It will still be required to pass this
      context object back to privileged process to instantiate and mount it.
      
      This manipulation is important, because capturing non-init userns as the
      owning userns of BPF FS instance (super block) allows using that userns
      to constrain BPF token to that userns later on (see next patch). So
      creating BPF FS with delegation inside unprivileged userns will restrict
      derived BPF token objects to only "work" inside that intended userns,
      making them scoped to an intended "container". Also, setting these
      delegation options requires capable(CAP_SYS_ADMIN), so unprivileged
      process cannot set this up without involvement of a privileged process.
      
      There is a set of selftests at the end of the patch set that simulates
      this sequence of steps and validates that everything works as intended.
      But careful review is requested to make sure there are no missed gaps in
      the implementation and testing.
      
      This somewhat subtle set of aspects is the result of previous
      discussions ([0]) about various user namespace implications and
      interactions with BPF token functionality and is necessary to contain
      BPF token inside intended user namespace.
      
        [0] https://lore.kernel.org/bpf/20230704-hochverdient-lehne-eeb9eeef785e@brauner/

      Acked-by: Christian Brauner <brauner@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-3-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      40bba140
    • Andrii Nakryiko's avatar
      bpf: align CAP_NET_ADMIN checks with bpf_capable() approach · 909fa05d
      Andrii Nakryiko authored
      Within BPF syscall handling code CAP_NET_ADMIN checks stand out a bit
      compared to CAP_BPF and CAP_PERFMON checks. For the latter, CAP_BPF or
      CAP_PERFMON are checked first, but if they are not set, CAP_SYS_ADMIN
      takes over and grants whatever part of BPF syscall is required.
      
      Similar kind of checks that involve CAP_NET_ADMIN are not so consistent.
      One out of four uses does follow CAP_BPF/CAP_PERFMON model: during
      BPF_PROG_LOAD, if the type of BPF program is "network-related" either
      CAP_NET_ADMIN or CAP_SYS_ADMIN is required to proceed.
      
      But in three other cases CAP_NET_ADMIN is required even if CAP_SYS_ADMIN
      is set:
        - when creating DEVMAP/XSKMAP/CPU_MAP maps;
        - when attaching CGROUP_SKB programs;
        - when handling BPF_PROG_QUERY command.
      
      This patch changes these three cases to follow the BPF_PROG_LOAD
      model, that is, to allow proceeding under either CAP_NET_ADMIN or
      CAP_SYS_ADMIN.
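
      The behavioral difference for these three call sites can be summarized
      with two tiny predicates. This is a simplified model with hypothetical
      function names, where each boolean stands for "the caller holds that
      capability"; it is not the kernel code itself.

```c
#include <stdbool.h>

/* Before: CAP_NET_ADMIN was required even if CAP_SYS_ADMIN was set. */
static bool net_check_before(bool has_net_admin, bool has_sys_admin)
{
    (void)has_sys_admin; /* ignored by the old checks */
    return has_net_admin;
}

/* After: either capability is sufficient, as for BPF_PROG_LOAD. */
static bool net_check_after(bool has_net_admin, bool has_sys_admin)
{
    return has_net_admin || has_sys_admin;
}
```

      The only input for which the two predicates differ is a caller that has
      CAP_SYS_ADMIN but lacks CAP_NET_ADMIN.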
      
      This also makes it cleaner in subsequent BPF token patches to switch
      wholesale to a generic bpf_token_capable(int cap) check, which always
      falls back to CAP_SYS_ADMIN if the requested capability is missing.
      
      Cc: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20231130185229.2688956-2-andrii@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      909fa05d