1. 02 Sep, 2022 2 commits
    • bpf: Support getting tunnel flags · 44c51472
      Shmulik Ladkani authored
      Existing 'bpf_skb_get_tunnel_key' extracts various tunnel parameters
      (id, ttl, tos, local and remote) but does not expose ip_tunnel_info's
      tun_flags to the BPF program.
      
      It makes sense to expose tun_flags to the BPF program.
      
      Assume, for example, multiple GRE tunnels maintained on a single GRE
      interface in collect_md mode. The program expects origins to initiate
      over GRE, but different origins use different GRE characteristics
      (e.g. some prefer to use the GRE checksum, some do not; some pass a
      GRE key, some do not, etc.).
      
      A BPF program getting tun_flags can therefore remember the relevant
      flags (e.g. TUNNEL_CSUM, TUNNEL_SEQ...) for each initiating remote. In
      the reply path, the program can use 'bpf_skb_set_tunnel_key' in order
      to correctly reply to the remote, using similar characteristics, based
      on the stored tunnel flags.
      
      Introduce BPF_F_TUNINFO_FLAGS flag for bpf_skb_get_tunnel_key. If
      specified, 'bpf_tunnel_key->tunnel_flags' is set with the tun_flags.
      
      I decided to use the existing, unused 'tunnel_ext' field as the
      storage for 'tunnel_flags' in order to avoid changing bpf_tunnel_key's
      layout.
      
      Also, the following has been considered during the design:
      
        1. Convert the "interesting" internal TUNNEL_xxx flags back to BPF_F_yyy
           and place into the new 'tunnel_flags' field. This has 2 drawbacks:
      
           - The BPF_F_yyy flags are from *set_tunnel_key* enumeration space,
             e.g. BPF_F_ZERO_CSUM_TX. It is awkward that it is "returned" into
             tunnel_flags from a *get_tunnel_key* call.
           - Not all "interesting" TUNNEL_xxx flags can be mapped to existing
             BPF_F_yyy flags, and it doesn't make sense to create new BPF_F_yyy
             flags just for purposes of the returned tunnel_flags.
      
        2. Place key.tun_flags into 'tunnel_flags' but mask them, keeping only
           "interesting" flags. That's OK, but the drawback is that what's
           "interesting" for my use case might be limiting for other use cases.
      
      Therefore I decided to expose what's in key.tun_flags *as is*, which seems
      most flexible. The BPF user can just choose to ignore bits he's not
      interested in. The TUNNEL_xxx flags are also UAPI, so there is no harm
      in exposing them back in the get_tunnel_key call.
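
      As an illustration, a minimal tc BPF sketch of the get/remember flow
      described above (map, program and section names are hypothetical;
      assumes a kernel and UAPI headers carrying this patch):

        #include <linux/bpf.h>
        #include <linux/if_tunnel.h>    /* UAPI TUNNEL_CSUM, TUNNEL_KEY, ... */
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>

        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __uint(max_entries, 1024);
                __type(key, __u32);     /* remote IPv4 */
                __type(value, __be16);  /* remembered tun_flags */
        } remote_flags SEC(".maps");

        SEC("tc")
        int remember_tun_flags(struct __sk_buff *skb)
        {
                struct bpf_tunnel_key key = {};

                /* With BPF_F_TUNINFO_FLAGS, the helper fills key.tunnel_flags
                 * (aliasing the formerly unused tunnel_ext) with tun_flags. */
                if (bpf_skb_get_tunnel_key(skb, &key, sizeof(key),
                                           BPF_F_TUNINFO_FLAGS))
                        return TC_ACT_OK;

                /* Remember e.g. TUNNEL_CSUM/TUNNEL_SEQ per remote for use
                 * with bpf_skb_set_tunnel_key() on the reply path. */
                bpf_map_update_elem(&remote_flags, &key.remote_ipv4,
                                    &key.tunnel_flags, BPF_ANY);
                return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";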
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220831144010.174110-1-shmulik.ladkani@gmail.com
    • bpf, tnums: Warn against the usage of tnum_in(tnum_range(), ...) · dc84dbbc
      Shung-Hsi Yu authored
      Commit a657182a ("bpf: Don't use tnum_range on array range checking
      for poke descriptors") has shown that using tnum_range() as an
      argument to tnum_in() can lead to misleading code that looks like a
      tight bound check when in fact the actual allowed range is much wider.
      
      Document this behavior to warn against such usage in general, and
      point out scenarios where the result can be trusted.
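
      For instance (a worked sketch using the kernel's tnum API; the values
      follow from tnum_range()'s definition): tnum_range() rounds a range up
      to the nearest power-of-2-aligned tnum, so a containment check can
      pass for values outside the requested range:

        #include <linux/tnum.h>

        static bool demo(void)
        {
                /* tnum_range(0, 100) cannot represent 0..100 exactly; it
                 * returns {.value = 0, .mask = 127}, i.e. the set 0..127. */
                struct tnum r = tnum_range(0, 100);

                /* Looks like a tight "0 <= x <= 100" check, but accepts
                 * e.g. 120. */
                return tnum_in(r, tnum_const(120));     /* true */
        }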
      Signed-off-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/984b37f9fdf7ac36831d2137415a4a915744c1b6.1661462653.git.daniel@iogearbox.net
      Link: https://www.openwall.com/lists/oss-security/2022/08/26/1
      Link: https://lore.kernel.org/bpf/20220831031907.16133-3-shung-hsi.yu@suse.com
      Link: https://lore.kernel.org/bpf/20220831031907.16133-2-shung-hsi.yu@suse.com
  2. 01 Sep, 2022 7 commits
  3. 31 Aug, 2022 7 commits
  4. 30 Aug, 2022 1 commit
  5. 29 Aug, 2022 3 commits
    • selftests/bpf: Fix connect4_prog tcp/socket header type conflict · 2eb68040
      James Hilliard authored
      There is a potential to hit a type conflict when including
      netinet/tcp.h and sys/socket.h; we can replace both of these includes
      with linux/tcp.h and bpf_tcp_helpers.h to avoid the conflict (see the
      sketch after the error log below).
      
      Fixes errors like the below when compiling with the GCC BPF backend:
      
        In file included from /usr/include/netinet/tcp.h:91,
                         from progs/connect4_prog.c:11:
        /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char'
           34 | typedef __INT8_TYPE__ int8_t;
              |                       ^~~~~~
        In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155,
                         from /usr/include/x86_64-linux-gnu/bits/socket.h:29,
                         from /usr/include/x86_64-linux-gnu/sys/socket.h:33,
                         from progs/connect4_prog.c:10:
        /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'}
           24 | typedef __int8_t int8_t;
              |                  ^~~~~~
        /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int'
           43 | typedef __INT64_TYPE__ int64_t;
              |                        ^~~~~~~
        /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'}
           27 | typedef __int64_t int64_t;
              |                   ^~~~~~~
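
      The replacement includes amount to roughly the following (a sketch;
      the conflicting <sys/socket.h> and <netinet/tcp.h> includes are
      dropped):

        /* Avoids dragging in the libc stdint typedefs that clash with the
         * GCC BPF built-in ones. */
        #include <linux/tcp.h>
        #include "bpf_tcp_helpers.h"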
      Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220829154710.3870139-1-james.hilliard1@gmail.com
    • selftests/bpf: Fix bind{4,6} tcp/socket header type conflict · 3721359d
      James Hilliard authored
      There is a potential to hit a type conflict when including
      netinet/tcp.h together with sys/socket.h; we can remove these
      includes, as they are not actually needed.
      
      Fixes errors like the below when compiling with the GCC BPF backend:
      
        In file included from /usr/include/netinet/tcp.h:91,
                         from progs/bind4_prog.c:10:
        /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:34:23: error: conflicting types for 'int8_t'; have 'char'
           34 | typedef __INT8_TYPE__ int8_t;
              |                       ^~~~~~
        In file included from /usr/include/x86_64-linux-gnu/sys/types.h:155,
                         from /usr/include/x86_64-linux-gnu/bits/socket.h:29,
                         from /usr/include/x86_64-linux-gnu/sys/socket.h:33,
                         from progs/bind4_prog.c:9:
        /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:24:18: note: previous declaration of 'int8_t' with type 'int8_t' {aka 'signed char'}
           24 | typedef __int8_t int8_t;
              |                  ^~~~~~
        /home/buildroot/opt/cross/lib/gcc/bpf/13.0.0/include/stdint.h:43:24: error: conflicting types for 'int64_t'; have 'long int'
           43 | typedef __INT64_TYPE__ int64_t;
              |                        ^~~~~~~
        /usr/include/x86_64-linux-gnu/bits/stdint-intn.h:27:19: note: previous declaration of 'int64_t' with type 'int64_t' {aka 'long long int'}
           27 | typedef __int64_t int64_t;
              |                   ^~~~~~~
        make: *** [Makefile:537: /home/buildroot/bpf-next/tools/testing/selftests/bpf/bpf_gcc/bind4_prog.o] Error 1
      Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220826052925.980431-1-james.hilliard1@gmail.com
    • bpf, mips: No need to use min() to get MAX_TAIL_CALL_CNT · bbcf0f55
      Tiezhu Yang authored
      MAX_TAIL_CALL_CNT is 33, so min(MAX_TAIL_CALL_CNT, 0xffff) is always
      MAX_TAIL_CALL_CNT; it is better to use MAX_TAIL_CALL_CNT directly.
      
      At the same time, add BUILD_BUG_ON(MAX_TAIL_CALL_CNT > 0xffff) with a
      comment on why the assertion is there.
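
      Roughly, the change amounts to the following (a sketch, not the exact
      MIPS JIT code):

        /* Before: the clamp is a no-op, since MAX_TAIL_CALL_CNT is 33. */
        u32 tcc = min(MAX_TAIL_CALL_CNT, 0xffff);

        /* After: use the constant directly and make the hidden assumption
         * explicit -- the counter must fit in a 16-bit immediate field. */
        BUILD_BUG_ON(MAX_TAIL_CALL_CNT > 0xffff);
        u32 tcc = MAX_TAIL_CALL_CNT;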
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Suggested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
      Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/1661742309-2320-1-git-send-email-yangtiezhu@loongson.cn
  6. 27 Aug, 2022 2 commits
  7. 26 Aug, 2022 3 commits
  8. 25 Aug, 2022 14 commits
    • bpf: Add CGROUP prefix to cgroup_iter_order · d4ffb6f3
      Hao Luo authored
      bpf_cgroup_iter_order is globally visible, but its entries do not
      have a CGROUP prefix. As requested by Andrii, put CGROUP in the names
      in bpf_cgroup_iter_order, as sketched below.
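
      After the rename, the enum reads roughly as follows (reconstructed
      from the patch; exact comments may differ):

        enum bpf_cgroup_iter_order {
                BPF_CGROUP_ITER_ORDER_UNSPEC = 0,
                BPF_CGROUP_ITER_SELF_ONLY,       /* process only one object */
                BPF_CGROUP_ITER_DESCENDANTS_PRE, /* descendants, pre-order */
                BPF_CGROUP_ITER_DESCENDANTS_POST,/* descendants, post-order */
                BPF_CGROUP_ITER_ANCESTORS_UP,    /* ancestors, upward */
        };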
      
      This patch fixes two previous commits: the one that introduced the
      API and the one that uses the API in a bpf selftest (namely, the
      cgroup_hierarchical_stats selftest).
      
      I tested this patch via the following command:
      
        test_progs -t cgroup,iter,btf_dump
      
      Fixes: d4ccaf58 ("bpf: Introduce cgroup iter")
      Fixes: 88886309 ("selftests/bpf: add a selftest for cgroup hierarchical stats collection")
      Suggested-by: Andrii Nakryiko <andrii@kernel.org>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220825223936.1865810-1-haoluo@google.com
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
    • bpf/scripts: Assert helper enum value is aligned with comment order · 0a0d55ef
      Eyal Birger authored
      The helper value is ABI as defined by enum bpf_func_id.
      As bpf_helper_defs.h is used for the userspace part, it must be consistent
      with this enum.
      
      Before this change, the bpf_doc script used the comment order to set
      the helper values defined in the helpers file.
      
      When adding new helpers, it is very puzzling when the userspace
      application breaks in weird places if a comment is inserted instead
      of appended, because the generated helper ABI is incorrect and
      shifted.
      
      This commit sets the helper value to the enum value.
      
      In addition, it is currently the practice to have the comments
      appended and kept in the same order as the enum. As such, add an
      assertion validating that the comment order is consistent with the
      enum values.
      
      In case a different comment ordering is desired, this assertion can
      be lifted.
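
      To illustrate the failure mode with hypothetical helper names: if a
      comment block is inserted rather than appended, every later helper's
      generated id shifts while enum bpf_func_id (the ABI) does not:

        /* ABI values in enum bpf_func_id:      doc-comment order:
         *   BPF_FUNC_alpha = 1                   1. bpf_alpha()
         *   BPF_FUNC_beta  = 2                   2. bpf_gamma()  <- inserted
         *   BPF_FUNC_gamma = 3                   3. bpf_beta()
         *
         * Comment-order numbering would emit bpf_gamma() with id 2 and
         * bpf_beta() with id 3 into bpf_helper_defs.h -- both wrong at
         * runtime. The new assertion turns this into a build failure. */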
      Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20220824181043.1601429-1-eyal.birger@gmail.com
    • bpftool: Fix a wrong type cast in btf_dumper_int · 7184aef9
      Lam Thai authored
      When `data` points to a boolean value, casting it to `int *` is problematic
      and could lead to a wrong value being passed to `jsonw_bool`. Change the
      cast to `bool *` instead.
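
      The change, approximately (an excerpt-style sketch of the
      BTF_INT_BOOL handling in bpftool's btf_dumper_int(); the wrapper
      function is illustrative):

        static void dump_bool(json_writer_t *jw, const void *data)
        {
                /* The old code did jsonw_bool(jw, *(int *)data): reading 4
                 * bytes for a 1-byte bool can pick up neighboring garbage
                 * bytes. Read exactly one byte instead. */
                jsonw_bool(jw, *(bool *)data);
        }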
      
      Fixes: b12d6ec0 ("bpf: btf: add btf print functionality")
      Signed-off-by: Lam Thai <lamthai@arista.com>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220824225859.9038-1-lamthai@arista.com
    • Merge branch 'bpf: rstat: cgroup hierarchical' · eef3c3d3
      Alexei Starovoitov authored
      Hao Luo says:
      
      ====================
      
      This patch series allows for using bpf to collect hierarchical cgroup
      stats efficiently by integrating with the rstat framework. The rstat
      framework provides an efficient way to collect cgroup stats percpu and
      propagate them through the cgroup hierarchy.
      
      The stats are exposed to userspace in textual form by reading files in
      bpffs, similar to cgroupfs stats, by using a cgroup_iter program.
      cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
      - walking a cgroup's descendants in pre-order.
      - walking a cgroup's descendants in post-order.
      - walking a cgroup's ancestors.
      - processing only a single object.
      
      When attaching cgroup_iter, one needs to set a cgroup to the iter_link
      created from attaching. This cgroup can be passed either as a file
      descriptor or a cgroup id. That cgroup serves as the starting point of
      the walk.
      
      One can also terminate the walk early by returning 1 from the iter
      program.
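
      From userspace, attaching might look roughly like this with libbpf (a
      sketch; error handling omitted, and using the BPF_CGROUP_ITER_* names
      from the follow-up rename commit above):

        static struct bpf_link *attach_cgroup_iter(struct bpf_program *prog,
                                                   int cgroup_fd)
        {
                union bpf_iter_link_info linfo = {};
                DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);

                linfo.cgroup.cgroup_fd = cgroup_fd; /* or .cgroup.cgroup_id */
                linfo.cgroup.order = BPF_CGROUP_ITER_DESCENDANTS_PRE;
                opts.link_info = &linfo;
                opts.link_info_len = sizeof(linfo);

                return bpf_program__attach_iter(prog, &opts);
        }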
      
      Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
      program is called with cgroup_mutex held.
      
      ** Background on rstat for stats collection **
      (I am using a subscriber analogy that is not commonly used)
      
      The rstat framework maintains a tree of cgroups that have updates and
      which cpus have updates. A subscriber to the rstat framework maintains
      their own stats. The framework is used to tell the subscriber when
      and what to flush, for the most efficient stats propagation. The
      workflow is as follows:
      
      - When a subscriber updates a cgroup on a cpu, it informs the rstat
        framework by calling cgroup_rstat_updated(cgrp, cpu).
      
      - When a subscriber wants to read some stats for a cgroup, it asks
        the rstat framework to initiate a stats flush (propagation) by calling
        cgroup_rstat_flush(cgrp).
      
      - When the rstat framework initiates a flush, it makes callbacks to
        subscribers to aggregate stats on cpus that have updates, and
        propagate updates to their parent.
      
      Currently, the main subscribers to the rstat framework are cgroup
      subsystems (e.g. memory, block). This patch series allows bpf programs to
      become subscribers as well.
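
      A minimal sketch of a bpf subscriber under this series (the kfunc and
      hook names are from the patches; everything else is illustrative):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* kfuncs exposed to bpf by this series */
        extern void cgroup_rstat_updated(struct cgroup *cgrp, int cpu) __ksym;
        extern void cgroup_rstat_flush(struct cgroup *cgrp) __ksym;

        /* (1) a collector calls this right after updating its per-(cgroup,
         * cpu) reading, marking the pair dirty in rstat */
        static __always_inline void mark_updated(struct cgroup *cgrp)
        {
                cgroup_rstat_updated(cgrp, bpf_get_smp_processor_id());
        }

        /* (3) flush callback: attaching fentry to the empty
         * bpf_rstat_flush() hook registers this program; it runs once per
         * dirty (cgroup, cpu) pair during a flush */
        SEC("fentry/bpf_rstat_flush")
        int BPF_PROG(flush_stat, struct cgroup *cgrp, struct cgroup *parent,
                     int cpu)
        {
                /* aggregate this cpu's reading; propagate it to parent */
                return 0;
        }

        /* (2) a reader (e.g. a cgroup_iter program) calls
         * cgroup_rstat_flush(cgrp) before dumping the aggregated value. */

        char _license[] SEC("license") = "GPL";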
      
      Patches in this series are organized as follows:
      * Patches 1-2 introduce cgroup_iter prog, and a selftest.
      * Patches 3-5 allow bpf programs to integrate with rstat by adding the
        necessary hook points and kfunc. A comprehensive selftest that
        demonstrates the entire workflow for using bpf and rstat to
        efficiently collect and output cgroup stats is added.
      ---
      Changelog:
      v8 -> v9:
      - Make UNSPEC (an invalid option) as the default order for cgroup_iter.
      - Use enum for specifying cgroup_iter order, instead of u32.
      - Add BPF_ITER_RESCHED to cgroup_iter.
      - Add cgroup_hierarchical_stats to s390x denylist.
      
      v7 -> v8:
      - Removed the confusing BPF_ITER_DEFAULT (Andrii)
      - s/SELF/SELF_ONLY/g
      - Fixed typo (e.g. outputing) (Andrii)
      - Use "descendants_pre", "descendants_post" etc. instead of "pre",
        "post" (Andrii)
      
      v6 -> v7:
      - Updated commit/comments in cgroup_iter for read() behavior (Yonghong)
      - Extracted BPF_ITER_SELF and other options out of cgroup_iter, so
        that they can be used in other iters. Also renamed them. (Andrii)
      - Supports both cgroup_fd and cgroup_id when specifying target cgroup.
        (Andrii)
      - Avoided using macro for formatting expected output in cgroup_iter
        selftest. (Andrii)
      - Applied 'static' on all vars and functions in cgroup_iter selftest.
        (Andrii)
      - Fixed broken buf reading in cgroup_iter selftest. (Andrii)
      - Switched to use bpf_link__destroy() unconditionally. (Andrii)
      - Removed 'volatile' for non-const global vars in selftests. (Andrii)
      - Started using bpf_core_enum_value() to get memory_cgrp_id. (Andrii)
      
      v5 -> v6:
      - Rebased on bpf-next
      - Tidy up cgroup_hierarchical_stats test (Andrii)
        * 'static' and 'inline'
        * avoid using libbpf_get_error()
        * string literals of cgroup paths.
      - Rename patch 8/8 to 'selftests/bpf' (Yonghong)
      - Fix cgroup_iter comments (e.g. PAGE_SIZE and uapi) (Yonghong)
      - Make sure further read() returns OK after previous read() finished
        properly (Yonghong)
      - Release cgroup_mutex before the last call of show() (Kumar)
      
      v4 -> v5:
      - Rebased on top of new kfunc flags infrastructure, updated patch 1 and
        patch 6 accordingly.
      - Added docs for sleepable kfuncs.
      
      v3 -> v4:
      - cgroup_iter:
        * reorder fields in bpf_link_info to avoid breaking uapi (Yonghong)
        * comment the behavior when cgroup_fd=0 (Yonghong)
        * comment on the limit of number of cgroups supported by cgroup_iter.
          (Yonghong)
      - cgroup_hierarchical_stats selftest:
        * Do not return -1 if stats are not found (causes overflow in userspace).
        * Check if child process failed to join cgroup.
        * Make buf and path arrays in get_cgroup_vmscan_delay() static.
        * Increase the test map sizes to accommodate cgroups that are not
          created by the test.
      
      v2 -> v3:
      - cgroup_iter:
        * Added conditional compilation of cgroup_iter.c in kernel/bpf/Makefile
          (kernel test) and dropped the !CONFIG_CGROUP patch.
        * Added validation of traversal_order when attaching (Yonghong).
        * Fixed previous wording "two modes" to "three modes" (Yonghong).
        * Fixed the btf_dump selftest broken by this patch (Yonghong).
        * Fixed ctx_arg_info[0] to use "PTR_TO_BTF_ID_OR_NULL" instead of
          "PTR_TO_BTF_ID", because the "cgroup" pointer passed to iter prog can
           be null.
      - Use __diag_push to eliminate __weak noinline warning in
        bpf_rstat_flush().
      - cgroup_hierarchical_stats selftest:
        * Added write_cgroup_file_parent() helper.
        * Added error handling for failed map updates.
        * Added null check for cgroup in vmscan_flush.
        * Fixed the signature of vmscan_[start/end].
        * Correctly return error code when attaching trace programs fail.
        * Make sure all links are destroyed correctly and not leaking in
          cgroup_hierarchical_stats selftest.
        * Use memory.reclaim instead of memory.high as a more reliable way to
          invoke reclaim.
        * Eliminated sleeps, the test now runs faster.
      
      v1 -> v2:
      - Redesign of cgroup_iter from v1, based on Alexei's idea [1]:
        * supports walking cgroup subtree.
        * supports walking ancestors of a cgroup. (Andrii)
        * supports terminating the walk early.
        * uses fd instead of cgroup_id as parameter for iter_link. Using fd is
          a convention in bpf.
        * gets cgroup's ref at attach time and deref at detach.
        * brought back cgroup1 support for cgroup_iter.
      - Squashed the patches adding the rstat flush hook points and kfuncs
        (Tejun).
      - Added a comment explaining why bpf_rstat_flush() needs to be weak
        (Tejun).
      - Updated the final selftest with the new cgroup_iter design.
      - Changed CHECKs in the selftest with ASSERTs (Yonghong, Andrii).
      - Removed empty line at the end of the selftest (Yonghong).
      - Renamed test files to cgroup_hierarchical_stats.c.
      - Reordered CGROUP_PATH params order to match struct declaration
        in the selftest (Michal).
      - Removed memory_subsys_enabled() and made sure memcg controller
        enablement checks make sense and are documented (Michal).
      
      RFC v2 -> v1:
      - Instead of introducing a new program type for rstat flushing, add an
        empty hook point, bpf_rstat_flush(), and use fentry bpf programs to
        attach to it and flush bpf stats.
      - Instead of using helpers, use kfuncs for rstat functions.
      - These changes simplify the patchset greatly, with minimal changes to
        uapi.
      
      RFC v1 -> RFC v2:
      - Instead of rstat flush programs attach to subsystems, they now attach
        to rstat (global flushers, not per-subsystem), based on discussions
        with Tejun. The first patch is entirely rewritten.
      - Pass cgroup pointers to rstat flushers instead of cgroup ids. This
        gives much more flexibility and is less likely to need a uapi update
        later.
      - rstat helpers are now only defined if CONFIG_CGROUP.
      - Most of the code is now only defined if CONFIG_CGROUP and
        CONFIG_BPF_SYSCALL.
      - Move rstat helper protos from bpf_base_func_proto() to
        tracing_prog_func_proto().
      - rstat helpers argument (cgroup pointer) is now ARG_PTR_TO_BTF_ID, not
        ARG_ANYTHING.
      - Rewrote the selftest to use the cgroup helpers.
      - Dropped bpf_map_lookup_percpu_elem (already added by Feng).
      - Dropped patch to support cgroup v1 for cgroup_iter.
      - Dropped patch to define some cgroup_put() when !CONFIG_CGROUP. The
        code that calls it is no longer compiled when !CONFIG_CGROUP.
      
      cgroup_iter was originally introduced in a different patch series[2].
      Hao and I agreed that it fits better as part of this series.
      RFC v1 of this patch series had the following changes from [2]:
      - Getting the cgroup's reference at the time of attaching, instead of
        at the time of iterating. (Yonghong)
      - Remove .init_seq_private and .fini_seq_private callbacks for
        cgroup_iter. They are not needed now. (Yonghong)
      
      [1] https://lore.kernel.org/bpf/20220520221919.jnqgv52k4ajlgzcl@MBP-98dd607d3435.dhcp.thefacebook.com/
      [2] https://lore.kernel.org/lkml/20220225234339.2386398-9-haoluo@google.com/
      
      Hao Luo (2):
        bpf: Introduce cgroup iter
        selftests/bpf: Test cgroup_iter.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: add a selftest for cgroup hierarchical stats collection · 88886309
      Yosry Ahmed authored
      Add a selftest that tests the whole workflow for collecting,
      aggregating (flushing), and displaying cgroup hierarchical stats.
      
      TL;DR:
      - Userspace program creates a cgroup hierarchy and induces memcg reclaim
        in parts of it.
      - Whenever reclaim happens, vmscan_start and vmscan_end update
        per-cgroup percpu readings, and tell rstat which (cgroup, cpu) pairs
        have updates.
      - When userspace tries to read the stats, vmscan_dump calls rstat to flush
        the stats, and outputs the stats in text format to userspace (similar
        to cgroupfs stats).
      - rstat calls vmscan_flush once for every (cgroup, cpu) pair that has
        updates, vmscan_flush aggregates cpu readings and propagates updates
        to parents.
      - Userspace program makes sure the stats are aggregated and read
        correctly.
      
      Detailed explanation:
      - The test loads tracing bpf programs, vmscan_start and vmscan_end, to
        measure the latency of cgroup reclaim. Per-cgroup readings are stored in
        percpu maps for efficiency. When a cgroup reading is updated on a cpu,
        cgroup_rstat_updated(cgroup, cpu) is called to add the cgroup to the
        rstat updated tree on that cpu.
      
      - A cgroup_iter program, vmscan_dump, is loaded and pinned to a file, for
        each cgroup. Reading this file invokes the program, which calls
        cgroup_rstat_flush(cgroup) to ask rstat to propagate the updates for all
        cpus and cgroups that have updates in this cgroup's subtree. Afterwards,
        the stats are exposed to the user. vmscan_dump returns 1 to terminate
        iteration early, so that we only expose stats for one cgroup per read.
      
      - An ftrace program, vmscan_flush, is also loaded and attached to
        bpf_rstat_flush. When rstat flushing is ongoing, vmscan_flush is invoked
        once for each (cgroup, cpu) pair that has updates. cgroups are popped
        from the rstat tree in a bottom-up fashion, so calls will always be
        made for cgroups that have updates before their parents. The program
        aggregates percpu readings to a total per-cgroup reading, and also
        propagates them to the parent cgroup. After rstat flushing is over, all
        cgroups will have correct updated hierarchical readings (including all
        cpus and all their descendants).
      
      - Finally, the test creates a cgroup hierarchy and induces memcg reclaim
        in parts of it, and makes sure that the stats collection, aggregation,
        and reading workflow works as expected.
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-6-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: extend cgroup helpers · 434992bb
      Yosry Ahmed authored
      This patch extends the bpf selftest cgroup_helpers in various ways (a
      usage sketch follows the list):
      - Add enable_controllers() that allows tests to enable all or a
        subset of controllers for a specific cgroup.
      - Add join_cgroup_parent(). The cgroup workdir is based on the pid,
        therefore a spawned child cannot join the same cgroup hierarchy of the
        test through join_cgroup(). join_cgroup_parent() is used in child
        processes to join a cgroup under the parent's workdir.
      - Add write_cgroup_file() and write_cgroup_file_parent() (similar to
        join_cgroup_parent() above).
      - Add get_root_cgroup() for tests that need to do checks on the root cgroup.
      - Distinguish relative and absolute cgroup paths in function arguments.
        Now relative paths are called relative_path, and absolute paths are
        called cgroup_path.
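
      For instance, usage might look like this (a sketch; the signatures
      are assumptions based on the description above and may differ from
      cgroup_helpers.h):

        #include "cgroup_helpers.h"

        static int parent_setup(void)
        {
                /* create the pid-based cgroup workdir for this test */
                if (setup_cgroup_environment())
                        return -1;
                /* enable the memory controller at the workdir root */
                if (enable_controllers("", "memory"))
                        return -1;
                /* write to a control file under the workdir */
                return write_cgroup_file("/child", "memory.reclaim", "1M");
        }

        static void child_join(void)
        {
                /* the workdir path embeds the parent's pid, so a forked
                 * child joins through the parent-aware variant */
                join_cgroup_parent("/child");
        }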
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-5-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • cgroup: bpf: enable bpf programs to integrate with rstat · a319185b
      Yosry Ahmed authored
      Enable bpf programs to make use of rstat to collect cgroup hierarchical
      stats efficiently:
      - Add cgroup_rstat_updated() kfunc, for bpf progs that collect stats.
      - Add cgroup_rstat_flush() sleepable kfunc, for bpf progs that read stats.
      - Add an empty bpf_rstat_flush() hook that is called during rstat
        flushing, for bpf progs that flush stats to attach to. Attaching a bpf
        prog to this hook effectively registers it as a flush callback.
      Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-4-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test cgroup_iter. · fe0dd9d4
      Hao Luo authored
      Add a selftest for cgroup_iter. The selftest creates a mini cgroup tree
      of the following structure:
      
          ROOT (working cgroup)
           |
         PARENT
        /      \
      CHILD1  CHILD2
      
      and tests the following scenarios:
      
       - invalid cgroup fd.
       - pre-order walk over descendants from PARENT.
       - post-order walk over descendants from PARENT.
       - walk of ancestors from PARENT.
       - process only a single object (i.e. PARENT).
       - early termination.
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-3-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Introduce cgroup iter · d4ccaf58
      Hao Luo authored
      Cgroup_iter is a type of bpf_iter. It walks over cgroups in four modes:
      
       - walking a cgroup's descendants in pre-order.
       - walking a cgroup's descendants in post-order.
       - walking a cgroup's ancestors.
       - processing only the given cgroup.
      
      When attaching cgroup_iter, one can set a cgroup to the iter_link
      created from attaching. This cgroup is passed as a file descriptor
      or cgroup id and serves as the starting point of the walk. If no
      cgroup is specified, the starting point will be the cgroup v2 root.
      
      For walking descendants, one can specify the order: either pre-order or
      post-order. For walking ancestors, the walk starts at the specified
      cgroup and ends at the root.
      
      One can also terminate the walk early by returning 1 from the iter
      program.
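
      On the bpf side, a cgroup_iter program might look roughly like this
      (a sketch modeled on the selftest; field access relies on vmlinux.h
      and CO-RE):

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        char _license[] SEC("license") = "GPL";

        SEC("iter/cgroup")
        int dump_cgroup_id(struct bpf_iter__cgroup *ctx)
        {
                struct seq_file *seq = ctx->meta->seq;
                struct cgroup *cgrp = ctx->cgroup;

                /* cgrp is NULL on the final call after the walk completes */
                if (!cgrp)
                        return 0;

                BPF_SEQ_PRINTF(seq, "cgroup id: %llu\n", cgrp->kn->id);
                return 0;       /* return 1 to terminate the walk early */
        }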
      
      Note that because walking cgroup hierarchy holds cgroup_mutex, the iter
      program is called with cgroup_mutex held.
      
      Currently only one session is supported, which means that, depending
      on the volume of data the bpf program intends to send to user space,
      the number of cgroups that can be walked is limited. For example,
      given that the current buffer size is 8 * PAGE_SIZE, if the program
      sends 64B of data for each cgroup, assuming PAGE_SIZE is 4KB, the
      total number of cgroups that can be walked is 512. This is a
      limitation of cgroup_iter. If the output data is larger than the
      kernel buffer size, then after all data in the kernel buffer is
      consumed by user space, the subsequent read() syscall will signal
      EOPNOTSUPP. To work around this, the user may have to update their
      program to reduce the volume of data sent to output, for example by
      skipping some uninteresting cgroups. In the future, we may extend
      bpf_iter flags to allow customizing the buffer size.
      Acked-by: Yonghong Song <yhs@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/r/20220824233117.1312810-2-haoluo@google.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Fix wrong size passed to bpf_setsockopt() · 7e165d19
      Yang Yingliang authored
      sizeof(new_cc) is not the real size of the memory that new_cc points
      to; introduce new_cc_len to store the size and then pass it to
      bpf_setsockopt(), as sketched below.
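
      The bug pattern, roughly (hypothetical surrounding code; only the
      length argument matters):

        static int set_cc(struct bpf_sock *sk)
        {
                char cc[16] = "cubic";
                char *new_cc = cc;
                int new_cc_len = sizeof(cc);

                /* Buggy: sizeof(new_cc) is the pointer size (8 on 64-bit),
                 * not the size of the buffer it points to:
                 *
                 *   bpf_setsockopt(sk, IPPROTO_TCP, TCP_CONGESTION,
                 *                  new_cc, sizeof(new_cc));
                 */

                /* Fixed: pass the real buffer length. */
                return bpf_setsockopt(sk, IPPROTO_TCP, TCP_CONGESTION,
                                      new_cc, new_cc_len);
        }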
      
      Fixes: 31123c03 ("selftests/bpf: bpf_setsockopt tests")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220824013907.380448-1-yangyingliang@huawei.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add cb_refs test to s390x deny list · b03914f7
      Daniel Müller authored
      The cb_refs BPF selftest is failing execution on s390x machines. This is
      a newly added test that requires a feature not presently supported on
      this architecture.
      
      Denylist the test for this architecture.
      
      Fixes: 3cf7e7d8685c ("selftests/bpf: Add tests for reference state fixes for callbacks")
      Signed-off-by: Daniel Müller <deso@posteo.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20220824163906.1186832-1-deso@posteo.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'Fix reference state management for synchronous callbacks' · 09683080
      Alexei Starovoitov authored
      Kumar Kartikeya Dwivedi says:
      
      ====================
      
      These are patches 1 and 2, plus their individual tests, split into a
      separate series from the RFC, so that they can be taken in while we
      continue working toward a fix for handling stack access inside the
      callback.
      
      Changelog:
      ----------
      v1 -> v2:
      v1: https://lore.kernel.org/bpf/20220822131923.21476-1-memxor@gmail.com
      
        * Fix error for test_progs-no_alu32 due to distinct alloc_insn in errstr
      
      RFC v1 -> v1:
      RFC v1: https://lore.kernel.org/bpf/20220815051540.18791-1-memxor@gmail.com
      
        * Fix up commit log to add more explanation (Alexei)
        * Split reference state fix out into a separate series
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add tests for reference state fixes for callbacks · 35f14dbd
      Kumar Kartikeya Dwivedi authored
      These are regression tests to ensure we don't end up in invalid runtime
      state for helpers that execute callbacks multiple times. It exercises
      the fixes to verifier callback handling for reference state in previous
      patches.
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220823013226.24988-1-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Fix reference state management for synchronous callbacks · 9d9d00ac
      Kumar Kartikeya Dwivedi authored
      Currently, the verifier verifies callback functions (sync and async)
      as if they will be executed once, i.e. it explores the execution
      state as if the function was being called once. The next insn to
      explore is set to the start of the subprog, and the exit from the
      nested frame is handled using curframe > 0 and prepare_func_exit. In
      the case of an async callback, it uses a
      customized variant of push_stack simulating a kind of branch to set up
      custom state and execution context for the async callback.
      
      While this approach is simple and works when the callback really will
      be executed only once, it is unsafe for all of our current helpers,
      which are for_each style, i.e. they execute the callback multiple
      times.
      
      A callback releasing acquired references of the caller may do so
      multiple times, but currently verifier sees it as one call inside the
      frame, which then returns to caller. Hence, it thinks it released some
      reference that the cb e.g. got access through callback_ctx (register
      filled inside cb from spilled typed register on stack).
      
      Similarly, it may see that an acquire call is unpaired inside the
      callback, so the caller will copy the reference state of the callback and
      then will have to release the register with new ref_obj_ids. But again,
      the callback may execute multiple times, but the verifier will only
      account for acquired references for a single symbolic execution of the
      callback, which will cause leaks.
      
      Note that for async callback case, things are different. While currently
      we have bpf_timer_set_callback which only executes it once, even for
      multiple executions it would be safe, as reference state is NULL and
      check_reference_leak would force program to release state before
      BPF_EXIT. The state is also unaffected by analysis for the caller frame.
      Hence async callback is safe.
      
      Since we want the reference state to be accessible, e.g. for pointers
      loaded from stack through callback_ctx's PTR_TO_STACK, we still have to
      copy caller's reference_state to callback's bpf_func_state, but we
      enforce that whatever references it adds to that reference_state has
      been released before it hits BPF_EXIT. This requires introducing a new
      callback_ref member in the reference state to distinguish between caller
      vs callee references. Hence, check_reference_leak now errors out if it
      sees we are in callback_fn and we have not released callback_ref refs.
      Since there can be multiple nested callbacks, like frame 0 -> cb1 -> cb2
      etc. we need to also distinguish between whether this particular ref
      belongs to this callback frame or parent, and only error for our own, so
      we store state->frameno (which is always non-zero for callbacks).
      
      In short, callbacks can read parent reference_state, but cannot mutate
      it, to be able to use pointers acquired by the caller. They must only
      undo their changes (by releasing their own acquired_refs before
      BPF_EXIT) on top of caller reference_state before returning (at which
      point the caller and callback state will match anyway, so no need to
      copy it back to caller).
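
      A sketch of the kind of program the fix now rejects (foo_acquire()/
      foo_release() are hypothetical acquire/release kfuncs; the real
      selftests use test-only kfuncs):

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, 8);
                __type(key, __u32);
                __type(value, __u64);
        } arr_map SEC(".maps");

        extern struct foo *foo_acquire(void) __ksym;
        extern void foo_release(struct foo *p) __ksym;

        struct cb_ctx { struct foo *ref; };

        static int cb(struct bpf_map *map, __u32 *key, __u64 *val, void *data)
        {
                struct cb_ctx *c = data;

                /* Releasing a caller-acquired reference here is unsound:
                 * the verifier used to simulate this body exactly once,
                 * while bpf_for_each_map_elem() may invoke it 0..N times. */
                if (c->ref) {
                        foo_release(c->ref);
                        c->ref = NULL;
                }
                return 0;
        }

        SEC("tc")
        int prog(struct __sk_buff *skb)
        {
                struct cb_ctx c = { .ref = foo_acquire() };

                bpf_for_each_map_elem(&arr_map, cb, &c, 0);
                if (c.ref)
                        foo_release(c.ref);
                return 0;
        }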
      
      Fixes: 69c087ba ("bpf: Add bpf_for_each_map_elem() helper")
      Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
      Link: https://lore.kernel.org/r/20220823013125.24938-1-memxor@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  9. 23 Aug, 2022 1 commit