1. 15 Jan, 2020 9 commits
  2. 14 Jan, 2020 9 commits
  3. 10 Jan, 2020 14 commits
  4. 09 Jan, 2020 8 commits
    • bpf: Document BPF_F_QUERY_EFFECTIVE flag · f5bfcd95
      Andrey Ignatov authored
      Document BPF_F_QUERY_EFFECTIVE flag, mostly to clarify how it affects
      attach_flags, which may not be obvious and may lead to confusion.
      
      Specifically, attach_flags is returned only for target_fd, but if programs
      are inherited from an ancestor cgroup then the attach_flags returned for
      the current cgroup may be confusing. For example, two effective programs
      of the same attach_type can be returned but without BPF_F_ALLOW_MULTI in
      attach_flags.
      
      Simple repro:
        # bpftool c s /sys/fs/cgroup/path/to/task
        ID       AttachType      AttachFlags     Name
        # bpftool c s /sys/fs/cgroup/path/to/task effective
        ID       AttachType      AttachFlags     Name
        95043    ingress                         tw_ipt_ingress
        95048    ingress                         tw_ingress
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200108014006.938363-1-rdna@fb.com
    • Merge branch 'tcp-bpf-cc' · 417759f7
      Alexei Starovoitov authored
      Martin Lau says:
      
      ====================
      This series introduces BPF STRUCT_OPS.  It is an infra to allow
      implementing specific kernel function pointers in BPF.
      The first use case included in this series is to implement
      a TCP congestion control algorithm in BPF  (i.e. implement
      struct tcp_congestion_ops in BPF).
      
      There have been attempts to move the TCP CC to user space
      (e.g. CCP in TCP).  The common arguments are faster turnaround,
      getting away from long-tail kernel versions in production, etc.,
      which are legitimate points.
      
      BPF has been a continuous effort to join the upsides of both kernel
      and userspace (e.g. XDP to gain the performance
      advantage without bypassing the kernel).  The recent BPF
      advancements (in particular the BTF-aware verifier, BPF trampoline,
      BPF CO-RE...) have made implementing kernel struct ops (e.g. tcp cc)
      possible in BPF.
      
      The idea is to allow implementing tcp_congestion_ops in bpf.
      It allows a faster turnaround for testing algorithms in
      production while leveraging the existing (and continually growing) BPF
      features/framework instead of building one specifically for
      userspace TCP CC.
      
      Please see the individual patches for details.
      
      The bpftool support will be posted in follow-up patches.
      
      v4:
      - Expose tcp_ca_find() to tcp.h in patch 7.
        It is used to check that the same bpf-tcp-cc
        does not already exist, to guarantee that register()
        will succeed.
      - Call set_memory_ro() and then set_memory_x() only after all
        trampolines are written to the image in patch 6. (Daniel)
        The spinlock is replaced by a mutex because set_memory_*
        requires a sleepable context.
      
      v3:
      - Fix kbuild error by considering CONFIG_BPF_SYSCALL (kbuild)
      - Support anonymous bitfield in patch 4 (Andrii, Yonghong)
      - Push boundary safety check to a specific arch's trampoline function
        (in patch 6) (Yonghong).
        Reuse the WARN_ON_ONCE check in arch_prepare_bpf_trampoline() in x86.
      - Check module field is 0 in udata in patch 6 (Yonghong)
      - Check zero holes in patch 6 (Andrii)
      - s/_btf_vmlinux/btf/ in patch 5 and 7 (Andrii)
      - s/check_xxx/is_xxx/ in patch 7 (Andrii)
      - Use "struct_ops/" convention in patch 11 (Andrii)
      - Use the skel instead of bpf_object in patch 11 (Andrii)
      - libbpf: Decide BPF_PROG_TYPE_STRUCT_OPS at open phase by using
                find_sec_def()
      - libbpf: Avoid a debug message at open phase (Andrii)
      - libbpf: Add bpf_program__(is|set)_struct_ops() for consistency (Andrii)
      - libbpf: Add "struct_ops" to section_defs (Andrii)
      - libbpf: Some code shuffling in init_kern_struct_ops() (Andrii)
      - libbpf: A few safety checks (Andrii)
      
      v2:
      - Dropped cubic for now.  It will be reposted
        once there is more clarity on "jiffies" on both
        the bpf side (about the helper) and
        the tcp_cubic side (some of the jiffies usages are being replaced
        by tp->tcp_mstamp)
      - Remove unnecessary check on bitfield support from btf_struct_access()
        (Yonghong)
      - BTF_TYPE_EMIT macro (Yonghong, Andrii)
      - value_name's length check to avoid an unlikely
        type match during truncation case (Yonghong)
      - BUILD_BUG_ON to ensure no trampoline-image overrun
        in the future (Yonghong)
      - Simplify get_next_key() (Yonghong)
      - Added comment to explain how to check mandatory
        func ptr in net/ipv4/bpf_tcp_ca.c (Yonghong)
      - Rename "__bpf_" to "bpf_struct_ops_" for value prefix (Andrii)
      - Add comment to highlight the bpf_dctcp.c is not necessarily
        the same as tcp_dctcp.c. (Alexei, Eric)
      - libbpf: Rename "struct_ops" to ".struct_ops" for elf sec (Andrii)
      - libbpf: Expose struct_ops as a bpf_map (Andrii)
      - libbpf: Support multiple struct_ops in SEC(".struct_ops") (Andrii)
      - libbpf: Add bpf_map__attach_struct_ops()  (Andrii)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add bpf_dctcp example · 09903869
      Martin KaFai Lau authored
      This patch adds a bpf_dctcp example.  It currently does not do
      no-ECN fallback but the same could be done through the cgrp2-bpf.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200109003517.3856825-1-kafai@fb.com
    • bpf: libbpf: Add STRUCT_OPS support · 590a0088
      Martin KaFai Lau authored
      This patch adds BPF STRUCT_OPS support to libbpf.
      
      The only sec_name convention is SEC(".struct_ops") to identify the
      struct_ops implemented in BPF,
      e.g. to implement a tcp_congestion_ops:
      
      SEC(".struct_ops")
      struct tcp_congestion_ops dctcp = {
      	.init           = (void *)dctcp_init,  /* <-- a bpf_prog */
      	/* ... some more func ptrs ... */
      	.name           = "bpf_dctcp",
      };
      
      Each struct_ops is defined as a global variable under SEC(".struct_ops")
      as above.  libbpf creates a map for each variable and the variable name
      is the map's name.  Multiple struct_ops are supported under
      SEC(".struct_ops").
      
      In the bpf_object__open phase, libbpf will look for the SEC(".struct_ops")
      section and find out which btf-type the struct_ops is
      implementing.  Note that the btf-type here refers to
      a type in the bpf_prog.o's btf.  A "struct bpf_map" is added
      by bpf_object__add_map(), as is done for other maps.  It will then
      collect (through SHT_REL) the locations of the bpf progs that the
      func ptrs refer to.  No btf_vmlinux is needed in
      the open phase.
      
      In the bpf_object__load phase, the map-fields, which depend
      on the btf_vmlinux, are initialized (in bpf_map__init_kern_struct_ops()).
      It will also set the prog->type, prog->attach_btf_id, and
      prog->expected_attach_type.  Thus, the prog's properties do
      not rely on its section name.
      [ Currently, the bpf_prog's btf-type ==> btf_vmlinux's btf-type matching
        process is as simple as: member-name match + btf-kind match + size match.
        If these matching conditions fail, libbpf will reject.
        The currently targeted struct is "struct tcp_congestion_ops", most
        of whose members are function pointers.
        The member ordering of the bpf_prog's btf-type can be different from
        the btf_vmlinux's btf-type. ]
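      
      The matching rule in the note above can be sketched in plain C.  This is
      only a model: "struct member_model", "enum kind_model" and both functions
      are hypothetical names invented for illustration, not libbpf code, and
      real BTF members carry far more information than the three fields used
      here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified member description (hypothetical; real BTF has more). */
enum kind_model { KIND_INT, KIND_PTR, KIND_ARRAY };

struct member_model {
	const char *name;
	enum kind_model kind;
	size_t size;
};

/* Find a kernel-side member by name; member order is irrelevant. */
static const struct member_model *
find_member(const struct member_model *kern, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(kern[i].name, name) == 0)
			return &kern[i];
	return NULL;
}

/* True if every prog-side member has a kernel-side member with the
 * same name, the same kind, and the same size. */
static bool members_match(const struct member_model *prog, size_t nprog,
			  const struct member_model *kern, size_t nkern)
{
	assert(prog != NULL && kern != NULL);
	for (size_t i = 0; i < nprog; i++) {
		const struct member_model *k =
			find_member(kern, nkern, prog[i].name);

		if (!k || k->kind != prog[i].kind || k->size != prog[i].size)
			return false;
	}
	return true;
}
```

      Because the lookup is by name, a prog-side struct that lists "name"
      before "init" still matches a kernel-side struct that lists them in the
      opposite order, while a size mismatch on any one member rejects the
      whole struct.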
      
      Then, all obj->maps are created as usual (in bpf_object__create_maps()).
      
      Once the maps are created and prog's properties are all set,
      the libbpf will proceed to load all the progs.
      
      bpf_map__attach_struct_ops() is added to register a struct_ops
      map to a kernel subsystem.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200109003514.3856730-1-kafai@fb.com
    • bpf: Synch uapi bpf.h to tools/ · 17328d61
      Martin KaFai Lau authored
      This patch syncs the uapi bpf.h to tools/.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200109003512.3856559-1-kafai@fb.com
    • bpf: Add BPF_FUNC_tcp_send_ack helper · 206057fe
      Martin KaFai Lau authored
      Add a helper to send out a tcp-ack.  It will be used in the later
      bpf_dctcp implementation, which needs to send out an ack
      when the CE state changes.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109004551.3900448-1-kafai@fb.com
    • bpf: tcp: Support tcp_congestion_ops in bpf · 0baf26b0
      Martin KaFai Lau authored
      This patch makes "struct tcp_congestion_ops" the first user
      of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
      in bpf.
      
      The BPF implemented tcp_congestion_ops can be used like a
      regular kernel tcp-cc through sysctl and setsockopt.  e.g.
      [root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
      net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
      net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
      net.ipv4.tcp_congestion_control = bpf_cubic
      
      There have been attempts to move the TCP CC to user space
      (e.g. CCP in TCP).  The common arguments are faster turnaround,
      getting away from long-tail kernel versions in production, etc.,
      which are legitimate points.
      
      BPF has been a continuous effort to join the upsides of both kernel
      and userspace (e.g. XDP to gain the performance
      advantage without bypassing the kernel).  The recent BPF
      advancements (in particular the BTF-aware verifier, BPF trampoline,
      BPF CO-RE...) have made implementing kernel struct ops (e.g. tcp cc)
      possible in BPF.  It allows a faster turnaround for testing algorithms
      in production while leveraging the existing (and continually growing)
      BPF features/framework instead of building one specifically for
      userspace TCP CC.
      
      This patch allows write access to a few fields in tcp-sock
      (in bpf_tcp_ca_btf_struct_access()).
      
      The optional "get_info" is unsupported now.  It can be added
      later.  One possible way is to output the info with a btf-id
      to describe the content.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
    • bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS · 85d33df3
      Martin KaFai Lau authored
      The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
      is a kernel struct with its func ptrs implemented in bpf progs.
      This new map is the interface to register/unregister/introspect
      a bpf implemented kernel struct.
      
      The kernel struct is actually embedded inside another new struct
      (called the "value" struct in the code).  For example,
      "struct tcp_congestion_ops" is embedded in:
      struct bpf_struct_ops_tcp_congestion_ops {
      	refcount_t refcnt;
      	enum bpf_struct_ops_state state;
      	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
      };
      The map value is "struct bpf_struct_ops_tcp_congestion_ops".
      The "bpftool map dump" will then be able to show the
      state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
      number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
      is created automatically by a macro.  Having a separate "value" struct
      will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
      "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
      initialization work before registering the struct_ops to the kernel
      subsystem).  The libbpf will take care of finding and populating the
      "struct bpf_struct_ops_XYZ" from "struct XYZ".
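      
      As a rough user-space illustration of a macro-generated "value" struct,
      the sketch below wraps a stand-in ops struct.  Everything in it is an
      assumption made for illustration: "struct demo_ops", "enum demo_state",
      DEFINE_VALUE_STRUCT() and value_of() are hypothetical names, and a plain
      int stands in for the kernel's refcount_t.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a kernel subsystem struct such as tcp_congestion_ops. */
struct demo_ops {
	void (*init)(void *sk);
	char name[16];
};

/* Stand-in for enum bpf_struct_ops_state. */
enum demo_state { DEMO_INIT, DEMO_INUSE, DEMO_TOBEFREE };

/* Hypothetical macro: emit the "value" struct that wraps struct <name>.
 * A plain int replaces the kernel's refcount_t in this model. */
#define DEFINE_VALUE_STRUCT(name)		\
	struct value_##name {			\
		int refcnt;			\
		enum demo_state state;		\
		struct name data;		\
	}

DEFINE_VALUE_STRUCT(demo_ops);

/* Recover the wrapping value from a pointer to the embedded struct. */
static struct value_demo_ops *value_of(struct demo_ops *ops)
{
	assert(ops != NULL);
	return (struct value_demo_ops *)((char *)ops -
			offsetof(struct value_demo_ops, data));
}
```

      Keeping the bookkeeping fields outside "data" means the subsystem struct
      itself stays unchanged, and a subsystem that only holds a pointer to the
      embedded struct can still reach the refcnt and state around it.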
      
      Register a struct_ops to a kernel subsystem:
      1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
      2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
         set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
         running kernel.
         Instead of reusing the attr->btf_value_type_id,
         btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
         used as the "user" btf which could store other useful sysadmin/debug
         info that may be introduced in the future,
         e.g. creation-date/compiler-details/map-creator...etc.
      3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
         in the running kernel btf.  Populate the value of this object.
         The function ptr should be populated with the prog fds.
      4. Call BPF_MAP_UPDATE with the object created in (3) as
         the map value.  The key is always "0".
      
      During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
      args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
      the specific struct_ops to do some final checks in "st_ops->init_member()"
      (e.g. ensure all mandatory func ptrs are implemented).
      If everything looks good, it will register this kernel struct
      to the kernel subsystem.  The map will not allow further update
      from this point.
      
      Unregister a struct_ops from the kernel subsystem:
      BPF_MAP_DELETE with key "0".
      
      Introspect a struct_ops:
      BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
      have the prog _id_ populated as the func ptr.
      
      The map value state (enum bpf_struct_ops_state) transitions as follows:
      INIT (map created) =>
      INUSE (map updated, i.e. reg) =>
      TOBEFREE (map value deleted, i.e. unreg)
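      
      The one-way lifecycle above can be modelled in a few lines of user-space
      C.  This is a sketch under stated assumptions, not the kernel code: the
      names "enum ops_state", "struct ops_map_model", model_update() and
      model_delete() are hypothetical, and the error codes are chosen for
      illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Models the three states named above.  INUSE == 1 is consistent with
 * the "state": 1 shown by "bpftool map dump" for a registered map. */
enum ops_state { OPS_INIT = 0, OPS_INUSE = 1, OPS_TOBEFREE = 2 };

struct ops_map_model {
	enum ops_state state;
};

/* Model of BPF_MAP_UPDATE: registration is allowed exactly once. */
static int model_update(struct ops_map_model *m)
{
	assert(m != NULL);
	if (m->state != OPS_INIT)
		return -EBUSY;	/* illustrative error code */
	m->state = OPS_INUSE;
	return 0;
}

/* Model of BPF_MAP_DELETE: only a registered map can be unregistered. */
static int model_delete(struct ops_map_model *m)
{
	assert(m != NULL);
	if (m->state != OPS_INUSE)
		return -ENOENT;	/* illustrative error code */
	m->state = OPS_TOBEFREE;
	return 0;
}
```

      In this model a map in TOBEFREE can be neither updated nor deleted
      again, which mirrors the statement that the map does not allow further
      updates once it has been registered.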
      
      The kernel subsystem needs to call bpf_struct_ops_get() and
      bpf_struct_ops_put() to manage the "refcnt" in the
      "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
      for the purpose of tracking the subsystem usage.  Another approach
      is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
      the subsystem's usage by doing map->refcnt - map->usercnt to filter out
      the map-fd/pinned-map usage.  However, that will also tie down the
      future semantics of map->refcnt and map->usercnt.
      
      The very first subsystem's refcnt (during reg()) holds one
      count to map->refcnt.  When the very last subsystem's refcnt
      is gone, it will also release the map->refcnt.  All bpf_prog will be
      freed when the map->refcnt reaches 0 (i.e. during map_free()).
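      
      The relationship between the two counters can be sketched as a small
      user-space model.  It is only an illustration of the rule described
      above: "struct map_model", map_put(), subsys_get() and subsys_put() are
      hypothetical names, plain ints replace the kernel's atomic counters, and
      a boolean stands in for map_free() releasing the progs.

```c
#include <assert.h>
#include <stdbool.h>

struct map_model {
	int map_refcnt;		/* models map->refcnt */
	int subsys_refcnt;	/* models "refcnt" in the value struct */
	bool progs_freed;	/* set when map_refcnt drops to 0 */
};

/* Drop one map reference; the last one frees the progs. */
static void map_put(struct map_model *m)
{
	assert(m->map_refcnt > 0);
	if (--m->map_refcnt == 0)
		m->progs_freed = true;	/* stands in for map_free() */
}

/* Subsystem takes a reference; only the first one pins the map. */
static void subsys_get(struct map_model *m)
{
	if (m->subsys_refcnt++ == 0)
		m->map_refcnt++;
}

/* Subsystem drops a reference; only the last one unpins the map. */
static void subsys_put(struct map_model *m)
{
	assert(m->subsys_refcnt > 0);
	if (--m->subsys_refcnt == 0)
		map_put(m);
}
```

      With one reference held by a map fd, two subsys_get() calls raise
      map_refcnt by only one.  Closing the fd then leaves the progs alive
      until the second subsys_put() releases the last subsystem reference.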
      
      Here is what the bpftool map command output will look like:
      [root@arch-fb-vm1 bpf]# bpftool map show
      6: struct_ops  name dctcp  flags 0x0
      	key 4B  value 256B  max_entries 1  memlock 4096B
      	btf_id 6
      [root@arch-fb-vm1 bpf]# bpftool map dump id 6
      [{
              "value": {
                  "refcnt": {
                      "refs": {
                          "counter": 1
                      }
                  },
                  "state": 1,
                  "data": {
                      "list": {
                          "next": 0,
                          "prev": 0
                      },
                      "key": 0,
                      "flags": 2,
                      "init": 24,
                      "release": 0,
                      "ssthresh": 25,
                      "cong_avoid": 30,
                      "set_state": 27,
                      "cwnd_event": 28,
                      "in_ack_event": 26,
                      "undo_cwnd": 29,
                      "pkts_acked": 0,
                      "min_tso_segs": 0,
                      "sndbuf_expand": 0,
                      "cong_control": 0,
                      "get_info": 0,
                      "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                      ],
                      "owner": 0
                  }
              }
          }
      ]
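      
      The "name" member in the dump is printed as an array of byte values.
      The short sketch below decodes that array back into a string; the helper
      decode_name() and the array name "dumped_name" are hypothetical, and the
      bytes are copied verbatim from the dump above.

```c
#include <assert.h>
#include <stddef.h>

/* The "name" array from the dump above, copied verbatim. */
static const unsigned char dumped_name[16] = {
	98, 112, 102, 95, 100, 99, 116, 99, 112, 0, 0, 0, 0, 0, 0, 0
};

/* Copy bytes up to the first NUL into out and NUL-terminate it. */
static void decode_name(const unsigned char *in, size_t n,
			char *out, size_t outsz)
{
	size_t i;

	assert(in != NULL && out != NULL && outsz > 0);
	for (i = 0; i < n && i + 1 < outsz && in[i] != 0; i++)
		out[i] = (char)in[i];
	out[i] = '\0';
}
```

      Decoding the sixteen bytes yields "bpf_dctcp", the value assigned to
      .name in the SEC(".struct_ops") example, while the map itself is named
      after the variable, "dctcp".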
      
      Misc Notes:
      * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
        It does an inplace update on "*value" instead of returning a pointer
        to syscall.c.  Otherwise, it needs a separate copy of "zero" value
        for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
      
      * The bpf_struct_ops_map_delete_elem() is also called without
        preempt_disable() from map_delete_elem().  It is because
        the "->unreg()" may require a sleepable context, e.g.
        the "tcp_unregister_congestion_control()".
      
      * "const" is added to some of the existing "struct btf_func_model *"
        function arg to avoid a compiler warning caused by this patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com