1. 21 Apr, 2023 9 commits
  2. 20 Apr, 2023 9 commits
  3. 19 Apr, 2023 1 commit
  4. 18 Apr, 2023 7 commits
  5. 17 Apr, 2023 4 commits
    • Yonghong Song's avatar
      selftests/bpf: Add a selftest for checking subreg equality · 49859de9
      Yonghong Song authored
      Add a selftest to ensure subreg equality if source register
      upper 32bit is 0. Without previous patch, the test will
      fail verification.
      Acked-by: default avatarEduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      49859de9
    • Yonghong Song's avatar
      bpf: Improve verifier u32 scalar equality checking · 3be49f79
      Yonghong Song authored
      In [1], I tried to remove bpf-specific codes to prevent certain
      llvm optimizations, and add llvm TTI (target transform info) hooks
      to prevent those optimizations. During this process, I found
      if I enable llvm SimplifyCFG:shouldFoldTwoEntryPHINode
      transformation, I will hit the following verification failure with selftests:
      
        ...
        8: (18) r1 = 0xffffc900001b2230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
        10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
        13: (bc) w2 = w1                      ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (test < __NR_TESTS)
        14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0
        ;
        16: (27) r2 *= 28                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
        17: (18) r3 = 0xffffc900001b2118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
        19: (0f) r3 += r2                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
        20: (61) r2 = *(u32 *)(r3 +0)
        R3 unbounded memory access, make sure to bounds check any such access
        processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6
        -- END PROG LOAD LOG --
        libbpf: prog 'ingress_fwdns_prio100': failed to load: -13
        libbpf: failed to load object 'test_tc_dtime'
        libbpf: failed to load BPF skeleton 'test_tc_dtime': -13
        ...
      
      At insn 14, with condition 'w1 < 9', register r1 is changed from an arbitrary
      u32 value to `scalar(umax=8,var_off=(0x0; 0xf))`. Register r2, however, remains
      as an arbitrary u32 value. Current verifier won't claim r1/r2 equality if
      the previous mov is alu32 ('w2 = w1').
      
      If r1 upper 32bit value is not 0, we indeed cannot clamin r1/r2 equality
      after 'w2 = w1'. But in this particular case, we know r1 upper 32bit value
      is 0, so it is safe to claim r1/r2 equality. This patch exactly did this.
      For a 32bit subreg mov, if the src register upper 32bit is 0,
      it is okay to claim equality between src and dst registers.
      
      With this patch, the above verification sequence becomes
      
        ...
        8: (18) r1 = 0xffffc9000048e230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
        10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
        13: (bc) w2 = w1                      ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (test < __NR_TESTS)
        14: (a6) if w1 < 0x9 goto pc+1        ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff))
        ...
        from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0
        16: (27) r2 *= 28                     ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc))
        17: (18) r3 = 0xffffc9000048e118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
        19: (0f) r3 += r2
        20: (61) r2 = *(u32 *)(r3 +0)         ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252)
        ...
      
      and eventually the bpf program can be verified successfully.
      
        [1] https://reviews.llvm.org/D147968Signed-off-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3be49f79
    • Sean Young's avatar
      bpf: lirc program type should not require SYS_CAP_ADMIN · 69a8c792
      Sean Young authored
      Make it possible to load lirc program type with just CAP_BPF. There is
      nothing exceptional about lirc programs that means they require
      SYS_CAP_ADMIN.
      
      In order to attach or detach a lirc program type you need permission to
      open /dev/lirc0; if you have permission to do that, you can alter all
      sorts of lirc receiving options. Changing the IR protocol decoder is no
      different.
      
      Right now on a typical distribution /dev/lirc devices are only
      read/write by root. Ideally we would make them group read/write like
      other devices so that local users can use them without becoming root.
      Signed-off-by: default avatarSean Young <sean@mess.org>
      Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      69a8c792
    • Daniel Borkmann's avatar
      bpf: Set skb redirect and from_ingress info in __bpf_tx_skb · 59e498a3
      Daniel Borkmann authored
      There are some use-cases where it is desirable to use bpf_redirect()
      in combination with ifb device, which currently is not supported, for
      example, around filtering inbound traffic with BPF to then push it to
      ifb which holds the qdisc for shaping in contrast to doing that on the
      egress device.
      
      Toke mentions the following case related to OpenWrt:
      
         Because there's not always a single egress on the other side. These are
         mainly home routers, which tend to have one or more WiFi devices bridged
         to one or more ethernet ports on the LAN side, and a single upstream WAN
         port. And the objective is to control the total amount of traffic going
         over the WAN link (in both directions), to deal with bufferbloat in the
         ISP network (which is sadly still all too prevalent).
      
         In this setup, the traffic can be split arbitrarily between the links
         on the LAN side, and the only "single bottleneck" is the WAN link. So we
         install both egress and ingress shapers on this, configured to something
         like 95-98% of the true link bandwidth, thus moving the queues into the
         qdisc layer in the router. It's usually necessary to set the ingress
         bandwidth shaper a bit lower than the egress due to being "downstream"
         of the bottleneck link, but it does work surprisingly well.
      
         We usually use something like a matchall filter to put all ingress
         traffic on the ifb, so doing the redirect from BPF has not been an
         immediate requirement thus far. However, it does seem a bit odd that
         this is not possible, and we do have a BPF-based filter that layers on
         top of this kind of setup, which currently uses u32 as the ingress
         filter and so it could presumably be improved to use BPF instead if
         that was available.
      Reported-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Reported-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Reported-by: default avatarTonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYafang Shao <laoar.shao@gmail.com>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README
      Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk
      Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.netSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      59e498a3
  6. 16 Apr, 2023 10 commits
    • Alexei Starovoitov's avatar
      Merge branch 'Remove KF_KPTR_GET kfunc flag' · d40f4f68
      Alexei Starovoitov authored
      David Vernet says:
      
      ====================
      
      We've managed to improve the UX for kptrs significantly over the last 9
      months. All of the existing use cases which previously had KF_KPTR_GET
      kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
      have all been updated to be synchronized using RCU. In other words,
      their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
      KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
      maps in an RCU read region thanks to the types being RCU safe.
      
      While KF_KPTR_GET was a logical starting point for kptrs, it's become
      clear that they're not the correct abstraction. KF_KPTR_GET is a flag
      that essentially does nothing other than enforcing that the argument to
      a function is a pointer to a referenced kptr map value. At first glance,
      that's a useful thing to guarantee to a kfunc. It gives kfuncs the
      ability to try and acquire a reference on that kptr without requiring
      the BPF prog to do something like this:
      
      struct kptr_type *in_map, *new = NULL;
      
      in_map = bpf_kptr_xchg(&map->value, NULL);
      if (in_map) {
      	new = bpf_kptr_type_acquire(in_map);
      	in_map = bpf_kptr_xchg(&map->value, in_map);
      	if (in_map)
      		bpf_kptr_type_release(in_map);
      }
      
      That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
      the only alternative, it's better than nothing. However, the problem
      with any KF_KPTR_GET kfunc lies in the fact that it always requires some
      kind of synchronization in order to safely do an opportunistic acquire
      of the kptr in the map. This is because a BPF program running on another
      CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
      after it's been read by the KF_KPTR_GET kfunc. For example, the
      now-removed bpf_task_kptr_get() kfunc did the following:
      
      struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
      {
      	    struct task_struct *p;
      
      	rcu_read_lock();
      	p = READ_ONCE(*pp);
      	/* If p is non-NULL, it could still be freed by another CPU,
       	 * so we have to do an opportunistic refcount_inc_not_zero()
      	 * and return NULL if the task will be freed after the
      	 * current RCU read region.
      	 */
      	|f (p && !refcount_inc_not_zero(&p->rcu_users))
      		p = NULL;
      	rcu_read_unlock();
      
      	return p;
      }
      
      In other words, the kfunc uses RCU to ensure that the task remains valid
      after it's been peeked from the map. However, this is completely
      redundant with just defining a KF_RCU kfunc that itself does a
      refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
      does.
      
      So, the question of whether KF_KPTR_GET is useful is actually, "Are
      there any synchronization mechanisms / safety flags that are required by
      certain kptrs, but which are not provided by the verifier to kfuncs?"
      The answer to that question today is "No", because every kptr we
      currently care about is RCU protected.
      
      Even if the answer ever became "yes", the proper way to support that
      referenced kptr type would be to add support for whatever
      synchronization mechanism it requires in the verifier, rather than
      giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
      in a map, do whatever you need to do."
      
      With all that said -- so as to allow us to consolidate the kfunc API,
      and simplify the verifier, this patchset removes the KF_KPTR_GET kfunc
      flag.
      ---
      
      This is v2 of this patchset
      
      v1: https://lore.kernel.org/all/20230415103231.236063-1-void@manifault.com/
      
      Changelog:
      ----------
      
      v1 -> v2:
      - Fix KF_RU -> KF_RCU typo in commit summary for patch 2/3, and in cover
        letter (Alexei)
      - In order to reduce churn, don't shift all KF_* flags down by 1. We'll
        just fill the now-empty slot the next time we add a flag (Alexei)
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d40f4f68
    • David Vernet's avatar
      bpf,docs: Remove KF_KPTR_GET from documentation · 530474e6
      David Vernet authored
      A prior patch removed KF_KPTR_GET from the kernel. Now that it's no
      longer accessible to kfunc authors, this patch removes it from the BPF
      kfunc documentation.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230416084928.326135-4-void@manifault.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      530474e6
    • David Vernet's avatar
      bpf: Remove KF_KPTR_GET kfunc flag · 7b4ddf39
      David Vernet authored
      We've managed to improve the UX for kptrs significantly over the last 9
      months. All of the existing use cases which previously had KF_KPTR_GET
      kfuncs (struct bpf_cpumask *, struct task_struct *, and struct cgroup *)
      have all been updated to be synchronized using RCU. In other words,
      their KF_KPTR_GET kfuncs have been removed in favor of KF_RCU |
      KF_ACQUIRE kfuncs, with the pointers themselves also being readable from
      maps in an RCU read region thanks to the types being RCU safe.
      
      While KF_KPTR_GET was a logical starting point for kptrs, it's become
      clear that they're not the correct abstraction. KF_KPTR_GET is a flag
      that essentially does nothing other than enforcing that the argument to
      a function is a pointer to a referenced kptr map value. At first glance,
      that's a useful thing to guarantee to a kfunc. It gives kfuncs the
      ability to try and acquire a reference on that kptr without requiring
      the BPF prog to do something like this:
      
      struct kptr_type *in_map, *new = NULL;
      
      in_map = bpf_kptr_xchg(&map->value, NULL);
      if (in_map) {
              new = bpf_kptr_type_acquire(in_map);
              in_map = bpf_kptr_xchg(&map->value, in_map);
              if (in_map)
                      bpf_kptr_type_release(in_map);
      }
      
      That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
      the only alternative, it's better than nothing. However, the problem
      with any KF_KPTR_GET kfunc lies in the fact that it always requires some
      kind of synchronization in order to safely do an opportunistic acquire
      of the kptr in the map. This is because a BPF program running on another
      CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
      after it's been read by the KF_KPTR_GET kfunc. For example, the
      now-removed bpf_task_kptr_get() kfunc did the following:
      
      struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
      {
                  struct task_struct *p;
      
              rcu_read_lock();
              p = READ_ONCE(*pp);
              /* If p is non-NULL, it could still be freed by another CPU,
               * so we have to do an opportunistic refcount_inc_not_zero()
               * and return NULL if the task will be freed after the
               * current RCU read region.
               */
              |f (p && !refcount_inc_not_zero(&p->rcu_users))
                      p = NULL;
              rcu_read_unlock();
      
              return p;
      }
      
      In other words, the kfunc uses RCU to ensure that the task remains valid
      after it's been peeked from the map. However, this is completely
      redundant with just defining a KF_RCU kfunc that itself does a
      refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
      does.
      
      So, the question of whether KF_KPTR_GET is useful is actually, "Are
      there any synchronization mechanisms / safety flags that are required by
      certain kptrs, but which are not provided by the verifier to kfuncs?"
      The answer to that question today is "No", because every kptr we
      currently care about is RCU protected.
      
      Even if the answer ever became "yes", the proper way to support that
      referenced kptr type would be to add support for whatever
      synchronization mechanism it requires in the verifier, rather than
      giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
      in a map, do whatever you need to do."
      
      With all that said -- so as to allow us to consolidate the kfunc API,
      and simplify the verifier a bit, this patch removes KF_KPTR_GET, and all
      relevant logic from the verifier.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230416084928.326135-3-void@manifault.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7b4ddf39
    • David Vernet's avatar
      bpf: Remove bpf_kfunc_call_test_kptr_get() test kfunc · 09b501d9
      David Vernet authored
      We've managed to improve the UX for kptrs significantly over the last 9
      months. All of the prior main use cases, struct bpf_cpumask *, struct
      task_struct *, and struct cgroup *, have all been updated to be
      synchronized mainly using RCU. In other words, their KF_ACQUIRE kfunc
      calls are all KF_RCU, and the pointers themselves are MEM_RCU and can be
      accessed in an RCU read region in BPF.
      
      In a follow-on change, we'll be removing the KF_KPTR_GET kfunc flag.
      This patch prepares for that by removing the
      bpf_kfunc_call_test_kptr_get() kfunc, and all associated selftests.
      Signed-off-by: default avatarDavid Vernet <void@manifault.com>
      Link: https://lore.kernel.org/r/20230416084928.326135-2-void@manifault.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      09b501d9
    • Alexei Starovoitov's avatar
      Merge branch 'Shared ownership for local kptrs' · 7a0788fe
      Alexei Starovoitov authored
      Dave Marchevsky says:
      
      ====================
      
      This series adds support for refcounted local kptrs to the verifier. A local
      kptr is 'refcounted' if its type contains a struct bpf_refcount field:
      
        struct refcounted_node {
          long data;
          struct bpf_list_node ll;
          struct bpf_refcount ref;
        };
      
      bpf_refcount is used to implement shared ownership for local kptrs.
      
      Motivating usecase
      ==================
      
      If a struct has two collection node fields, e.g.:
      
        struct node {
          long key;
          long val;
          struct bpf_rb_node rb;
          struct bpf_list_node ll;
        };
      
      It's not currently possible to add a node to both the list and rbtree:
      
        long bpf_prog(void *ctx)
        {
          struct node *n = bpf_obj_new(typeof(*n));
          if (!n) { /* ... */ }
      
          bpf_spin_lock(&lock);
      
          bpf_list_push_back(&head, &n->ll);
          bpf_rbtree_add(&root, &n->rb, less); /* Assume a resonable less() */
          bpf_spin_unlock(&lock);
        }
      
      The above program will fail verification due to current owning / non-owning ref
      logic: after bpf_list_push_back, n is a non-owning reference and thus cannot be
      passed to bpf_rbtree_add. The only way to get an owning reference for the node
      that was added is to bpf_list_pop_{front,back} it.
      
      More generally, verifier ownership semantics expect that a node has one
      owner (program, collection, or stashed in map) with exclusive ownership
      of the node's lifetime. The owner free's the node's underlying memory when it
      itself goes away.
      
      Without a shared ownership concept it's impossible to express many real-world
      usecases such that they pass verification.
      
      Semantic Changes
      ================
      
      Before this series, the verifier could make this statement: "whoever has the
      owning reference has exclusive ownership of the referent's lifetime". As
      demonstrated in the previous section, this implies that a BPF program can't
      have an owning reference to some node if that node is in a collection. If
      such a state were possible, the node would have multiple owners, each thinking
      they have exclusive ownership. In order to support shared ownership it's
      necessary to modify the exclusive ownership semantic.
      
      After this series' changes, an owning reference has ownership of the referent's
      lifetime, but it's not necessarily exclusive. The referent's underlying memory
      is guaranteed to be valid (i.e. not free'd) until the reference is dropped or
      used for collection insert.
      
      This change doesn't affect UX of owning or non-owning references much:
      
        * insert kfuncs (bpf_rbtree_add, bpf_list_push_{front,back}) still require
          an owning reference arg, as ownership still must be passed to the
          collection in a shared-ownership world.
      
        * non-owning references still refer to valid memory without claiming
          any ownership.
      
      One important conclusion that followed from "exclusive ownership" statement
      is no longer valid, though. In exclusive-ownership world, if a BPF prog has
      an owning reference to a node, the verifier can conclude that no collection has
      ownership of it. This conclusion was used to avoid runtime checking in the
      implementations of insert and remove operations (""has the node already been
      {inserted, removed}?").
      
      In a shared-ownership world the aforementioned conclusion is no longer valid,
      which necessitates doing runtime checking in insert and remove operation
      kfuncs, and those functions possibly failing to insert or remove anything.
      
      Luckily the verifier changes necessary to go from exclusive to shared ownership
      were fairly minimal. Patches in this series which do change verifier semantics
      generally have some summary dedicated to explaining why certain usecases
      Just Work for shared ownership without verifier changes.
      
      Implementation
      ==============
      
      The changes in this series can be categorized as follows:
      
        * struct bpf_refcount opaque field + plumbing
        * support for refcounted kptrs in bpf_obj_new and bpf_obj_drop
        * bpf_refcount_acquire kfunc
          * enables shared ownershp by bumping refcount + acquiring owning ref
        * support for possibly-failing collection insertion and removal
          * insertion changes are more complex
      
      If a patch's changes have some nuance to their effect - or lack of effect - on
      verifier behavior, the patch summary talks about it at length.
      
      Patch contents:
        * Patch 1 removes btf_field_offs struct
        * Patch 2 adds struct bpf_refcount and associated plumbing
        * Patch 3 modifies semantics of bpf_obj_drop and bpf_obj_new to handle
          refcounted kptrs
        * Patch 4 adds bpf_refcount_acquire
        * Patches 5-7 add support for possibly-failing collection insert and remove
        * Patch 8 centralizes constructor-like functionality for local kptr types
        * Patch 9 adds tests for new functionality
      
      base-commit: 4a1e885c
      
      Changelog:
      
      v1 -> v2: lore.kernel.org/bpf/20230410190753.2012798-1-davemarchevsky@fb.com
      
      Patch #s used below refer to the patch's position in v1 unless otherwise
      specified.
      
        * General
          * Rebase onto latest bpf-next (base-commit updated above)
      
        * Patch 4 - "bpf: Add bpf_refcount_acquire kfunc"
          * Fix typo in summary (Alexei)
        * Patch 7 - "Migrate bpf_rbtree_remove to possibly fail"
          * Modify a paragraph in patch summary to more clearly state that only
            bpf_rbtree_remove's non-owning ref clobbering behavior is changed by the
            patch (Alexei)
          * refcount_off == -1 -> refcount_off < 0  in "node type w/ both list
            and rb_node fields" check, since any negative value means "no
            bpf_refcount field found", and furthermore refcount_off is never
            explicitly set to -1, but rather -EINVAL. (Alexei)
          * Instead of just changing "btf: list_node and rb_node in same struct" test
            expectation to pass instead of fail, do some refactoring to test both
            "list_node, rb_node, and bpf_refcount" (success) and "list_node, rb_node,
            _no_ bpf_refcount" (failure) cases. This ensures that logic change in
            previous bullet point is correct.
            * v1's "btf: list_node and rb_node in same struct" test changes didn't
              add bpf_refcount, so the fact that btf load succeeded w/ list and
              rb_nodes but no bpf_refcount field is further proof that this logic
              was incorrect in v1.
        * Patch 8 - "bpf: Centralize btf_field-specific initialization logic"
          * Instead of doing __init_field_infer_size in kfuncs when taking
            bpf_list_head type input which might've been 0-initialized in map, go
            back to simple oneliner initialization. Add short comment explaining why
            this is necessary. (Alexei)
        * Patch 9 - "selftests/bpf: Add refcounted_kptr tests"
          * Don't __always_inline helper fns in progs/refcounted_kptr.c (Alexei)
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      7a0788fe
    • Dave Marchevsky's avatar
      selftests/bpf: Add refcounted_kptr tests · 6147f151
      Dave Marchevsky authored
      Test refcounted local kptr functionality added in previous patches in
      the series.
      
      Usecases which pass verification:
      
      * Add refcounted local kptr to both tree and list. Then, read and -
        possibly, depending on test variant - delete from tree, then list.
        * Also test doing read-and-maybe-delete in opposite order
      * Stash a refcounted local kptr in a map_value, then add it to a
        rbtree. Read from both, possibly deleting after tree read.
      * Add refcounted local kptr to both tree and list. Then, try reading and
        deleting twice from one of the collections.
      * bpf_refcount_acquire of just-added non-owning ref should work, as
        should bpf_refcount_acquire of owning ref just out of bpf_obj_new
      
      Usecases which fail verification:
      
      * The simple successful bpf_refcount_acquire cases from above should
        both fail to verify if the newly-acquired owning ref is not dropped
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-10-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      6147f151
    • Dave Marchevsky's avatar
      bpf: Centralize btf_field-specific initialization logic · 3e81740a
      Dave Marchevsky authored
      All btf_fields in an object are 0-initialized by memset in
      bpf_obj_init. This might not be a valid initial state for some field
      types, in which case kfuncs that use the type will properly initialize
      their input if it's been 0-initialized. Some BPF graph collection types
      and kfuncs do this: bpf_list_{head,node} and bpf_rb_node.
      
      An earlier patch in this series added the bpf_refcount field, for which
      the 0 state indicates that the refcounted object should be free'd.
      bpf_obj_init treats this field specially, setting refcount to 1 instead
      of relying on scattered "refcount is 0? Must have just been initialized,
      let's set to 1" logic in kfuncs.
      
      This patch extends this treatment to list and rbtree field types,
      allowing most scattered initialization logic in kfuncs to be removed.
      
      Note that bpf_{list_head,rb_root} may be inside a BPF map, in which case
      they'll be 0-initialized without passing through the newly-added logic,
      so scattered initialization logic must remain for these collection root
      types.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-9-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3e81740a
    • Dave Marchevsky's avatar
      bpf: Migrate bpf_rbtree_remove to possibly fail · 404ad75a
      Dave Marchevsky authored
      This patch modifies bpf_rbtree_remove to account for possible failure
      due to the input rb_node already not being in any collection.
      The function can now return NULL, and does when the aforementioned
      scenario occurs. As before, on successful removal an owning reference to
      the removed node is returned.
      
      Adding KF_RET_NULL to bpf_rbtree_remove's kfunc flags - now KF_RET_NULL |
      KF_ACQUIRE - provides the desired verifier semantics:
      
        * retval must be checked for NULL before use
        * if NULL, retval's ref_obj_id is released
        * retval is a "maybe acquired" owning ref, not a non-owning ref,
          so it will live past end of critical section (bpf_spin_unlock), and
          thus can be checked for NULL after the end of the CS
      
      BPF programs must add checks
      ============================
      
      This does change bpf_rbtree_remove's verifier behavior. BPF program
      writers will need to add NULL checks to their programs, but the
      resulting UX looks natural:
      
        bpf_spin_lock(&glock);
      
        n = bpf_rbtree_first(&ghead);
        if (!n) { /* ... */}
        res = bpf_rbtree_remove(&ghead, &n->node);
      
        bpf_spin_unlock(&glock);
      
        if (!res)  /* Newly-added check after this patch */
          return 1;
      
        n = container_of(res, /* ... */);
        /* Do something else with n */
        bpf_obj_drop(n);
        return 0;
      
      The "if (!res)" check above is the only addition necessary for the above
      program to pass verification after this patch.
      
      bpf_rbtree_remove no longer clobbers non-owning refs
      ====================================================
      
      An issue arises when bpf_rbtree_remove fails, though. Consider this
      example:
      
        struct node_data {
          long key;
          struct bpf_list_node l;
          struct bpf_rb_node r;
          struct bpf_refcount ref;
        };
      
        long failed_sum;
      
        void bpf_prog()
        {
          struct node_data *n = bpf_obj_new(/* ... */);
          struct bpf_rb_node *res;
          n->key = 10;
      
          bpf_spin_lock(&glock);
      
          bpf_list_push_back(&some_list, &n->l); /* n is now a non-owning ref */
          res = bpf_rbtree_remove(&some_tree, &n->r, /* ... */);
          if (!res)
            failed_sum += n->key;  /* not possible */
      
          bpf_spin_unlock(&glock);
          /* if (res) { do something useful and drop } ... */
        }
      
      The bpf_rbtree_remove in this example will always fail. Similarly to
      bpf_spin_unlock, bpf_rbtree_remove is a non-owning reference
      invalidation point. The verifier clobbers all non-owning refs after a
      bpf_rbtree_remove call, so the "failed_sum += n->key" line will fail
      verification, and in fact there's no good way to get information about
      the node which failed to add after the invalidation. This patch removes
      non-owning reference invalidation from bpf_rbtree_remove to allow the
      above usecase to pass verification. The logic for why this is now
      possible is as follows:
      
      Before this series, bpf_rbtree_add couldn't fail and thus assumed that
      its input, a non-owning reference, was in the tree. But it's easy to
      construct an example where two non-owning references pointing to the same
      underlying memory are acquired and passed to rbtree_remove one after
      another (see rbtree_api_release_aliasing in
      selftests/bpf/progs/rbtree_fail.c).
      
      So it was necessary to clobber non-owning refs to prevent this
      case and, more generally, to enforce "non-owning ref is definitely
      in some collection" invariant. This series removes that invariant and
      the failure / runtime checking added in this patch provide a clean way
      to deal with the aliasing issue - just fail to remove.
      
      Because the aliasing issue prevented by clobbering non-owning refs is no
      longer an issue, this patch removes the invalidate_non_owning_refs
      call from verifier handling of bpf_rbtree_remove. Note that
      bpf_spin_unlock - the other caller of invalidate_non_owning_refs -
      clobbers non-owning refs for a different reason, so its clobbering
      behavior remains unchanged.
      
      No BPF program changes are necessary for programs to remain valid as a
      result of this clobbering change. A valid program before this patch
      passed verification with its non-owning refs having shorter (or equal)
      lifetimes due to more aggressive clobbering.
      
      Also, update existing tests to check bpf_rbtree_remove retval for NULL
      where necessary, and move rbtree_api_release_aliasing from
      progs/rbtree_fail.c to progs/rbtree.c since it's now expected to pass
      verification.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-8-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      404ad75a
    • Dave Marchevsky's avatar
      selftests/bpf: Modify linked_list tests to work with macro-ified inserts · de67ba39
      Dave Marchevsky authored
      The linked_list tests use macros and function pointers to reduce code
      duplication. Earlier in the series, bpf_list_push_{front,back} were
      modified to be macros, expanding to invoke actual kfuncs
      bpf_list_push_{front,back}_impl. Due to this change, a code snippet
      like:
      
        void (*p)(void *, void *) = (void *)&bpf_list_##op;
        p(hexpr, nexpr);
      
      meant to do bpf_list_push_{front,back}(hexpr, nexpr), will no longer
      work as it's no longer valid to do &bpf_list_push_{front,back} since
      they're no longer functions.
      
      This patch fixes issues of this type, along with two other minor changes
      - one improvement and one fix - both related to the node argument to
      list_push_{front,back}.
      
        * The fix: migration of list_push tests away from (void *, void *)
          func ptr uncovered that some tests were incorrectly passing pointer
          to node, not pointer to struct bpf_list_node within the node. This
          patch fixes such issues (CHECK(..., f) -> CHECK(..., &f->node))
      
        * The improvement: In linked_list tests, the struct foo type has two
          list_node fields: node and node2, at byte offsets 0 and 40 within
          the struct, respectively. Currently node is used in ~all tests
          involving struct foo and lists. The verifier needs to do some work
          to account for the offset of bpf_list_node within the node type, so
          using node2 instead of node exercises that logic more in the tests.
          This patch migrates linked_list tests to use node2 instead of node.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-7-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      de67ba39
    • Dave Marchevsky's avatar
      bpf: Migrate bpf_rbtree_add and bpf_list_push_{front,back} to possibly fail · d2dcc67d
      Dave Marchevsky authored
      Consider this code snippet:
      
        struct node {
          long key;
          bpf_list_node l;
          bpf_rb_node r;
          bpf_refcount ref;
        }
      
        int some_bpf_prog(void *ctx)
        {
          struct node *n = bpf_obj_new(/*...*/), *m;
      
          bpf_spin_lock(&glock);
      
          bpf_rbtree_add(&some_tree, &n->r, /* ... */);
          m = bpf_refcount_acquire(n);
          bpf_rbtree_add(&other_tree, &m->r, /* ... */);
      
          bpf_spin_unlock(&glock);
      
          /* ... */
        }
      
      After bpf_refcount_acquire, n and m point to the same underlying memory,
      and that node's bpf_rb_node field is being used by the some_tree insert,
      so overwriting it as a result of the second insert is an error. In order
      to properly support refcounted nodes, the rbtree and list insert
      functions must be allowed to fail. This patch adds such support.
      
      The kfuncs bpf_rbtree_add, bpf_list_push_{front,back} are modified to
      return an int indicating success/failure, with 0 -> success, nonzero ->
      failure.
      
      bpf_obj_drop on failure
      =======================
      
      Currently the only reason an insert can fail is the example above: the
      bpf_{list,rb}_node is already in use. When such a failure occurs, the
      insert kfuncs will bpf_obj_drop the input node. This allows the insert
      operations to logically fail without changing their verifier owning ref
      behavior, namely the unconditional release_reference of the input
      owning ref.
      
      With insert that always succeeds, ownership of the node is always passed
      to the collection, since the node always ends up in the collection.
      
      With a possibly-failed insert w/ bpf_obj_drop, ownership of the node
      is always passed either to the collection (success), or to bpf_obj_drop
      (failure). Regardless, it's correct to continue unconditionally
      releasing the input owning ref, as something is always taking ownership
      from the calling program on insert.
      
      Keeping owning ref behavior unchanged results in a nice default UX for
      insert functions that can fail. If the program's reaction to a failed
      insert is "fine, just get rid of this owning ref for me and let me go
      on with my business", then there's no reason to check for failure since
      that's default behavior. e.g.:
      
        long important_failures = 0;
      
        int some_bpf_prog(void *ctx)
        {
          struct node *n, *m, *o; /* all bpf_obj_new'd */
      
          bpf_spin_lock(&glock);
          bpf_rbtree_add(&some_tree, &n->node, /* ... */);
          bpf_rbtree_add(&some_tree, &m->node, /* ... */);
          if (bpf_rbtree_add(&some_tree, &o->node, /* ... */)) {
            important_failures++;
          }
          bpf_spin_unlock(&glock);
        }
      
      If we instead chose to pass ownership back to the program on failed
      insert - by returning NULL on success or an owning ref on failure -
      programs would always have to do something with the returned ref on
      failure. The most likely action is probably "I'll just get rid of this
      owning ref and go about my business", which ideally would look like:
      
        if (n = bpf_rbtree_add(&some_tree, &n->node, /* ... */))
          bpf_obj_drop(n);
      
      But bpf_obj_drop isn't allowed in a critical section and inserts must
      occur within one, so in reality error handling would become a
      hard-to-parse mess.
      
      For refcounted nodes, we can replicate the "pass ownership back to
      program on failure" logic with this patch's semantics, albeit in an ugly
      way:
      
        struct node *n = bpf_obj_new(/* ... */), *m;
      
        bpf_spin_lock(&glock);
      
        m = bpf_refcount_acquire(n);
        if (bpf_rbtree_add(&some_tree, &n->node, /* ... */)) {
          /* Do something with m */
        }
      
        bpf_spin_unlock(&glock);
        bpf_obj_drop(m);
      
      bpf_refcount_acquire is used to simulate "return owning ref on failure".
      This should be an uncommon occurrence, though.
      
      Addition of two verifier-fixup'd args to collection inserts
      ===========================================================
      
      The actual bpf_obj_drop kfunc is
      bpf_obj_drop_impl(void *, struct btf_struct_meta *), with bpf_obj_drop
      macro populating the second arg with 0 and the verifier later filling in
      the arg during insn fixup.
      
      Because bpf_rbtree_add and bpf_list_push_{front,back} now might do
      bpf_obj_drop, these kfuncs need a btf_struct_meta parameter that can be
      passed to bpf_obj_drop_impl.
      
      Similarly, because the 'node' param to those insert functions is the
      bpf_{list,rb}_node within the node type, and bpf_obj_drop expects a
      pointer to the beginning of the node, the insert functions need to be
      able to find the beginning of the node struct. A second
      verifier-populated param is necessary: the offset of {list,rb}_node within the
      node type.
      
      These two new params allow the insert kfuncs to correctly call
      __bpf_obj_drop_impl:
      
        beginning_of_node = bpf_rb_node_ptr - offset
        if (already_inserted)
          __bpf_obj_drop_impl(beginning_of_node, btf_struct_meta->record);
      
      Similarly to other kfuncs with "hidden" verifier-populated params, the
      insert functions are renamed with _impl prefix and a macro is provided
      for common usage. For example, bpf_rbtree_add kfunc is now
      bpf_rbtree_add_impl and bpf_rbtree_add is now a macro which sets
      "hidden" args to 0.
      
      Due to the two new args BPF progs will need to be recompiled to work
      with the new _impl kfuncs.
      
      This patch also rewrites the "hidden argument" explanation to more
      directly say why the BPF program writer doesn't need to populate the
      arguments with anything meaningful.
      
      How does this new logic affect non-owning references?
      =====================================================
      
      Currently, non-owning refs are valid until the end of the critical
      section in which they're created. We can make this guarantee because, if
      a non-owning ref exists, the referent was added to some collection. The
      collection will drop() its nodes when it goes away, but it can't go away
      while our program is accessing it, so that's not a problem. If the
      referent is removed from the collection in the same CS that it was added
      in, it can't be bpf_obj_drop'd until after CS end. Those are the only
      two ways to free the referent's memory and neither can happen until
      after the non-owning ref's lifetime ends.
      
      On first glance, having these collection insert functions potentially
      bpf_obj_drop their input seems like it breaks the "can't be
      bpf_obj_drop'd until after CS end" line of reasoning. But we care about
      the memory not being _freed_ until end of CS end, and a previous patch
      in the series modified bpf_obj_drop such that it doesn't free refcounted
      nodes until refcount == 0. So the statement can be more accurately
      rewritten as "can't be free'd until after CS end".
      
      We can prove that this rewritten statement holds for any non-owning
      reference produced by collection insert functions:
      
      * If the input to the insert function is _not_ refcounted
        * We have an owning reference to the input, and can conclude it isn't
          in any collection
          * Inserting a node in a collection turns owning refs into
            non-owning, and since our input type isn't refcounted, there's no
            way to obtain additional owning refs to the same underlying
            memory
        * Because our node isn't in any collection, the insert operation
          cannot fail, so bpf_obj_drop will not execute
        * If bpf_obj_drop is guaranteed not to execute, there's no risk of
          memory being free'd
      
      * Otherwise, the input to the insert function is refcounted
        * If the insert operation fails due to the node's list_head or rb_root
          already being in some collection, there was some previous successful
          insert which passed refcount to the collection
        * We have an owning reference to the input, it must have been
          acquired via bpf_refcount_acquire, which bumped the refcount
        * refcount must be >= 2 since there's a valid owning reference and the
          node is already in a collection
        * Insert triggering bpf_obj_drop will decr refcount to >= 1, never
          resulting in a free
      
      So although we may do bpf_obj_drop during the critical section, this
      will never result in memory being free'd, and no changes to non-owning
      ref logic are needed in this patch.
      Signed-off-by: default avatarDave Marchevsky <davemarchevsky@fb.com>
      Link: https://lore.kernel.org/r/20230415201811.343116-6-davemarchevsky@fb.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d2dcc67d