  1. 17 Apr, 2023 4 commits
    • selftests/bpf: Add a selftest for checking subreg equality · 49859de9
      Yonghong Song authored
      Add a selftest to ensure subreg equality when the source
      register's upper 32 bits are 0. Without the previous patch, the
      test fails verification.
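      A minimal sketch of the shape being tested (illustrative only,
      not the actual selftest source; the program name and section are
      assumptions):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("socket")
        int subreg_equality(struct __sk_buff *skb)
        {
        	__u32 a = skb->len;	/* 32-bit load: upper half of the reg is 0 */
        	__u32 b = a;		/* alu32 mov: w_b = w_a */

        	if (a < 9)		/* a bound learned on 'a' should now */
        		return b;	/* propagate to 'b' as well */
        	return 0;
        }

        char _license[] SEC("license") = "GPL";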
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20230417222139.360607-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Improve verifier u32 scalar equality checking · 3be49f79
      Yonghong Song authored
      In [1], I tried to remove the bpf-specific code that prevents
      certain llvm optimizations and to add llvm TTI (target transform
      info) hooks to prevent those optimizations instead. During this
      process, I found that if I enable the llvm
      SimplifyCFG:shouldFoldTwoEntryPHINode transformation, I hit the
      following verification failure with selftests:
      
        ...
        8: (18) r1 = 0xffffc900001b2230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
        10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
        13: (bc) w2 = w1                      ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (test < __NR_TESTS)
        14: (a6) if w1 < 0x9 goto pc+1 16: R0=2 R1_w=scalar(umax=8,var_off=(0x0; 0xf)) R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R6=ctx(off=0,imm=0) R10=fp0
        ;
        16: (27) r2 *= 28                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
        17: (18) r3 = 0xffffc900001b2118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
        19: (0f) r3 += r2                     ; R2_w=scalar(umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4) R3_w=map_value(off=280,ks=4,vs=564,umax=120259084260,var_off=(0x0; 0x1ffffffffc),s32_max=2147483644,u32_max=-4)
        20: (61) r2 = *(u32 *)(r3 +0)
        R3 unbounded memory access, make sure to bounds check any such access
        processed 97 insns (limit 1000000) max_states_per_insn 1 total_states 10 peak_states 10 mark_read 6
        -- END PROG LOAD LOG --
        libbpf: prog 'ingress_fwdns_prio100': failed to load: -13
        libbpf: failed to load object 'test_tc_dtime'
        libbpf: failed to load BPF skeleton 'test_tc_dtime': -13
        ...
      
      At insn 14, with the condition 'w1 < 9', register r1 is refined
      from an arbitrary u32 value to scalar(umax=8,var_off=(0x0; 0xf))
      on the branch to insn 16. Register r2, however, remains an
      arbitrary u32 value. The current verifier won't claim r1/r2
      equality when the preceding mov is alu32 ('w2 = w1').
      
      If r1's upper 32 bits are not 0, we indeed cannot claim r1/r2
      equality after 'w2 = w1'. But in this particular case, we know
      r1's upper 32 bits are 0, so it is safe to claim r1/r2 equality.
      This patch does exactly that: for a 32-bit subreg mov, if the
      upper 32 bits of the src register are 0, it is okay to claim
      equality between the src and dst registers.
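      A sketch of the resulting check_alu_op() logic for a 32-bit mov
      (w_dst = w_src); the names follow kernel/bpf/verifier.c, but this
      is a simplified outline rather than the exact diff:

        bool is_src_reg_u32 = src_reg->umax_value <= U32_MAX;

        /* Give src an id so dst can share it, but only when src's
         * upper 32 bits are known to be zero.
         */
        if (is_src_reg_u32 && !src_reg->id)
        	src_reg->id = ++env->id_gen;
        copy_register_state(dst_reg, src_reg);
        if (!is_src_reg_u32)
        	dst_reg->id = 0;	/* no equality link otherwise */
        dst_reg->subreg_def = env->insn_idx + 1;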
      
      With this patch, the above verification sequence becomes
      
        ...
        8: (18) r1 = 0xffffc9000048e230       ; R1_w=map_value(off=560,ks=4,vs=564,imm=0)
        10: (61) r1 = *(u32 *)(r1 +0)         ; R1_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        11: (79) r2 = *(u64 *)(r6 +152)       ; R2_w=scalar() R6=ctx(off=0,imm=0)
        ; if (skb->tstamp == EGRESS_ENDHOST_MAGIC)
        12: (55) if r2 != 0xb9fbeef goto pc+10        ; R2_w=195018479
        13: (bc) w2 = w1                      ; R1_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff)) R2_w=scalar(id=6,umax=4294967295,var_off=(0x0; 0xffffffff))
        ; if (test < __NR_TESTS)
        14: (a6) if w1 < 0x9 goto pc+1        ; R1_w=scalar(id=6,umin=9,umax=4294967295,var_off=(0x0; 0xffffffff))
        ...
        from 14 to 16: R0=2 R1_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R2_w=scalar(id=6,umax=8,var_off=(0x0; 0xf)) R6=ctx(off=0,imm=0) R10=fp0
        16: (27) r2 *= 28                     ; R2_w=scalar(umax=224,var_off=(0x0; 0xfc))
        17: (18) r3 = 0xffffc9000048e118      ; R3_w=map_value(off=280,ks=4,vs=564,imm=0)
        19: (0f) r3 += r2
        20: (61) r2 = *(u32 *)(r3 +0)         ; R2_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R3_w=map_value(off=280,ks=4,vs=564,umax=224,var_off=(0x0; 0xfc),s32_max=252,u32_max=252)
        ...
      
      and eventually the bpf program can be verified successfully.
      
        [1] https://reviews.llvm.org/D147968

      Signed-off-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20230417222134.359714-1-yhs@fb.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: lirc program type should not require CAP_SYS_ADMIN · 69a8c792
      Sean Young authored
      Make it possible to load the lirc program type with just CAP_BPF.
      There is nothing exceptional about lirc programs that would
      require CAP_SYS_ADMIN.
      
      In order to attach or detach a lirc program type you need permission to
      open /dev/lirc0; if you have permission to do that, you can alter all
      sorts of lirc receiving options. Changing the IR protocol decoder is no
      different.
      
      Right now on a typical distribution /dev/lirc devices are only
      read/write by root. Ideally we would make them group read/write like
      other devices so that local users can use them without becoming root.
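      For illustration, attaching with libbpf looks roughly like this
      (prog_fd is assumed to be a loaded BPF_PROG_TYPE_LIRC_MODE2
      program; error handling omitted):

        #include <fcntl.h>
        #include <bpf/bpf.h>

        int lirc_fd = open("/dev/lirc0", O_RDWR);
        int err = bpf_prog_attach(prog_fd, lirc_fd, BPF_LIRC_MODE2, 0);

      With this patch the load needs only CAP_BPF, and the attach is
      gated by the open() permission check on the lirc chardev itself.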
      Signed-off-by: Sean Young <sean@mess.org>
      Link: https://lore.kernel.org/r/ZD0ArKpwnDBJZsrE@gofer.mess.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Set skb redirect and from_ingress info in __bpf_tx_skb · 59e498a3
      Daniel Borkmann authored
      There are use cases where it is desirable to use bpf_redirect()
      in combination with an ifb device, which is currently not
      supported: for example, filtering inbound traffic with BPF and
      then pushing it to an ifb device which holds the qdisc for
      shaping, in contrast to doing the shaping on the egress device
      (see the sketch after the quote below).
      
      Toke mentions the following case related to OpenWrt:
      
         Because there's not always a single egress on the other side. These are
         mainly home routers, which tend to have one or more WiFi devices bridged
         to one or more ethernet ports on the LAN side, and a single upstream WAN
         port. And the objective is to control the total amount of traffic going
         over the WAN link (in both directions), to deal with bufferbloat in the
         ISP network (which is sadly still all too prevalent).
      
         In this setup, the traffic can be split arbitrarily between the links
         on the LAN side, and the only "single bottleneck" is the WAN link. So we
         install both egress and ingress shapers on this, configured to something
         like 95-98% of the true link bandwidth, thus moving the queues into the
         qdisc layer in the router. It's usually necessary to set the ingress
         bandwidth shaper a bit lower than the egress due to being "downstream"
         of the bottleneck link, but it does work surprisingly well.
      
         We usually use something like a matchall filter to put all ingress
         traffic on the ifb, so doing the redirect from BPF has not been an
         immediate requirement thus far. However, it does seem a bit odd that
         this is not possible, and we do have a BPF-based filter that layers on
         top of this kind of setup, which currently uses u32 as the ingress
         filter and so it could presumably be improved to use BPF instead if
         that was available.
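      For reference, the redirect this patch enables is a one-liner on
      the real device's ingress; a minimal sketch, with IFB_IFINDEX
      standing in for the ifb device's ifindex:

        SEC("tc")
        int redirect_to_ifb(struct __sk_buff *skb)
        {
        	/* Hand the packet to the ifb device, whose qdisc
        	 * does the shaping.
        	 */
        	return bpf_redirect(IFB_IFINDEX, 0);
        }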
      Reported-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Reported-by: Yafang Shao <laoar.shao@gmail.com>
      Reported-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yafang Shao <laoar.shao@gmail.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://git.openwrt.org/?p=project/qosify.git;a=blob;f=README
      Link: https://lore.kernel.org/bpf/875y9yzbuy.fsf@toke.dk
      Link: https://lore.kernel.org/r/8cebc8b2b6e967e10cbafe2ffd6795050e74accd.1681739137.git.daniel@iogearbox.net
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 16 Apr, 2023 1 commit
    • Merge branch 'Remove KF_KPTR_GET kfunc flag' · d40f4f68
      Alexei Starovoitov authored
      David Vernet says:
      
      ====================
      
      We've managed to improve the UX for kptrs significantly over the
      last 9 months. All of the existing use cases which previously had
      KF_KPTR_GET kfuncs (struct bpf_cpumask *, struct task_struct *,
      and struct cgroup *) have been updated to be synchronized using
      RCU. In other words, their KF_KPTR_GET kfuncs have been removed
      in favor of KF_RCU | KF_ACQUIRE kfuncs, with the pointers
      themselves also being readable from maps in an RCU read region
      thanks to the types being RCU safe.
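      For example, the prog-side pattern now looks roughly like this
      ('v' is assumed to be a map value with a struct task_struct
      __kptr field; illustrative, not taken from a specific selftest):

        bpf_rcu_read_lock();
        p = v->task;	/* readable directly, no bpf_kptr_xchg() dance */
        if (p)
        	bpf_printk("pid=%d", p->pid);
        bpf_rcu_read_unlock();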
      
      While KF_KPTR_GET was a logical starting point for kptrs, it's
      become clear that it's not the correct abstraction. KF_KPTR_GET
      is a flag that essentially does nothing other than enforce that
      the argument to a function is a pointer to a referenced kptr map
      value. At first glance, that's a useful thing to guarantee to a
      kfunc. It gives kfuncs the ability to try to acquire a reference
      on that kptr without requiring the BPF prog to do something like
      this:
      
      struct kptr_type *in_map, *new = NULL;

      /* Swap the kptr out of the map so nobody else can free it. */
      in_map = bpf_kptr_xchg(&map->value, NULL);
      if (in_map) {
      	new = bpf_kptr_type_acquire(in_map);
      	/* Put the original pointer back ... */
      	in_map = bpf_kptr_xchg(&map->value, in_map);
      	/* ... and release whatever raced in while it was gone. */
      	if (in_map)
      		bpf_kptr_type_release(in_map);
      }
      
      That's clearly a pretty ugly (and racy) UX, and if using KF_KPTR_GET is
      the only alternative, it's better than nothing. However, the problem
      with any KF_KPTR_GET kfunc lies in the fact that it always requires some
      kind of synchronization in order to safely do an opportunistic acquire
      of the kptr in the map. This is because a BPF program running on another
      CPU could do a bpf_kptr_xchg() on that map value, and free the kptr
      after it's been read by the KF_KPTR_GET kfunc. For example, the
      now-removed bpf_task_kptr_get() kfunc did the following:
      
      struct task_struct *bpf_task_kptr_get(struct task_struct **pp)
      {
      	struct task_struct *p;

      	rcu_read_lock();
      	p = READ_ONCE(*pp);
      	/* If p is non-NULL, it could still be freed by another CPU,
      	 * so we have to do an opportunistic refcount_inc_not_zero()
      	 * and return NULL if the task will be freed after the
      	 * current RCU read region.
      	 */
      	if (p && !refcount_inc_not_zero(&p->rcu_users))
      		p = NULL;
      	rcu_read_unlock();

      	return p;
      }
      
      In other words, the kfunc uses RCU to ensure that the task remains valid
      after it's been peeked from the map. However, this is completely
      redundant with just defining a KF_RCU kfunc that itself does a
      refcount_inc_not_zero(), which is exactly what bpf_task_acquire() now
      does.
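      A sketch of that KF_RCU | KF_ACQUIRE shape, modeled on what
      bpf_task_acquire() does after the conversion (the exact kernel
      code may differ in detail):

        __bpf_kfunc struct task_struct *bpf_task_acquire(struct task_struct *p)
        {
        	/* The caller holds the RCU read lock (KF_RCU), so an
        	 * opportunistic refcount bump is all that's needed.
        	 */
        	if (refcount_inc_not_zero(&p->rcu_users))
        		return p;
        	return NULL;
        }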
      
      So, the question of whether KF_KPTR_GET is useful is actually, "Are
      there any synchronization mechanisms / safety flags that are required by
      certain kptrs, but which are not provided by the verifier to kfuncs?"
      The answer to that question today is "No", because every kptr we
      currently care about is RCU protected.
      
      Even if the answer ever became "yes", the proper way to support that
      referenced kptr type would be to add support for whatever
      synchronization mechanism it requires in the verifier, rather than
      giving kfuncs a flag that says, "Here's a pointer to a referenced kptr
      in a map, do whatever you need to do."
      
      With all that said, and so as to consolidate the kfunc API and
      simplify the verifier, this patchset removes the KF_KPTR_GET
      kfunc flag.
      ---
      
      This is v2 of this patchset
      
      v1: https://lore.kernel.org/all/20230415103231.236063-1-void@manifault.com/
      
      Changelog:
      ----------
      
      v1 -> v2:
      - Fix KF_RU -> KF_RCU typo in commit summary for patch 2/3, and in cover
        letter (Alexei)
      - In order to reduce churn, don't shift all KF_* flags down by 1. We'll
        just fill the now-empty slot the next time we add a flag (Alexei)
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>