1. 16 Jul, 2021 15 commits
  2. 15 Jul, 2021 16 commits
    • Daniel Borkmann's avatar
      Merge branch 'bpf-timers' · 76283171
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      The first request to support timers in bpf was made in 2013 before sys_bpf
      syscall was added. That use case was periodic sampling. It was address with
      attaching bpf programs to perf_events. Then during XDP development the timers
      were requested to do garbage collection and health checks. They were worked
      around by implementing timers in user space and triggering progs with
      BPF_PROG_RUN command. The user space timers and perf_event+bpf timers are not
      armed by the bpf program. They're done asynchronously vs program execution.
      The XDP program cannot send a packet and arm the timer at the same time. The
      tracing prog cannot record an event and arm the timer right away. This large
      class of use cases remained unaddressed. The jiffy based and hrtimer based
      timers are essential part of the kernel development and with this patch set
      the hrtimer based timers will be available to bpf programs.
      
      TLDR: bpf timers is a wrapper of hrtimers with all the extra safety added
      to make sure bpf progs cannot crash the kernel.
      
      v6->v7:
      - address Andrii's comments and add his Acks.
      
      v5->v6:
      - address code review feedback from Martin and add his Acks.
      - add usercnt > 0 check to bpf_timer_init and remove timers_cancel_and_free
      second loop in map_free callbacks.
      - add cond_resched_rcu.
      
      v4->v5:
      - Martin noticed the following issues:
      . prog could be reallocated bpf_patch_insn_data().
      Fixed by passing 'aux' into bpf_timer_set_callback, since 'aux' is stable
      during insn patching.
      . Added missing rcu_read_lock.
      . Removed redundant record_map.
      - Discovered few bugs with stress testing:
      . One cpu does htab_free_prealloced_timers->bpf_timer_cancel_and_free->hrtimer_cancel
      while another is trying to do something with the timer like bpf_timer_start/set_callback.
      Those ops try to acquire bpf_spin_lock that is already taken by bpf_timer_cancel_and_free,
      so both cpus spin forever. The same problem existed in bpf_timer_cancel().
      One bpf prog on one cpu might call bpf_timer_cancel and wait, while another cpu is in
      the timer callback that tries to do bpf_timer_*() helper on the same timer.
      The fix is to do drop_prog_refcnt() and unlock. And only then hrtimer_cancel.
      Because of this had to add callback_fn != NULL check to bpf_timer_cb().
      Also removed redundant bpf_prog_inc/put from bpf_timer_cb() and replaced
      with rcu_dereference_check similar to recent rcu_read_lock-removal from drivers.
      bpf_timer_cb is in softirq.
      . Managed to hit refcnt==0 while doing bpf_prog_put from bpf_timer_cancel_and_free().
      That exposed the issue that bpf_prog_put wasn't ready to be called from irq context.
      Fixed similar to bpf_map_put which is irq ready.
      - Refactored BPF_CALL_1(bpf_spin_lock) into __bpf_spin_lock_irqsave() to
      make the main logic more clear, since Martin and Yonghong brought up this concern.
      
      v3->v4:
      1.
      Split callback_fn from bpf_timer_start into bpf_timer_set_callback as
      suggested by Martin. That makes bpf timer api match one to one to
      kernel hrtimer api and provides greater flexibility.
      2.
      Martin also discovered the following issue with uref approach:
      bpftool prog load xdp_timer.o /sys/fs/bpf/xdp_timer type xdp
      bpftool net attach xdpgeneric pinned /sys/fs/bpf/xdp_timer dev lo
      rm /sys/fs/bpf/xdp_timer
      nc -6 ::1 8888
      bpftool net detach xdpgeneric dev lo
      The timer callback stays active in the kernel though the prog was detached
      and map usercnt == 0.
      It happened because 'bpftool prog load' pinned the prog only.
      The map usercnt went to zero. Subsequent attach and runs didn't
      affect map usercnt. The timer was able to start and bpf_prog_inc itself.
      When the prog was detached the prog stayed active.
      To address this issue added
      if (!atomic64_read(&(t->map->usercnt))) return -EPERM;
      to the first patch.
      Which means that timers are allowed only in the maps that are held
      by user space with open file descriptor or maps pinned in bpffs.
      3.
      Discovered that timers in inner maps were broken.
      The inner map pointers are dynamic. Therefore changed bpf_timer_init()
      to accept explicit map pointer supplied by the program instead
      of hidden map pointer supplied by the verifier.
      To make sure that pointer to a timer actually belongs to that map
      added the verifier check in patch 3.
      4.
      Addressed Yonghong's feedback. Improved comments and added
      dynamic in_nmi() check.
      Added Acks.
      
      v2->v3:
      The v2 approach attempted to bump bpf_prog refcnt when bpf_timer_start is
      called to make sure callback code doesn't disappear when timer is active and
      drop refcnt when timer cb is done. That led to a ton of race conditions between
      callback running and concurrent bpf_timer_init/start/cancel on another cpu,
      and concurrent bpf_map_update/delete_elem, and map destroy.
      
      Then v2.5 approach skipped prog refcnt altogether. Instead it remembered all
      timers that bpf prog armed in a link list and canceled them when prog refcnt
      went to zero. The race conditions disappeared, but timers in map-in-map could
      not be supported cleanly, since timers in inner maps have inner map's life time
      and don't match prog's life time.
      
      This v3 approach makes timers to be owned by maps. It allows timers in inner
      maps to be supported from the start. This apporach relies on "user refcnt"
      scheme used in prog_array that stores bpf programs for bpf_tail_call. The
      bpf_timer_start() increments prog refcnt, but unlike 1st approach the timer
      callback does decrement the refcnt. The ops->map_release_uref is
      responsible for cancelling the timers and dropping prog refcnt when user space
      reference to a map is dropped. That addressed all the races and simplified
      locking.
      
      Andrii presented a use case where specifying callback_fn in bpf_timer_init()
      is inconvenient vs specifying in bpf_timer_start(). The bpf_timer_init()
      typically is called outside for timer callback, while bpf_timer_start() most
      likely will be called from the callback.
      timer_cb() { ... bpf_timer_start(timer_cb); ...} looks like recursion and as
      infinite loop to the verifier. The verifier had to be made smarter to recognize
      such async callbacks. Patches 7,8,9 addressed that.
      
      Patch 1 and 2 refactoring.
      Patch 3 implements bpf timer helpers and locking.
      Patch 4 implements map side of bpf timer support.
      Patch 5 prevent pointer mismatch in bpf_timer_init.
      Patch 6 adds support for BTF in inner maps.
      Patch 7 teaches check_cfg() pass to understand async callbacks.
      Patch 8 teaches do_check() pass to understand async callbacks.
      Patch 9 teaches check_max_stack_depth() pass to understand async callbacks.
      Patches 10 and 11 are the tests.
      
      v1->v2:
      - Addressed great feedback from Andrii and Toke.
      - Fixed race between parallel bpf_timer_*() ops.
      - Fixed deadlock between timer callback and LRU eviction or bpf_map_delete/update.
      - Disallowed mmap and global timers.
      - Allow spin_lock and bpf_timer in an element.
      - Fixed memory leaks due to map destruction and LRU eviction.
      - A ton more tests.
      ====================
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      76283171
    • Alexei Starovoitov's avatar
      selftests/bpf: Add a test with bpf_timer in inner map. · 61f71e74
      Alexei Starovoitov authored
      Check that map-in-map supports bpf timers.
      
      Check that indirect "recursion" of timer callbacks works:
      timer_cb1() { bpf_timer_set_callback(timer_cb2); }
      timer_cb2() { bpf_timer_set_callback(timer_cb1); }
      
      Check that
        bpf_map_release
          htab_free_prealloced_timers
            bpf_timer_cancel_and_free
              hrtimer_cancel
      works while timer cb is running.
      "while true; do ./test_progs -t timer_mim; done"
      is a great stress test. It caught missing timer cancel in htab->extra_elems.
      
      timer_mim_reject.c is a negative test that checks
      that timer<->map mismatch is prevented.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-12-alexei.starovoitov@gmail.com
      61f71e74
    • Alexei Starovoitov's avatar
      selftests/bpf: Add bpf_timer test. · 3540f7c6
      Alexei Starovoitov authored
      Add bpf_timer test that creates timers in preallocated and
      non-preallocated hash, in array and in lru maps.
      Let array timer expire once and then re-arm it for 35 seconds.
      Arm lru timer into the same callback.
      Then arm and re-arm hash timers 10 times each.
      At the last invocation of prealloc hash timer cancel the array timer.
      Force timer free via LRU eviction and direct bpf_map_delete_elem.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-11-alexei.starovoitov@gmail.com
      3540f7c6
    • Alexei Starovoitov's avatar
      bpf: Teach stack depth check about async callbacks. · 7ddc80a4
      Alexei Starovoitov authored
      Teach max stack depth checking algorithm about async callbacks
      that don't increase bpf program stack size.
      Also add sanity check that bpf_tail_call didn't sneak into async cb.
      It's impossible, since PTR_TO_CTX is not available in async cb,
      hence the program cannot contain bpf_tail_call(ctx,...);
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
      7ddc80a4
    • Alexei Starovoitov's avatar
      bpf: Implement verifier support for validation of async callbacks. · bfc6bb74
      Alexei Starovoitov authored
      bpf_for_each_map_elem() and bpf_timer_set_callback() helpers are relying on
      PTR_TO_FUNC infra in the verifier to validate addresses to subprograms
      and pass them into the helpers as function callbacks.
      In case of bpf_for_each_map_elem() the callback is invoked synchronously
      and the verifier treats it as a normal subprogram call by adding another
      bpf_func_state and new frame in __check_func_call().
      bpf_timer_set_callback() doesn't invoke the callback directly.
      The subprogram will be called asynchronously from bpf_timer_cb().
      Teach the verifier to validate such async callbacks as special kind
      of jump by pushing verifier state into stack and let pop_stack() process it.
      
      Special care needs to be taken during state pruning.
      The call insn doing bpf_timer_set_callback has to be a prune_point.
      Otherwise short timer callbacks might not have prune points in front of
      bpf_timer_set_callback() which means is_state_visited() will be called
      after this call insn is processed in __check_func_call(). Which means that
      another async_cb state will be pushed to be walked later and the verifier
      will eventually hit BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
      Since push_async_cb() looks like another push_stack() branch the
      infinite loop detection will trigger false positive. To recognize
      this case mark such states as in_async_callback_fn.
      To distinguish infinite loop in async callback vs the same callback called
      with different arguments for different map and timer add async_entry_cnt
      to bpf_func_state.
      
      Enforce return zero from async callbacks.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
      bfc6bb74
    • Alexei Starovoitov's avatar
      bpf: Relax verifier recursion check. · 86fc6ee6
      Alexei Starovoitov authored
      In the following bpf subprogram:
      static int timer_cb(void *map, void *key, void *value)
      {
          bpf_timer_set_callback(.., timer_cb);
      }
      
      the 'timer_cb' is a pointer to a function.
      ld_imm64 insn is used to carry this pointer.
      bpf_pseudo_func() returns true for such ld_imm64 insn.
      
      Unlike bpf_for_each_map_elem() the bpf_timer_set_callback() is asynchronous.
      Relax control flow check to allow such "recursion" that is seen as an infinite
      loop by check_cfg(). The distinction between bpf_for_each_map_elem() the
      bpf_timer_set_callback() is done in the follow up patch.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-8-alexei.starovoitov@gmail.com
      86fc6ee6
    • Alexei Starovoitov's avatar
      bpf: Remember BTF of inner maps. · 40ec00ab
      Alexei Starovoitov authored
      BTF is required for 'struct bpf_timer' to be recognized inside map value.
      The bpf timers are supported inside inner maps.
      Remember 'struct btf *' in inner_map_meta to make it available
      to the verifier in the sequence:
      
      struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
      if (inner_map)
          timer = bpf_map_lookup_elem(&inner_map, ...);
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-7-alexei.starovoitov@gmail.com
      40ec00ab
    • Alexei Starovoitov's avatar
      bpf: Prevent pointer mismatch in bpf_timer_init. · 3e8ce298
      Alexei Starovoitov authored
      bpf_timer_init() arguments are:
      1. pointer to a timer (which is embedded in map element).
      2. pointer to a map.
      Make sure that pointer to a timer actually belongs to that map.
      
      Use map_uid (which is unique id of inner map) to reject:
      inner_map1 = bpf_map_lookup_elem(outer_map, key1)
      inner_map2 = bpf_map_lookup_elem(outer_map, key2)
      if (inner_map1 && inner_map2) {
          timer = bpf_map_lookup_elem(inner_map1);
          if (timer)
              // mismatch would have been allowed
              bpf_timer_init(timer, inner_map2);
      }
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-6-alexei.starovoitov@gmail.com
      3e8ce298
    • Alexei Starovoitov's avatar
      bpf: Add map side support for bpf timers. · 68134668
      Alexei Starovoitov authored
      Restrict bpf timers to array, hash (both preallocated and kmalloced), and
      lru map types. The per-cpu maps with timers don't make sense, since 'struct
      bpf_timer' is a part of map value. bpf timers in per-cpu maps would mean that
      the number of timers depends on number of possible cpus and timers would not be
      accessible from all cpus. lpm map support can be added in the future.
      The timers in inner maps are supported.
      
      The bpf_map_update/delete_elem() helpers and sys_bpf commands cancel and free
      bpf_timer in a given map element.
      
      Similar to 'struct bpf_spin_lock' BTF is required and it is used to validate
      that map element indeed contains 'struct bpf_timer'.
      
      Make check_and_init_map_value() init both bpf_spin_lock and bpf_timer when
      map element data is reused in preallocated htab and lru maps.
      
      Teach copy_map_value() to support both bpf_spin_lock and bpf_timer in a single
      map element. There could be one of each, but not more than one. Due to 'one
      bpf_timer in one element' restriction do not support timers in global data,
      since global data is a map of single element, but from bpf program side it's
      seen as many global variables and restriction of single global timer would be
      odd. The sys_bpf map_freeze and sys_mmap syscalls are not allowed on maps with
      timers, since user space could have corrupted mmap element and crashed the
      kernel. The maps with timers cannot be readonly. Due to these restrictions
      search for bpf_timer in datasec BTF in case it was placed in the global data to
      report clear error.
      
      The previous patch allowed 'struct bpf_timer' as a first field in a map
      element only. Relax this restriction.
      
      Refactor lru map to s/bpf_lru_push_free/htab_lru_push_free/ to cancel and free
      the timer when lru map deletes an element as a part of it eviction algorithm.
      
      Make sure that bpf program cannot access 'struct bpf_timer' via direct load/store.
      The timer operation are done through helpers only.
      This is similar to 'struct bpf_spin_lock'.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-5-alexei.starovoitov@gmail.com
      68134668
    • Alexei Starovoitov's avatar
      bpf: Introduce bpf timers. · b00628b1
      Alexei Starovoitov authored
      Introduce 'struct bpf_timer { __u64 :64; __u64 :64; };' that can be embedded
      in hash/array/lru maps as a regular field and helpers to operate on it:
      
      // Initialize the timer.
      // First 4 bits of 'flags' specify clockid.
      // Only CLOCK_MONOTONIC, CLOCK_REALTIME, CLOCK_BOOTTIME are allowed.
      long bpf_timer_init(struct bpf_timer *timer, struct bpf_map *map, int flags);
      
      // Configure the timer to call 'callback_fn' static function.
      long bpf_timer_set_callback(struct bpf_timer *timer, void *callback_fn);
      
      // Arm the timer to expire 'nsec' nanoseconds from the current time.
      long bpf_timer_start(struct bpf_timer *timer, u64 nsec, u64 flags);
      
      // Cancel the timer and wait for callback_fn to finish if it was running.
      long bpf_timer_cancel(struct bpf_timer *timer);
      
      Here is how BPF program might look like:
      struct map_elem {
          int counter;
          struct bpf_timer timer;
      };
      
      struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1000);
          __type(key, int);
          __type(value, struct map_elem);
      } hmap SEC(".maps");
      
      static int timer_cb(void *map, int *key, struct map_elem *val);
      /* val points to particular map element that contains bpf_timer. */
      
      SEC("fentry/bpf_fentry_test1")
      int BPF_PROG(test1, int a)
      {
          struct map_elem *val;
          int key = 0;
      
          val = bpf_map_lookup_elem(&hmap, &key);
          if (val) {
              bpf_timer_init(&val->timer, &hmap, CLOCK_REALTIME);
              bpf_timer_set_callback(&val->timer, timer_cb);
              bpf_timer_start(&val->timer, 1000 /* call timer_cb2 in 1 usec */, 0);
          }
      }
      
      This patch adds helper implementations that rely on hrtimers
      to call bpf functions as timers expire.
      The following patches add necessary safety checks.
      
      Only programs with CAP_BPF are allowed to use bpf_timer.
      
      The amount of timers used by the program is constrained by
      the memcg recorded at map creation time.
      
      The bpf_timer_init() helper needs explicit 'map' argument because inner maps
      are dynamic and not known at load time. While the bpf_timer_set_callback() is
      receiving hidden 'aux->prog' argument supplied by the verifier.
      
      The prog pointer is needed to do refcnting of bpf program to make sure that
      program doesn't get freed while the timer is armed. This approach relies on
      "user refcnt" scheme used in prog_array that stores bpf programs for
      bpf_tail_call. The bpf_timer_set_callback() will increment the prog refcnt which is
      paired with bpf_timer_cancel() that will drop the prog refcnt. The
      ops->map_release_uref is responsible for cancelling the timers and dropping
      prog refcnt when user space reference to a map reaches zero.
      This uref approach is done to make sure that Ctrl-C of user space process will
      not leave timers running forever unless the user space explicitly pinned a map
      that contained timers in bpffs.
      
      bpf_timer_init() and bpf_timer_set_callback() will return -EPERM if map doesn't
      have user references (is not held by open file descriptor from user space and
      not pinned in bpffs).
      
      The bpf_map_delete_elem() and bpf_map_update_elem() operations cancel
      and free the timer if given map element had it allocated.
      "bpftool map update" command can be used to cancel timers.
      
      The 'struct bpf_timer' is explicitly __attribute__((aligned(8))) because
      '__u64 :64' has 1 byte alignment of 8 byte padding.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-4-alexei.starovoitov@gmail.com
      b00628b1
    • Alexei Starovoitov's avatar
      bpf: Factor out bpf_spin_lock into helpers. · c1b3fed3
      Alexei Starovoitov authored
      Move ____bpf_spin_lock/unlock into helpers to make it more clear
      that quadruple underscore bpf_spin_lock/unlock are irqsave/restore variants.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-3-alexei.starovoitov@gmail.com
      c1b3fed3
    • Alexei Starovoitov's avatar
      bpf: Prepare bpf_prog_put() to be called from irq context. · d809e134
      Alexei Starovoitov authored
      Currently bpf_prog_put() is called from the task context only.
      With addition of bpf timers the timer related helpers will start calling
      bpf_prog_put() from irq-saved region and in rare cases might drop
      the refcnt to zero.
      To address this case, first, convert bpf_prog_free_id() to be irq-save
      (this is similar to bpf_map_free_id), and, second, defer non irq
      appropriate calls into work queue.
      For example:
      bpf_audit_prog() is calling kmalloc and wake_up_interruptible,
      bpf_prog_kallsyms_del_all()->bpf_ksym_del()->spin_unlock_bh().
      They are not safe with irqs disabled.
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210715005417.78572-2-alexei.starovoitov@gmail.com
      d809e134
    • Tobias Klauser's avatar
      selftests/bpf: Remove unused variable in tc_tunnel prog · de587d56
      Tobias Klauser authored
      The variable buf is unused since commit 005edd16 ("selftests/bpf:
      convert bpf tunnel test to BPF_ADJ_ROOM_MAC"). Remove it to fix the
      following warning:
      
          test_tc_tunnel.c:531:7: warning: unused variable 'buf' [-Wunused-variable]
      
      Fixes: 005edd16 ("selftests/bpf: convert bpf tunnel test to BPF_ADJ_ROOM_MAC")
      Signed-off-by: default avatarTobias Klauser <tklauser@distanz.ch>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarWillem de Bruijn <willemb@google.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210713102719.8890-1-tklauser@distanz.ch
      de587d56
    • He Fengqing's avatar
      bpf: Fix potential memleak and UAF in the verifier. · 75f0fc7b
      He Fengqing authored
      In bpf_patch_insn_data(), we first use the bpf_patch_insn_single() to
      insert new instructions, then use adjust_insn_aux_data() to adjust
      insn_aux_data. If the old env->prog have no enough room for new inserted
      instructions, we use bpf_prog_realloc to construct new_prog and free the
      old env->prog.
      
      There have two errors here. First, if adjust_insn_aux_data() return
      ENOMEM, we should free the new_prog. Second, if adjust_insn_aux_data()
      return ENOMEM, bpf_patch_insn_data() will return NULL, and env->prog has
      been freed in bpf_prog_realloc, but we will use it in bpf_check().
      
      So in this patch, we make the adjust_insn_aux_data() never fails. In
      bpf_patch_insn_data(), we first pre-malloc memory for the new
      insn_aux_data, then call bpf_patch_insn_single() to insert new
      instructions, at last call adjust_insn_aux_data() to adjust
      insn_aux_data.
      
      Fixes: 8041902d ("bpf: adjust insn_aux_data when patching insns")
      Signed-off-by: default avatarHe Fengqing <hefengqing@huawei.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20210714101815.164322-1-hefengqing@huawei.com
      75f0fc7b
    • Kuniyuki Iwashima's avatar
      bpf: Fix a typo of reuseport map in bpf.h. · f170acda
      Kuniyuki Iwashima authored
      Fix s/BPF_MAP_TYPE_REUSEPORT_ARRAY/BPF_MAP_TYPE_REUSEPORT_SOCKARRAY/ typo
      in bpf.h.
      
      Fixes: 2dbb9b9e ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
      Signed-off-by: default avatarKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarMartin KaFai Lau <kafai@fb.com>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210714124317.67526-1-kuniyu@amazon.co.jp
      f170acda
    • Alexei Starovoitov's avatar
      bpf: Sync tools/include/uapi/linux/bpf.h · cf2c6f08
      Alexei Starovoitov authored
      Commit 47316f4a missed updating tools/.../bpf.h.
      Sync it.
      
      Fixes: 47316f4a ("bpf: Support input xdp_md context in BPF_PROG_TEST_RUN")
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      cf2c6f08
  3. 12 Jul, 2021 1 commit
    • Martynas Pumputis's avatar
      libbpf: Fix reuse of pinned map on older kernel · 97eb3138
      Martynas Pumputis authored
      When loading a BPF program with a pinned map, the loader checks whether
      the pinned map can be reused, i.e. their properties match. To derive
      such of the pinned map, the loader invokes BPF_OBJ_GET_INFO_BY_FD and
      then does the comparison.
      
      Unfortunately, on < 4.12 kernels the BPF_OBJ_GET_INFO_BY_FD is not
      available, so loading the program fails with the following error:
      
      	libbpf: failed to get map info for map FD 5: Invalid argument
      	libbpf: couldn't reuse pinned map at
      		'/sys/fs/bpf/tc/globals/cilium_call_policy': parameter
      		mismatch"
      	libbpf: map 'cilium_call_policy': error reusing pinned map
      	libbpf: map 'cilium_call_policy': failed to create:
      		Invalid argument(-22)
      	libbpf: failed to load object 'bpf_overlay.o'
      
      To fix this, fallback to derivation of the map properties via
      /proc/$PID/fdinfo/$MAP_FD if BPF_OBJ_GET_INFO_BY_FD fails with EINVAL,
      which can be used as an indicator that the kernel doesn't support
      the latter.
      Signed-off-by: default avatarMartynas Pumputis <m@lambda.lt>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20210712125552.58705-1-m@lambda.lt
      97eb3138
  4. 08 Jul, 2021 8 commits
    • Jesper Dangaard Brouer's avatar
      samples/bpf: xdp_redirect_cpu_user: Cpumap qsize set larger default · eff94154
      Jesper Dangaard Brouer authored
      Experience from production shows queue size of 192 is too small, as
      this caused packet drops during cpumap-enqueue on RX-CPU.  This can be
      diagnosed with xdp_monitor sample program.
      
      This bpftrace program was used to diagnose the problem in more detail:
      
       bpftrace -e '
        tracepoint:xdp:xdp_cpumap_kthread { @deq_bulk = lhist(args->processed,0,10,1); @drop_net = lhist(args->drops,0,10,1) }
        tracepoint:xdp:xdp_cpumap_enqueue { @enq_bulk = lhist(args->processed,0,10,1); @enq_drops = lhist(args->drops,0,10,1); }'
      
      Watch out for the @enq_drops counter. The @drop_net counter can happen
      when netstack gets invalid packets, so don't despair it can be
      natural, and that counter will likely disappear in newer kernels as it
      was a source of confusion (look at netstat info for reason of the
      netstack @drop_net counters).
      
      The production system was configured with CPU power-saving C6 state.
      Learn more in this blogpost[1].
      
      And wakeup latency in usec for the states are:
      
       # grep -H . /sys/devices/system/cpu/cpu0/cpuidle/*/latency
       /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
       /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
       /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
       /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:133
      
      Deepest state take 133 usec to wakeup from (133/10^6). The link speed
      is 25Gbit/s ((25*10^9/8) in bytes/sec). How many bytes can arrive with
      in 133 usec at this speed: (25*10^9/8)*(133/10^6) = 415625 bytes. With
      MTU size packets this is 275 packets, and with minimum Ethernet (incl
      intergap overhead) 84 bytes it is 4948 packets. Clearly default queue
      size is too small.
      
      Setting default cpumap queue to 2048 as worst-case (small packet) at
      10Gbit/s is 1979 packets with 133 usec wakeup time, +64 packet before
      kthread wakeup call (due to xdp_do_flush) worst-case 2043 packets.
      
      Thus, if a packet burst on RX-CPU will enqueue packets to a remote
      cpumap CPU that is in deep-sleep state it can overrun the cpumap queue.
      
      The production system was also configured to avoid deep-sleep via:
       tuned-adm profile network-latency
      
      [1] https://jeremyeder.com/2013/08/30/oh-did-you-expect-the-cpu/Signed-off-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/162523477604.786243.13372630844944530891.stgit@firesoul
      eff94154
    • Alexei Starovoitov's avatar
      Merge branch 'Generic XDP improvements' · e0bc8927
      Alexei Starovoitov authored
      Kumar Kartikeya says:
      
      ====================
      
      This small series makes some improvements to generic XDP mode and brings it
      closer to native XDP. Patch 1 splits out generic XDP processing into reusable
      parts, patch 2 adds pointer friendly wrappers for bitops (not have to cast back
      and forth the address of local pointer to unsigned long *), patch 3 implements
      generic cpumap support (details in commit) and patch 4 allows devmap bpf prog
      execution before generic_xdp_tx is called.
      
      Patch 5 just updates a couple of selftests to adapt to changes in behavior (in
      that specifying devmap/cpumap prog fd in generic mode is now allowed).
      
      Changelog:
      ----------
      v5 -> v6
      v5: https://lore.kernel.org/bpf/20210701002759.381983-1-memxor@gmail.com
       * Put rcpu->prog check before RCU-bh section to avoid do_softirq (Jesper)
      
      v4 -> v5
      v4: https://lore.kernel.org/bpf/20210628114746.129669-1-memxor@gmail.com
       * Add comments and examples for new bitops macros (Alexei)
      
      v3 -> v4
      v3: https://lore.kernel.org/bpf/20210622202835.1151230-1-memxor@gmail.com
       * Add detach now that attach of XDP program succeeds (Toke)
       * Clean up the test to use new ASSERT macros
      
      v2 -> v3
      v2: https://lore.kernel.org/bpf/20210622195527.1110497-1-memxor@gmail.com
       * list_for_each_entry -> list_for_each_entry_safe (due to deletion of skb)
      
      v1 -> v2
      v1: https://lore.kernel.org/bpf/20210620233200.855534-1-memxor@gmail.com
       * Move __ptr_{set,clear,test}_bit to bitops.h (Toke)
         Also changed argument order to match the bit op they wrap.
       * Remove map value size checking functions for cpumap/devmap (Toke)
       * Rework prog run for skb in cpu_map_kthread_run (Toke)
       * Set skb->dev to dst->dev after devmap prog has run
       * Don't set xdp rxq that will be overwritten in cpumap prog run
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      e0bc8927
    • Kumar Kartikeya Dwivedi's avatar
      bpf: Tidy xdp attach selftests · 36246d5a
      Kumar Kartikeya Dwivedi authored
      Support for cpumap and devmap entry progs in previous commits means the
      test needs to be updated for the new semantics. Also take this
      opportunity to convert it from CHECK macros to the new ASSERT macros.
      
      Since xdp_cpumap_attach has no subtest, put the sole test inside the
      test_xdp_cpumap_attach function.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-6-memxor@gmail.com
      36246d5a
    • Kumar Kartikeya Dwivedi's avatar
      bpf: devmap: Implement devmap prog execution for generic XDP · 2ea5eaba
      Kumar Kartikeya Dwivedi authored
      This lifts the restriction on running devmap BPF progs in generic
      redirect mode. To match native XDP behavior, it is invoked right before
      generic_xdp_tx is called, and only supports XDP_PASS/XDP_ABORTED/
      XDP_DROP actions.
      
      We also return 0 even if devmap program drops the packet, as
      semantically redirect has already succeeded and the devmap prog is the
      last point before TX of the packet to device where it can deliver a
      verdict on the packet.
      
      This also means it must take care of freeing the skb, as
      xdp_do_generic_redirect callers only do that in case an error is
      returned.
      
      Since devmap entry prog is supported, remove the check in
      generic_xdp_install entirely.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-5-memxor@gmail.com
      2ea5eaba
    • Kumar Kartikeya Dwivedi's avatar
      bpf: cpumap: Implement generic cpumap · 11941f8a
      Kumar Kartikeya Dwivedi authored
      This change implements CPUMAP redirect support for generic XDP programs.
      The idea is to reuse the cpu map entry's queue that is used to push
      native xdp frames for redirecting skb to a different CPU. This will
      match native XDP behavior (in that RPS is invoked again for packet
      reinjected into networking stack).
      
      To be able to determine whether the incoming skb is from the driver or
      cpumap, we reuse skb->redirected bit that skips generic XDP processing
      when it is set. To always make use of this, CONFIG_NET_REDIRECT guard on
      it has been lifted and it is always available.
      
      >From the redirect side, we add the skb to ptr_ring with its lowest bit
      set to 1.  This should be safe as skb is not 1-byte aligned. This allows
      kthread to discern between xdp_frames and sk_buff. On consumption of the
      ptr_ring item, the lowest bit is unset.
      
      In the end, the skb is simply added to the list that kthread is anyway
      going to maintain for xdp_frames converted to skb, and then received
      again by using netif_receive_skb_list.
      
      Bulking optimization for generic cpumap is left as an exercise for a
      future patch for now.
      
      Since cpumap entry progs are now supported, also remove check in
      generic_xdp_install for the cpumap.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Acked-by: default avatarJesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-4-memxor@gmail.com
      11941f8a
    • Kumar Kartikeya Dwivedi's avatar
      bitops: Add non-atomic bitops for pointers · cb0f8003
      Kumar Kartikeya Dwivedi authored
      cpumap needs to set, clear, and test the lowest bit in skb pointer in
      various places. To make these checks less noisy, add pointer friendly
      bitop macros that also do some typechecking to sanitize the argument.
      
      These wrap the non-atomic bitops __set_bit, __clear_bit, and test_bit
      but for pointer arguments. Pointer's address has to be passed in and it
      is treated as an unsigned long *, since width and representation of
      pointer and unsigned long match on targets Linux supports. They are
      prefixed with double underscore to indicate lack of atomicity.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-3-memxor@gmail.com
      cb0f8003
    • Kumar Kartikeya Dwivedi's avatar
      net: core: Split out code to run generic XDP prog · fe21cb91
      Kumar Kartikeya Dwivedi authored
      This helper can later be utilized in code that runs cpumap and devmap
      programs in generic redirect mode and adjust skb based on changes made
      to xdp_buff.
      
      When returning XDP_REDIRECT/XDP_TX, it invokes __skb_push, so whenever a
      generic redirect path invokes devmap/cpumap prog if set, it must
      __skb_pull again as we expect mac header to be pulled.
      
      It also drops the skb_reset_mac_len call after do_xdp_generic, as the
      mac_header and network_header are advanced by the same offset, so the
      difference (mac_len) remains constant.
      Signed-off-by: default avatarKumar Kartikeya Dwivedi <memxor@gmail.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      Reviewed-by: default avatarToke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20210702111825.491065-2-memxor@gmail.com
      fe21cb91
    • Alexei Starovoitov's avatar
      Merge branch 'bpf: support input xdp_md context in BPF_PROG_TEST_RUN' · a080cdcc
      Alexei Starovoitov authored
      Zvi Effron says:
      
      ====================
      
      This patchset adds support for passing an xdp_md via ctx_in/ctx_out in
      bpf_attr for BPF_PROG_TEST_RUN of XDP programs.
      
      Patch 1 adds a function to validate XDP meta data lengths.
      
      Patch 2 adds initial support for passing XDP meta data in addition to
      packet data.
      
      Patch 3 adds support for also specifying the ingress interface and
      rx queue.
      
      Patch 4 adds selftests to ensure functionality is correct.
      
      Changelog:
      ----------
      v7->v8
      v7: https://lore.kernel.org/bpf/20210624211304.90807-1-zeffron@riotgames.com/
      
       * Fix too long comment line in patch 3
      
      v6->v7
      v6: https://lore.kernel.org/bpf/20210617232904.1899-1-zeffron@riotgames.com/
      
       * Add Yonghong Song's Acked-by to commit message in patch 1
       * Add Yonghong Song's Acked-by to commit message in patch 2
       * Extracted the post-update of the xdp_md context into a function (again)
       * Validate that the rx queue was registered with XDP info
       * Decrement the reference count on a found netdevice on failure to find
        a valid rx queue
       * Decrement the reference count on a found netdevice after the XDP
        program is run
       * Drop Yonghong Song's Acked-By for patch 3 because of patch changes
       * Improve a comment in the selftests
       * Drop Yonghong Song's Acked-By for patch 4 because of patch changes
      
      v5->v6
      v5: https://lore.kernel.org/bpf/20210616224712.3243-1-zeffron@riotgames.com/
      
       * Correct commit messages in patches 1 and 3
       * Add Acked-by to commit message in patch 4
       * Use gotos instead of returns to correctly free resources in
        bpf_prog_test_run_xdp
       * Rename xdp_metalen_valid to xdp_metalen_invalid
       * Improve the function signature for xdp_metalen_invalid
       * Merged declaration of ingress_ifindex and rx_queue_index into one line
      
      v4->v5
      v4: https://lore.kernel.org/bpf/20210604220235.6758-1-zeffron@riotgames.com/
      
       * Add new patch to introduce xdp_metalen_valid inline function to avoid
        duplicated code from net/core/filter.c
       * Correct size of bad_ctx in selftests
       * Make all declarations reverse Christmas tree
       * Move data check from xdp_convert_md_to_buff to bpf_prog_test_run_xdp
       * Merge xdp_convert_buff_to_md into bpf_prog_test_run_xdp
       * Fix line too long
       * Extracted common checks in selftests to a helper function
       * Removed redundant assignment in selftests
       * Reordered test cases in selftests
       * Check data against 0 instead of data_meta in selftests
       * Made selftests use EINVAL instead of hardcoded 22
       * Dropped "_" from XDP function name
       * Changed casts in XDP program from unsigned long to long
       * Added a comment explaining the use of the loopback interface in selftests
       * Change parameter order in xdp_convert_md_to_buff to be input first
       * Assigned xdp->ingress_ifindex and xdp->rx_queue_index to local variables in
        xdp_convert_md_to_buff
       * Made use of "meta data" versus "metadata" consistent in comments and commit
        messages
      
      v3->v4
      v3: https://lore.kernel.org/bpf/20210602190815.8096-1-zeffron@riotgames.com/
      
       * Clean up nits
       * Validate xdp_md->data_end in bpf_prog_test_run_xdp
       * Remove intermediate metalen variables
      
      v2 -> v3
      v2: https://lore.kernel.org/bpf/20210527201341.7128-1-zeffron@riotgames.com/
      
       * Check errno first in selftests
       * Use DECLARE_LIBBPF_OPTS
       * Rename tattr to opts in selftests
       * Remove extra new line
       * Rename convert_xdpmd_to_xdpb to xdp_convert_md_to_buff
       * Rename convert_xdpb_to_xdpmd to xdp_convert_buff_to_md
       * Move declaration of device and rxqueue in xdp_convert_md_to_buff to
        patch 2
       * Reorder the kfree calls in bpf_prog_test_run_xdp
      
      v1 -> v2
      v1: https://lore.kernel.org/bpf/20210524220555.251473-1-zeffron@riotgames.com
      
       * Fix null pointer dereference with no context
       * Use the BPF skeleton and replace CHECK with ASSERT macros
      ====================
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      a080cdcc