1. 24 Oct, 2023 11 commits
    • bpf: Fold smp_mb__before_atomic() into atomic_set_release() · 06646da0
      Paul E. McKenney authored
      The bpf_user_ringbuf_drain() BPF_CALL function uses an atomic_set()
      immediately preceded by smp_mb__before_atomic() so as to order storing
      of ring-buffer consumer and producer positions prior to the atomic_set()
      call's clearing of the ->busy flag, as follows:
      
              smp_mb__before_atomic();
              atomic_set(&rb->busy, 0);
      
      This works given current architectures and implementations, and
      given that this only needs to order prior writes against a later
      write. However, it works only by accident, because
      smp_mb__before_atomic() is guaranteed to work only with
      read-modify-write atomic operations, and not at all with things like
      atomic_set() and atomic_read().
      
      Note especially that smp_mb__before_atomic() will not, repeat *not*,
      order the prior write to "a" before the subsequent non-read-modify-write
      atomic read from "b", even on strongly ordered systems such as x86:
      
              WRITE_ONCE(a, 1);
              smp_mb__before_atomic();
              r1 = atomic_read(&b);
      
      Therefore, replace the smp_mb__before_atomic() and atomic_set() with
      atomic_set_release() as follows:
      
              atomic_set_release(&rb->busy, 0);
      
      This is no slower (and sometimes is faster) than the original, and also
      provides a formal guarantee of ordering that the original lacks.
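      
      For context, a minimal sketch of how the drain-path epilogue looks
      with this change, assuming the field names used in this message
      (consumer_pos, ->busy); this is an illustration rather than the
      verbatim kernel source:
      
              /* Publish the updated consumer position first... */
              smp_store_release(&rb->consumer_pos, cons_pos);
      
              /* ...then clear ->busy with release semantics:
               * atomic_set_release() guarantees that every prior store,
               * including the position update above, is ordered before
               * the store that lets the next drainer in, with no
               * separate barrier needed. */
              atomic_set_release(&rb->busy, 0);
      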
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: David Vernet <void@manifault.com>
      Link: https://lore.kernel.org/bpf/ec86d38e-cfb4-44aa-8fdb-6c925922d93c@paulmck-laptop
    • bpf: Fix unnecessary -EBUSY from htab_lock_bucket · d35381aa
      Song Liu authored
      htab_lock_bucket uses the following logic to avoid recursion:
      
      1. preempt_disable();
      2. check percpu counter htab->map_locked[hash] for recursion;
         2.1. if map_locked[hash] is already taken, return -EBUSY;
      3. raw_spin_lock_irqsave();
      
      However, if an IRQ hits between steps 2 and 3, BPF programs attached
      to the IRQ logic will not be able to access the same hash of the
      hashtab and will get -EBUSY.
      
      This -EBUSY is not really necessary. Fix it by disabling IRQs before
      checking map_locked:
      
      1. preempt_disable();
      2. local_irq_save();
      3. check percpu counter htab->map_locked[hash] for recursion;
         3.1. if map_locked[hash] is already taken, return -EBUSY;
      4. raw_spin_lock().
      
      Similarly, use raw_spin_unlock() and local_irq_restore() in
      htab_unlock_bucket().
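      
      A minimal sketch of the fixed lock path, assuming the structure and
      field names implied by the steps above (htab->map_locked, a
      per-bucket raw_lock); not the verbatim kernel source:
      
              static inline int htab_lock_bucket(struct bpf_htab *htab,
                                                 struct bucket *b, u32 hash,
                                                 unsigned long *pflags)
              {
                      unsigned long flags;
      
                      preempt_disable();
                      /* IRQs go off before the recursion check, so an IRQ
                       * can no longer fire between the check and the lock. */
                      local_irq_save(flags);
                      if (unlikely(__this_cpu_inc_return(*(htab->map_locked[hash])) != 1)) {
                              /* Genuine recursion on this CPU: back out. */
                              __this_cpu_dec(*(htab->map_locked[hash]));
                              local_irq_restore(flags);
                              preempt_enable();
                              return -EBUSY;
                      }
                      /* IRQs are already disabled: plain raw_spin_lock(). */
                      raw_spin_lock(&b->raw_lock);
                      *pflags = flags;
                      return 0;
              }
      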
      
      Fixes: 20b6cc34 ("bpf: Avoid hashtab deadlock with map_locked")
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/7a9576222aa40b1c84ad3a9ba3e64011d1a04d41.camel@linux.ibm.com
      Link: https://lore.kernel.org/bpf/20231012055741.3375999-1-song@kernel.org
    • xsk: Avoid starving the xsk further down the list · 99b29a49
      Albert Huang authored
      In the previous implementation, when multiple xsk sockets were
      associated with a single xsk_buff_pool, a socket at the front of
      the xsk_tx_list could keep consuming the Tx path while starving the
      xsk sockets further down the list. This could result in issues such
      as the inability to transmit packets, increased latency, and jitter.
      To address this problem, we introduce a new variable called
      tx_budget_spent, which limits each xsk to transmitting at most
      MAX_PER_SOCKET_BUDGET tx descriptors per batch. This ensures
      equitable opportunities for subsequent xsk sockets to send tx
      descriptors. The value of MAX_PER_SOCKET_BUDGET is set to 32.
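      
      A minimal sketch of the budget check in the Tx peek loop, following
      the description above (tx_budget_spent, MAX_PER_SOCKET_BUDGET); the
      surrounding loop and helper names are assumptions, not the verbatim
      kernel source:
      
              #define MAX_PER_SOCKET_BUDGET 32
      
              /* xs and desc come from the enclosing peek routine. */
              list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
                      /* This socket has spent its share of the batch; skip
                       * it so sockets further down the list get a turn. */
                      if (xs->tx_budget_spent >= MAX_PER_SOCKET_BUDGET)
                              continue;
      
                      if (!xskq_cons_peek_desc(&xs->tx, desc, pool))
                              continue;
      
                      xs->tx_budget_spent++;
                      /* ... hand the descriptor to the driver ... */
              }
      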
      Signed-off-by: Albert Huang <huangjie.albert@bytedance.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
      Link: https://lore.kernel.org/bpf/20231023125732.82261-1-huangjie.albert@bytedance.com
    • Merge branch 'exact-states-comparison-for-iterator-convergence-checks' · dedd6c89
      Alexei Starovoitov authored
      Eduard Zingerman says:
      
      ====================
      exact states comparison for iterator convergence checks
      
      Iterator convergence logic in is_state_visited() uses states_equal()
      for states with branches counter > 0 to check whether an iterator
      based loop converges. This is not fully correct, because
      states_equal() relies on the presence of read and precision marks on
      registers, and these marks are not guaranteed to be finalized while
      a state still has branches. The commit message for patch #3
      describes a program that exhibits such behavior.
      
      This patch-set aims to fix the iterator convergence logic by adding
      a notion of exact states comparison. Exact comparison does not rely
      on the presence of read or precision marks and thus is more strict.
      As explained in the commit message for patch #3, exact comparisons
      require the addition of speculative register bounds widening. The
      end result for BPF verifier users can be summarized as follows:
      
      (!) After this update the verifier will reject programs that conjure
          an imprecise value on the first loop iteration and use it as
          precise on the second (for iterator-based loops).
      
      I urge people to at least skim over the commit message for patch #3.
      
      Patches are organized as follows:
      - patches #1,2: moving/extracting utility functions;
      - patch #3: introduces exact mode for states comparison and adds
        widening heuristic;
      - patch #4: adds test-cases that demonstrate why the series is
        necessary;
      - patch #5: extends patch #3 with a notion of state loop entries;
        these entries have to be tracked to correctly identify that
        different verifier states belong to the same state loop;
      - patch #6: adds a test-case that demonstrates a program
        which requires loop entry tracking for correct verification;
      - patch #7: just adds a few debug prints.
      
      The following actions are planned as a followup for this patch-set:
      - implementation has to be adapted for callbacks handling logic as a
        part of a fix for [1];
      - it is necessary to explore ways to improve the widening heuristic
        so that the iters_task_vma test is handled without the need to
        insert barrier_var() calls;
      - the explored states eviction logic on cache miss has to be
        extended to either:
        - allow eviction of checkpoint states, or
        - be sped up in case there are many active checkpoints associated
          with the same instruction.
      
      The patch-set is a follow-up to the mailing list discussion [1].
      
      Changelog:
      - V2 [3] -> V3:
        - correct check for stack spills in widen_imprecise_scalars(),
          added test case progs/iters.c:widen_spill to check the behavior
          (suggested by Andrii);
        - allow eviction of checkpoint states in is_state_visited() to avoid
          pathological verifier performance when iterator based loop does not
          converge (discussion with Alexei).
      - V1 [2] -> V2, applied changes suggested by Alexei offlist:
        - __explored_state() function removed;
        - same_callsites() function is now used in clean_live_states();
        - patches #1,2 are added as preparatory code movement;
        - in process_iter_next_call() a safeguard is added to verify that
          cur_st->parent exists and has expected insn index / call sites.
      
      [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
      [2] https://lore.kernel.org/bpf/20231021005939.1041-1-eddyz87@gmail.com/
      [3] https://lore.kernel.org/bpf/20231022010812.9201-1-eddyz87@gmail.com/
      ====================
      
      Link: https://lore.kernel.org/r/20231024000917.12153-1-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: print full verifier states on infinite loop detection · b4d82395
      Eduard Zingerman authored
      Additional logging in is_state_visited(): if an infinite loop is
      detected, print the full verifier state for both the current and the
      equivalent states.
      Acked-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-8-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: test if state loops are detected in a tricky case · 64870fee
      Eduard Zingerman authored
      A convoluted test case for the iterator convergence logic that
      demonstrates that states with a branch count equal to 0 might still
      be part of a not yet fully explored loop.
      
      E.g. consider the following state diagram:
      
                     initial     Here state 'succ' was processed first,
                       |         it was eventually tracked to produce a
                       V         state identical to 'hdr'.
          .---------> hdr        All branches from 'succ' had been explored
          |            |         and thus 'succ' has its .branches == 0.
          |            V
          |    .------...        Suppose states 'cur' and 'succ' correspond
          |    |       |         to the same instruction + callsites.
          |    V       V         In such case it is necessary to check
          |   ...     ...        whether 'succ' and 'cur' are identical.
          |    |       |         If 'succ' and 'cur' are a part of the same loop
          |    V       V         they have to be compared exactly.
          |   succ <- cur
          |    |
          |    V
          |   ...
          |    |
          '----'
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-7-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: correct loop detection for iterators convergence · 2a099282
      Eduard Zingerman authored
      It turns out that .branches > 0 in is_state_visited() is not a
      sufficient condition to identify if two verifier states form a loop
      when iterator convergence is computed. This commit adds logic to
      distinguish situations like the ones below:
      
       (I)            initial       (II)            initial
                        |                             |
                        V                             V
           .---------> hdr                           ..
           |            |                             |
           |            V                             V
           |    .------...                    .------..
           |    |       |                     |       |
           |    V       V                     V       V
           |   ...     ...               .-> hdr     ..
           |    |       |                |    |       |
           |    V       V                |    V       V
           |   succ <- cur               |   succ <- cur
           |    |                        |    |
           |    V                        |    V
           |   ...                       |   ...
           |    |                        |    |
           '----'                        '----'
      
      For both (I) and (II), successor 'succ' of the current state 'cur'
      was previously explored and has a branch count of 0. However, the
      loop entry 'hdr' corresponding to 'succ' might be a part of the
      current DFS path. If that is the case, 'succ' and 'cur' are members
      of the same loop and have to be compared exactly.
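      
      A hedged sketch of how this check could look inside the states-list
      walk of is_state_visited(): force an exact comparison whenever the
      loop entry of 'succ' still has unexplored branches; helper names
      follow this description and are not necessarily the exact verifier
      source:
      
              struct bpf_verifier_state *loop_entry;
              bool force_exact;
      
              /* 'succ' may have .branches == 0 while its loop header is
               * still on the DFS path; consult the tracked loop entry. */
              loop_entry = get_loop_entry(&sl->state);
              force_exact = loop_entry && loop_entry->branches > 0;
              if (states_equal(env, &sl->state, cur, force_exact)) {
                      /* ... 'cur' converged, stop exploring this path ... */
              }
      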
      Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-6-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: tests with delayed read/precision marks in loop body · 389ede06
      Eduard Zingerman authored
      These test cases try to hide read and precision marks from the loop
      convergence logic: marks would only be assigned on subsequent loop
      iterations or after exploring states pushed to the env->head stack
      first. Without the verifier fix to use exact states comparison logic
      for iterator convergence, these tests (except 'triple_continue')
      would be erroneously marked as safe.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-5-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: exact states comparison for iterator convergence checks · 2793a8b0
      Eduard Zingerman authored
      Convergence for open-coded iterators is computed in
      is_state_visited() by examining states with branches count > 1 and
      using states_equal(). states_equal() computes the sub-state relation
      using read and precision marks. Read and precision marks are
      propagated from children states and thus are not guaranteed to be
      complete inside a loop when the branches count > 1. This can be
      demonstrated using the following unsafe program:
      
           1. r7 = -16
           2. r6 = bpf_get_prandom_u32()
           3. while (bpf_iter_num_next(&fp[-8])) {
           4.   if (r6 != 42) {
           5.     r7 = -32
           6.     r6 = bpf_get_prandom_u32()
           7.     continue
           8.   }
           9.   r0 = r10
          10.   r0 += r7
          11.   r8 = *(u64 *)(r0 + 0)
          12.   r6 = bpf_get_prandom_u32()
          13. }
      
      Here the verifier would first visit path 1-3, create a checkpoint at
      3 with r7=-16, and continue along 4-7,3 with r7=-32.
      
      Because instructions 9-12 had not been visited yet, the existing
      checkpoint at 3 does not have read or precision marks for r7.
      Thus states_equal() would return true and the verifier would discard
      the current state, so the unsafe memory access at 11 would not be
      caught.
      
      This commit fixes this loophole by introducing exact state comparisons
      for iterator convergence logic:
      - registers are compared using regs_exact() regardless of read or
        precision marks;
      - stack slots have to have identical type.
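      
      For reference, a hedged sketch of what such an exact register
      comparison could look like (byte-wise equality plus consistent id
      mapping); an assumption based on the description above, not
      necessarily the exact verifier source:
      
              static bool regs_exact(const struct bpf_reg_state *rold,
                                     const struct bpf_reg_state *rcur,
                                     struct bpf_idmap *idmap)
              {
                      /* Identical bits up to the id fields... */
                      return memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0 &&
                             /* ...and ids that map consistently between
                              * the old and the current state. */
                             check_ids(rold->id, rcur->id, idmap) &&
                             check_ids(rold->ref_obj_id, rcur->ref_obj_id, idmap);
              }
      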
      
      Unfortunately, this is too strict even for simple programs like below:
      
          i = 0;
          while(iter_next(&it))
            i++;
      
      At each iteration step, i++ would produce a new distinct state and
      eventually the instruction processing limit would be reached.
      
      To avoid such behavior, speculatively forget (widen) the range of
      imprecise scalar registers, if those registers were not precise at
      the end of the previous iteration and do not match exactly.
      
      This is a conservative heuristic that allows verification of a wide
      range of programs; however, it precludes verification of programs
      that conjure an imprecise value on the first loop iteration and use
      it as precise on the second.
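      
      A hedged sketch of the widening step described above, applied per
      register when comparing the previous iteration's checkpoint against
      the current state; names are assumptions, not the exact verifier
      source:
      
              static void maybe_widen_reg(struct bpf_verifier_env *env,
                                          struct bpf_reg_state *rold,
                                          struct bpf_reg_state *rcur)
              {
                      if (rold->type != SCALAR_VALUE || rcur->type != SCALAR_VALUE)
                              return;
                      /* Keep registers that were precise at the end of
                       * the previous iteration... */
                      if (rold->precise)
                              return;
                      /* ...or that already match exactly. */
                      if (memcmp(rold, rcur, offsetof(struct bpf_reg_state, id)) == 0)
                              return;
                      /* Widen: forget the range so that successive loop
                       * iterations can converge to an "unknown scalar". */
                      __mark_reg_unknown(env, rcur);
              }
      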
      
      Test case iter_task_vma_for_each() presents one such case:
      
              unsigned int seen = 0;
              ...
              bpf_for_each(task_vma, vma, task, 0) {
                      if (seen >= 1000)
                              break;
                      ...
                      seen++;
              }
      
      Here clang generates the following code:
      
      <LBB0_4>:
            24:       r8 = r6                          ; stash current value of
                      ... body ...                       'seen'
            29:       r1 = r10
            30:       r1 += -0x8
            31:       call bpf_iter_task_vma_next
            32:       r6 += 0x1                        ; seen++;
            33:       if r0 == 0x0 goto +0x2 <LBB0_6>  ; exit on next() == NULL
            34:       r7 += 0x10
            35:       if r8 < 0x3e7 goto -0xc <LBB0_4> ; loop on seen < 1000
      
      <LBB0_6>:
            ... exit ...
      
      Note that the counter in r6 is copied to r8 and then incremented,
      and the conditional jump is done using r8. Because of this, the
      precision mark on r6 lags one state behind the precision mark on r8,
      and the widening logic kicks in.
      
      Adding barrier_var(seen) after the conditional is sufficient to
      force clang to use the same register for both counting and the
      conditional jump, as in the sketch below.
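      
      For instance, the loop above could be adjusted as follows (a sketch
      based on the selftest shape shown earlier; barrier_var() is the BPF
      selftests helper that pins a variable to its register):
      
              unsigned int seen = 0;
              ...
              bpf_for_each(task_vma, vma, task, 0) {
                      if (seen >= 1000)
                              break;
                      /* Force clang to keep 'seen' in one register for
                       * the comparison above and the increment below. */
                      barrier_var(seen);
                      ...
                      seen++;
              }
      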
      
      This issue was discussed in the thread [1], which was started by
      Andrew Werner <awerner32@gmail.com> demonstrating a similar bug in
      callback function handling. Callbacks will be addressed in a
      follow-up patch.
      
      [1] https://lore.kernel.org/bpf/97a90da09404c65c8e810cf83c94ac703705dc0e.camel@gmail.com/
      Co-developed-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Co-developed-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-4-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: extract same_callsites() as utility function · 4c97259a
      Eduard Zingerman authored
      Extract same_callsites() from clean_live_states() as a utility
      function. This function will be used by the next patch in the set.
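      
      A hedged sketch of the extracted helper, based on its use in
      clean_live_states() (two states are comparable only if every active
      frame was entered from the same call instruction); not necessarily
      the verbatim verifier source:
      
              static bool same_callsites(struct bpf_verifier_state *a,
                                         struct bpf_verifier_state *b)
              {
                      int fr;
      
                      if (a->curframe != b->curframe)
                              return false;
      
                      /* Walk the call stack from the innermost frame out,
                       * comparing the call site of each frame. */
                      for (fr = a->curframe; fr >= 0; fr--)
                              if (a->frame[fr]->callsite != b->frame[fr]->callsite)
                                      return false;
      
                      return true;
              }
      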
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-3-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: move explored_state() closer to the beginning of verifier.c · 3c4e420c
      Eduard Zingerman authored
      Subsequent patches make use of the explored_state() function. Move
      it up to avoid adding an unnecessary prototype.
      Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20231024000917.12153-2-eddyz87@gmail.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 23 Oct, 2023 2 commits
  3. 20 Oct, 2023 18 commits
  4. 19 Oct, 2023 2 commits
  5. 18 Oct, 2023 2 commits
  6. 17 Oct, 2023 5 commits