1. 15 Dec, 2023 4 commits
    • Daniel Xu's avatar
      bpf: selftests: test_tunnel: Setup fresh topology for each subtest · 77a7a822
      Daniel Xu authored
      This helps with determinism b/c individual setup/teardown prevents
      leaking state between different subtests.
      Signed-off-by: default avatarDaniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/0fb59fa16fb58cca7def5239df606005a3e8dd0e.1702593901.git.dxu@dxuuu.xyzSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      77a7a822
    • Daniel Xu's avatar
      bpf: xfrm: Add bpf_xdp_get_xfrm_state() kfunc · 8f0ec8c6
      Daniel Xu authored
      This commit adds an unstable kfunc helper to access internal xfrm_state
      associated with an SA. This is intended to be used for the upcoming
      IPsec pcpu work to assign special pcpu SAs to a particular CPU. In other
      words: for custom software RSS.
      
      That being said, the function that this kfunc wraps is fairly generic
      and used for a lot of xfrm tasks. I'm sure people will find uses
      elsewhere over time.
      
      This commit also adds a corresponding bpf_xdp_xfrm_state_release() kfunc
      to release the refcnt acquired by bpf_xdp_get_xfrm_state(). The verifier
      will require that all acquired xfrm_state's are released.
      Co-developed-by: default avatarAntony Antony <antony.antony@secunet.com>
      Signed-off-by: default avatarAntony Antony <antony.antony@secunet.com>
      Acked-by: default avatarSteffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: default avatarDaniel Xu <dxu@dxuuu.xyz>
      Link: https://lore.kernel.org/r/a29699c42f5fad456b875c98dd11c6afc3ffb707.1702593901.git.dxu@dxuuu.xyzSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      8f0ec8c6
    • Yonghong Song's avatar
      selftests/bpf: Remove flaky test_btf_id test · 56925f38
      Yonghong Song authored
      With previous patch, one of subtests in test_btf_id becomes
      flaky and may fail. The following is a failing example:
      
        Error: #26 btf
        Error: #26/174 btf/BTF ID
          Error: #26/174 btf/BTF ID
          btf_raw_create:PASS:check 0 nsec
          btf_raw_create:PASS:check 0 nsec
          test_btf_id:PASS:check 0 nsec
          ...
          test_btf_id:PASS:check 0 nsec
          test_btf_id:FAIL:check BTF lingersdo_test_get_info:FAIL:check failed: -1
      
      The test tries to prove a btf_id not available after the map is closed.
      But btf_id is freed only after workqueue and a rcu grace period, compared
      to previous case just after a rcu grade period.
      Depending on system workload, workqueue could take quite some time
      to execute function bpf_map_free_deferred() which may cause the test failure.
      Instead of adding arbitrary delays, let us remove the logic to
      check btf_id availability after map is closed.
      Signed-off-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231214203820.1469402-1-yonghong.song@linux.devSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      56925f38
    • Yonghong Song's avatar
      bpf: Fix a race condition between btf_put() and map_free() · 59e5791f
      Yonghong Song authored
      When running `./test_progs -j` in my local vm with latest kernel,
      I once hit a kasan error like below:
      
        [ 1887.184724] BUG: KASAN: slab-use-after-free in bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.185599] Read of size 4 at addr ffff888106806910 by task kworker/u12:2/2830
        [ 1887.186498]
        [ 1887.186712] CPU: 3 PID: 2830 Comm: kworker/u12:2 Tainted: G           OEL     6.7.0-rc3-00699-g90679706-dirty #494
        [ 1887.188034] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
        [ 1887.189618] Workqueue: events_unbound bpf_map_free_deferred
        [ 1887.190341] Call Trace:
        [ 1887.190666]  <TASK>
        [ 1887.190949]  dump_stack_lvl+0xac/0xe0
        [ 1887.191423]  ? nf_tcp_handle_invalid+0x1b0/0x1b0
        [ 1887.192019]  ? panic+0x3c0/0x3c0
        [ 1887.192449]  print_report+0x14f/0x720
        [ 1887.192930]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.193459]  ? __virt_addr_valid+0xac/0x120
        [ 1887.194004]  ? bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.194572]  kasan_report+0xc3/0x100
        [ 1887.195085]  ? bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.195668]  bpf_rb_root_free+0x1f8/0x2b0
        [ 1887.196183]  ? __bpf_obj_drop_impl+0xb0/0xb0
        [ 1887.196736]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.197270]  ? preempt_count_sub+0x1c/0xd0
        [ 1887.197802]  ? _raw_spin_unlock+0x1f/0x40
        [ 1887.198319]  bpf_obj_free_fields+0x1d4/0x260
        [ 1887.198883]  array_map_free+0x1a3/0x260
        [ 1887.199380]  bpf_map_free_deferred+0x7b/0xe0
        [ 1887.199943]  process_scheduled_works+0x3a2/0x6c0
        [ 1887.200549]  worker_thread+0x633/0x890
        [ 1887.201047]  ? __kthread_parkme+0xd7/0xf0
        [ 1887.201574]  ? kthread+0x102/0x1d0
        [ 1887.202020]  kthread+0x1ab/0x1d0
        [ 1887.202447]  ? pr_cont_work+0x270/0x270
        [ 1887.202954]  ? kthread_blkcg+0x50/0x50
        [ 1887.203444]  ret_from_fork+0x34/0x50
        [ 1887.203914]  ? kthread_blkcg+0x50/0x50
        [ 1887.204397]  ret_from_fork_asm+0x11/0x20
        [ 1887.204913]  </TASK>
        [ 1887.204913]  </TASK>
        [ 1887.205209]
        [ 1887.205416] Allocated by task 2197:
        [ 1887.205881]  kasan_set_track+0x3f/0x60
        [ 1887.206366]  __kasan_kmalloc+0x6e/0x80
        [ 1887.206856]  __kmalloc+0xac/0x1a0
        [ 1887.207293]  btf_parse_fields+0xa15/0x1480
        [ 1887.207836]  btf_parse_struct_metas+0x566/0x670
        [ 1887.208387]  btf_new_fd+0x294/0x4d0
        [ 1887.208851]  __sys_bpf+0x4ba/0x600
        [ 1887.209292]  __x64_sys_bpf+0x41/0x50
        [ 1887.209762]  do_syscall_64+0x4c/0xf0
        [ 1887.210222]  entry_SYSCALL_64_after_hwframe+0x63/0x6b
        [ 1887.210868]
        [ 1887.211074] Freed by task 36:
        [ 1887.211460]  kasan_set_track+0x3f/0x60
        [ 1887.211951]  kasan_save_free_info+0x28/0x40
        [ 1887.212485]  ____kasan_slab_free+0x101/0x180
        [ 1887.213027]  __kmem_cache_free+0xe4/0x210
        [ 1887.213514]  btf_free+0x5b/0x130
        [ 1887.213918]  rcu_core+0x638/0xcc0
        [ 1887.214347]  __do_softirq+0x114/0x37e
      
      The error happens at bpf_rb_root_free+0x1f8/0x2b0:
      
        00000000000034c0 <bpf_rb_root_free>:
        ; {
          34c0: f3 0f 1e fa                   endbr64
          34c4: e8 00 00 00 00                callq   0x34c9 <bpf_rb_root_free+0x9>
          34c9: 55                            pushq   %rbp
          34ca: 48 89 e5                      movq    %rsp, %rbp
        ...
        ;       if (rec && rec->refcount_off >= 0 &&
          36aa: 4d 85 ed                      testq   %r13, %r13
          36ad: 74 a9                         je      0x3658 <bpf_rb_root_free+0x198>
          36af: 49 8d 7d 10                   leaq    0x10(%r13), %rdi
          36b3: e8 00 00 00 00                callq   0x36b8 <bpf_rb_root_free+0x1f8>
                                              <==== kasan function
          36b8: 45 8b 7d 10                   movl    0x10(%r13), %r15d
                                              <==== use-after-free load
          36bc: 45 85 ff                      testl   %r15d, %r15d
          36bf: 78 8c                         js      0x364d <bpf_rb_root_free+0x18d>
      
      So the problem is at rec->refcount_off in the above.
      
      I did some source code analysis and find the reason.
                                        CPU A                        CPU B
        bpf_map_put:
          ...
          btf_put with rcu callback
          ...
          bpf_map_free_deferred
            with system_unbound_wq
          ...                          ...                           ...
          ...                          btf_free_rcu:                 ...
          ...                          ...                           bpf_map_free_deferred:
          ...                          ...
          ...         --------->       btf_struct_metas_free()
          ...         | race condition ...
          ...         --------->                                     map->ops->map_free()
          ...
          ...                          btf->struct_meta_tab = NULL
      
      In the above, map_free() corresponds to array_map_free() and eventually
      calling bpf_rb_root_free() which calls:
        ...
        __bpf_obj_drop_impl(obj, field->graph_root.value_rec, false);
        ...
      
      Here, 'value_rec' is assigned in btf_check_and_fixup_fields() with following code:
      
        meta = btf_find_struct_meta(btf, btf_id);
        if (!meta)
          return -EFAULT;
        rec->fields[i].graph_root.value_rec = meta->record;
      
      So basically, 'value_rec' is a pointer to the record in struct_metas_tab.
      And it is possible that that particular record has been freed by
      btf_struct_metas_free() and hence we have a kasan error here.
      
      Actually it is very hard to reproduce the failure with current bpf/bpf-next
      code, I only got the above error once. To increase reproducibility, I added
      a delay in bpf_map_free_deferred() to delay map->ops->map_free(), which
      significantly increased reproducibility.
      
        diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
        index 5e43ddd1b83f..aae5b5213e93 100644
        --- a/kernel/bpf/syscall.c
        +++ b/kernel/bpf/syscall.c
        @@ -695,6 +695,7 @@ static void bpf_map_free_deferred(struct work_struct *work)
              struct bpf_map *map = container_of(work, struct bpf_map, work);
              struct btf_record *rec = map->record;
      
        +     mdelay(100);
              security_bpf_map_free(map);
              bpf_map_release_memcg(map);
              /* implementation dependent freeing */
      
      Hao also provided test cases ([1]) for easily reproducing the above issue.
      
      There are two ways to fix the issue, the v1 of the patch ([2]) moving
      btf_put() after map_free callback, and the v5 of the patch ([3]) using
      a kptr style fix which tries to get a btf reference during
      map_check_btf(). Each approach has its pro and cons. The first approach
      delays freeing btf while the second approach needs to acquire reference
      depending on context which makes logic not very elegant and may
      complicate things with future new data structures. Alexei
      suggested in [4] going back to v1 which is what this patch
      tries to do.
      
      Rerun './test_progs -j' with the above mdelay() hack for a couple
      of times and didn't observe the error for the above rb_root test cases.
      Running Hou's test ([1]) is also successful.
      
        [1] https://lore.kernel.org/bpf/20231207141500.917136-1-houtao@huaweicloud.com/
        [2] v1: https://lore.kernel.org/bpf/20231204173946.3066377-1-yonghong.song@linux.dev/
        [3] v5: https://lore.kernel.org/bpf/20231208041621.2968241-1-yonghong.song@linux.dev/
        [4] v4: https://lore.kernel.org/bpf/CAADnVQJ3FiXUhZJwX_81sjZvSYYKCFB3BT6P8D59RS2Gu+0Z7g@mail.gmail.com/
      
      Cc: Hou Tao <houtao@huaweicloud.com>
      Fixes: 958cf2e2 ("bpf: Introduce bpf_obj_new")
      Signed-off-by: default avatarYonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20231214203815.1469107-1-yonghong.song@linux.devSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      59e5791f
  2. 14 Dec, 2023 26 commits
  3. 13 Dec, 2023 10 commits