1. 05 Dec, 2023 9 commits
    • selftests/bpf: Test bpf_kptr_xchg stashing of bpf_rb_root · 1b4c7e20
      Dave Marchevsky authored
      There was some confusion amongst Meta sched_ext folks regarding whether
      stashing bpf_rb_root - the tree itself, rather than a single node - was
      supported. This patch adds a small test which demonstrates this
      functionality: a local kptr with rb_root is created, a node is created
      and added to the tree, then the tree is kptr_xchg'd into a mapval.
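
      Roughly, the pattern being exercised looks like the sketch below. This
      is an illustrative sketch rather than the selftest itself: it assumes
      the usual selftest headers (vmlinux.h, bpf_experimental.h), and the
      struct, map and program names (tree_container, stash, stash_rb_root)
      are made up for the example.

        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>
        #include "bpf_experimental.h"

        struct node_data {
                struct bpf_rb_node node;
                int key;
        };

        /* Local kptr that owns a whole rb-tree plus its lock. */
        struct tree_container {
                struct bpf_rb_root root __contains(node_data, node);
                struct bpf_spin_lock lock;
        };

        struct map_value {
                struct tree_container __kptr *stashed;
        };

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, 1);
                __type(key, int);
                __type(value, struct map_value);
        } stash SEC(".maps");

        static bool less(struct bpf_rb_node *a, const struct bpf_rb_node *b)
        {
                return container_of(a, struct node_data, node)->key <
                       container_of(b, struct node_data, node)->key;
        }

        SEC("tc")
        int stash_rb_root(void *ctx)
        {
                struct tree_container *c, *old;
                struct node_data *n;
                struct map_value *v;
                int idx = 0;

                v = bpf_map_lookup_elem(&stash, &idx);
                if (!v)
                        return 0;

                c = bpf_obj_new(typeof(*c));
                if (!c)
                        return 0;

                n = bpf_obj_new(typeof(*n));
                if (!n) {
                        bpf_obj_drop(c);
                        return 0;
                }

                /* Build the tree: add one node under the tree's own lock. */
                bpf_spin_lock(&c->lock);
                bpf_rbtree_add(&c->root, &n->node, less);
                bpf_spin_unlock(&c->lock);

                /* Stash the whole tree (root and node) into the map value. */
                old = bpf_kptr_xchg(&v->stashed, c);
                if (old)
                        bpf_obj_drop(old);
                return 0;
        }

        char _license[] SEC("license") = "GPL";

      bpf_kptr_xchg() hands back whatever was stashed before, which must be
      dropped to avoid leaking a previously stored tree.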
      Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/bpf/20231204211722.571346-1-davemarchevsky@fb.com
    • Merge branch 'bpf-fix-the-release-of-inner-map' · ce3c49da
      Alexei Starovoitov authored
      Hou Tao says:
      
      ====================
      bpf: Fix the release of inner map
      
      From: Hou Tao <houtao1@huawei.com>
      
      Hi,
      
      The patchset aims to fix the release of inner maps in a map array or
      map htab. The release of an inner map differs from that of a normal
      map. A normal map is released after the bpf program which uses it is
      destroyed, because the bpf program tracks the maps it uses. However,
      a bpf program cannot track the inner maps it uses, because these
      inner maps may be updated or deleted dynamically, and currently the
      ref-counter of an inner map is decreased as soon as the inner map is
      removed from the outer map. The inner map may therefore be freed
      while a bpf program is still accessing it, leading to the
      use-after-free problem demonstrated by patch #6.
      
      The patchset fixes the problem by deferring the release of the inner
      map. The freeing of the inner map is deferred according to the
      sleepable attribute of the bpf programs which own the outer map.
      Patch #1 fixes a warning seen when running the newly-added selftest
      under interpreter mode. Patch #2 adds more parameters to
      .map_fd_put_ptr() to prepare for the fix. Patch #3 fixes the
      incorrect value of need_defer when freeing the fd array. Patch #4
      fixes the potential use-after-free problem by using
      call_rcu_tasks_trace() and call_rcu() to unconditionally wait for
      one tasks trace RCU GP and one RCU GP. Patch #5 optimizes the
      freeing of the inner map by removing the unnecessary RCU GP waiting.
      Patch #6 adds a selftest to demonstrate the potential use-after-free
      problem. Patch #7 updates a selftest to update the outer map in a
      syscall bpf program.
      
      Please see individual patches for more details. And comments are always
      welcome.
      
      Change Log:
      v5:
       * patch #3: rename fd_array_map_delete_elem_with_deferred_free() to
                   __fd_array_map_delete_elem() (Alexei)
       * patch #5: use atomic64_t instead of atomic_t to prevent potential
                   overflow (Alexei)
       * patch #7: use ptr_to_u64() helper instead of force casting to initialize
                   pointers in bpf_attr (Alexei)
      
      v4: https://lore.kernel.org/bpf/20231130140120.1736235-1-houtao@huaweicloud.com
        * patch #2: don't use "deferred", use "need_defer" uniformly
        * patch #3: newly-added, fix the incorrect value of need_defer during
                    fd array free.
        * patch #4: don't consider the case in which the bpf map is not used
                    by any bpf program, and only use sleepable_refcnt to
                    remove the unnecessary tasks trace RCU GP (Alexei)
        * patch #4: remove memory barriers added due to cautiousness (Alexei)
      
      v3: https://lore.kernel.org/bpf/20231124113033.503338-1-houtao@huaweicloud.com
        * multiple variable renamings (Martin)
        * define BPF_MAP_RCU_GP/BPF_MAP_RCU_TT_GP as bit (Martin)
        * use call_rcu() and its variants instead of synchronize_rcu() (Martin)
        * remove unnecessary mask in bpf_map_free_deferred() (Martin)
        * place atomic_or() and the related smp_mb() together (Martin)
        * add patch #6 to demonstrate that updating outer map in syscall
          program is dead-lock free (Alexei)
        * update comments about the memory barrier in bpf_map_fd_put_ptr()
        * update commit message for patch #3 and #4 to describe more details
      
      v2: https://lore.kernel.org/bpf/20231113123324.3914612-1-houtao@huaweicloud.com
        * defer the invocation of ops->map_free() instead of bpf_map_put() (Martin)
        * update the selftest to make it reproducible under JIT mode (Martin)
        * remove unnecessary preparatory patches
      
      v1: https://lore.kernel.org/bpf/20231107140702.1891778-1-houtao@huaweicloud.com
      ====================
      
      Link: https://lore.kernel.org/r/20231204140425.1480317-1-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Test outer map update operations in syscall program · e3dd4082
      Hou Tao authored
      A syscall program runs with rcu_read_lock_trace held, so if
      bpf_map_update_elem() or bpf_map_delete_elem() were to invoke
      synchronize_rcu_tasks_trace() when operating on an outer map, it
      would deadlock. Add a test to guarantee that updating an outer map
      in a syscall program is deadlock-free.
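
      The tested pattern can be sketched as below, assuming a SEC("syscall")
      program that drives BPF_MAP_UPDATE_ELEM on the outer map through
      bpf_sys_bpf(); the global fd variables and the program name are
      illustrative, not the selftest's.

        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>

        /* Filled in by userspace before the program is run. */
        int outer_map_fd;
        int inner_map_fd;
        int zero_key;

        /* Kept in .bss so the unused bpf_attr fields stay zeroed. */
        union bpf_attr attr;

        static __u64 ptr_to_u64(const void *ptr)
        {
                return (__u64)(unsigned long)ptr;
        }

        SEC("syscall")
        int update_outer_map(void *ctx)
        {
                /* This program runs with rcu_read_lock_trace held, so the
                 * update below must not wait for a tasks trace RCU GP. */
                attr.map_fd = outer_map_fd;
                attr.key = ptr_to_u64(&zero_key);
                attr.value = ptr_to_u64(&inner_map_fd);

                return bpf_sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
        }

        char _license[] SEC("license") = "GPL";

      Userspace would set outer_map_fd/inner_map_fd and run the program with
      bpf_prog_test_run_opts(); the test simply has to complete, which it
      cannot do if the outer map update blocks on
      synchronize_rcu_tasks_trace().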
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-8-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Add test cases for inner map · 1624918b
      Hou Tao authored
      Add test cases covering the race between the destruction of an inner
      map due to a map-in-map update and the access of the inner map in a
      bpf program. The following 4 combinations are added:
      (1) array map in map array + bpf program
      (2) array map in map array + sleepable bpf program
      (3) array map in map htab + bpf program
      (4) array map in map htab + sleepable bpf program
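
      For combination (1), the accessing side can be sketched roughly as
      below; the map, program and section names are illustrative (the kprobe
      symbol is the x86-64 one, used here only for illustration).

        #include <vmlinux.h>
        #include <bpf/bpf_helpers.h>

        struct inner_array {
                __uint(type, BPF_MAP_TYPE_ARRAY);
                __uint(max_entries, 1);
                __type(key, int);
                __type(value, int);
        } inner_map SEC(".maps");

        struct {
                __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
                __uint(max_entries, 1);
                __type(key, int);
                __array(values, struct inner_array);
        } outer_array SEC(".maps") = {
                .values = { [0] = &inner_map },
        };

        SEC("kprobe/__x64_sys_getpgid")
        int access_map_in_array(void *ctx)
        {
                int key = 0, value = 1;
                void *inner;

                /* While this runs, userspace keeps swapping the inner map
                 * out of outer_array; without the deferred free, the update
                 * below can touch freed memory, as in the KASAN report that
                 * follows. */
                inner = bpf_map_lookup_elem(&outer_array, &key);
                if (!inner)
                        return 0;
                bpf_map_update_elem(inner, &key, &value, BPF_ANY);
                return 0;
        }

        char _license[] SEC("license") = "GPL";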
      
      Before applying the fixes, running `./test_progs -a map_in_map`
      reported the following error:
      
        ==================================================================
        BUG: KASAN: slab-use-after-free in array_map_update_elem+0x48/0x3e0
        Read of size 4 at addr ffff888114f33824 by task test_progs/1858
      
        CPU: 1 PID: 1858 Comm: test_progs Tainted: G           O     6.6.0+ #7
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
        Call Trace:
         <TASK>
         dump_stack_lvl+0x4a/0x90
         print_report+0xd2/0x620
         kasan_report+0xd1/0x110
         __asan_load4+0x81/0xa0
         array_map_update_elem+0x48/0x3e0
         bpf_prog_be94a9f26772f5b7_access_map_in_array+0xe6/0xf6
         trace_call_bpf+0x1aa/0x580
         kprobe_perf_func+0xdd/0x430
         kprobe_dispatcher+0xa0/0xb0
         kprobe_ftrace_handler+0x18b/0x2e0
         0xffffffffc02280f7
        RIP: 0010:__x64_sys_getpgid+0x1/0x30
        ......
         </TASK>
      
        Allocated by task 1857:
         kasan_save_stack+0x26/0x50
         kasan_set_track+0x25/0x40
         kasan_save_alloc_info+0x1e/0x30
         __kasan_kmalloc+0x98/0xa0
         __kmalloc_node+0x6a/0x150
         __bpf_map_area_alloc+0x141/0x170
         bpf_map_area_alloc+0x10/0x20
         array_map_alloc+0x11f/0x310
         map_create+0x28a/0xb40
         __sys_bpf+0x753/0x37c0
         __x64_sys_bpf+0x44/0x60
         do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
      
        Freed by task 11:
         kasan_save_stack+0x26/0x50
         kasan_set_track+0x25/0x40
         kasan_save_free_info+0x2b/0x50
         __kasan_slab_free+0x113/0x190
         slab_free_freelist_hook+0xd7/0x1e0
         __kmem_cache_free+0x170/0x260
         kfree+0x9b/0x160
         kvfree+0x2d/0x40
         bpf_map_area_free+0xe/0x20
         array_map_free+0x120/0x2c0
         bpf_map_free_deferred+0xd7/0x1e0
         process_one_work+0x462/0x990
         worker_thread+0x370/0x670
         kthread+0x1b0/0x200
         ret_from_fork+0x3a/0x70
         ret_from_fork_asm+0x1b/0x30
      
        Last potentially related work creation:
         kasan_save_stack+0x26/0x50
         __kasan_record_aux_stack+0x94/0xb0
         kasan_record_aux_stack_noalloc+0xb/0x20
         __queue_work+0x331/0x950
         queue_work_on+0x75/0x80
         bpf_map_put+0xfa/0x160
         bpf_map_fd_put_ptr+0xe/0x20
         bpf_fd_array_map_update_elem+0x174/0x1b0
         bpf_map_update_value+0x2b7/0x4a0
         __sys_bpf+0x2551/0x37c0
         __x64_sys_bpf+0x44/0x60
         do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-7-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Optimize the free of inner map · af66bfd3
      Hou Tao authored
      When removing an inner map from the outer map, the inner map will be
      freed after one RCU grace period and one RCU tasks trace grace
      period, so it is certain that any bpf program which may access the
      inner map has exited before the inner map is freed.

      However, there is no need to wait for an RCU tasks trace grace
      period if the outer map is only accessed by non-sleepable programs.
      So add sleepable_refcnt to bpf_map and increase sleepable_refcnt
      when adding the outer map into env->used_maps for a sleepable
      program. Although the maximum number of bpf programs is INT_MAX - 1,
      the number of bpf programs which are being loaded may be greater
      than INT_MAX, so use atomic64_t instead of atomic_t for
      sleepable_refcnt. When removing the inner map from the outer map,
      use sleepable_refcnt to decide whether or not an RCU tasks trace
      grace period is needed before freeing the inner map.
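
      A simplified sketch of what is described above, combining the
      verifier-side accounting with the release side; the helper name
      bpf_map_track_sleepable_user() is invented for the example and the
      flag names mirror the patchset's wording, so the exact kernel code
      may differ.

        /* Verifier side (sketch): while the outer map is added to
         * env->used_maps, count how many sleepable programs use it. */
        static void bpf_map_track_sleepable_user(struct bpf_verifier_env *env,
                                                 struct bpf_map *map)
        {
                if (env->prog->aux->sleepable)
                        atomic64_inc(&map->sleepable_refcnt);
        }

        /* Release side (sketch): when the inner map is removed from the
         * outer map, a tasks trace RCU GP is only needed if a sleepable
         * program may still be using the outer map. */
        void bpf_map_fd_put_ptr(struct bpf_map *map, void *ptr, bool need_defer)
        {
                struct bpf_map *inner_map = ptr;

                if (need_defer) {
                        if (atomic64_read(&map->sleepable_refcnt))
                                WRITE_ONCE(inner_map->free_after_mult_rcu_gp, true);
                        else
                                WRITE_ONCE(inner_map->free_after_rcu_gp, true);
                }
                bpf_map_put(inner_map);
        }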
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-6-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Defer the free of inner map when necessary · 87667336
      Hou Tao authored
      When updating or deleting an inner map in a map array or map htab,
      the inner map may still be accessed by a non-sleepable or sleepable
      bpf program. However, bpf_map_fd_put_ptr() decreases the ref-counter
      of the inner map directly through bpf_map_put(); if that ref-counter
      is the last one (which is true in most cases), the inner map will be
      freed by ops->map_free() in a kworker. And currently most
      .map_free() callbacks don't use synchronize_rcu() or its variants to
      wait for an RCU grace period to elapse, so after the invocation of
      ops->map_free() completes, a bpf program which is still accessing
      the inner map may run into a use-after-free problem.

      Fix the freeing of the inner map by invoking bpf_map_free_deferred()
      after both one RCU grace period and one tasks trace RCU grace
      period, if the inner map has previously been removed from the outer
      map. The deferment is accomplished by using call_rcu() or
      call_rcu_tasks_trace() when releasing the last ref-counter of the
      bpf map. The newly-added rcu_head field in bpf_map shares the same
      storage space with the work field to reduce the size of bpf_map.
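
      The deferral chain described above can be sketched as follows; this is
      a simplified paraphrase of the release path (uncharging, btf_put() and
      the sleepable-aware refinement from the "Optimize the free of inner
      map" patch are elided), so names may differ slightly from the final
      kernel code.

        struct bpf_map {
                /* ... */
                union {
                        struct work_struct work; /* kworker free path */
                        struct rcu_head rcu;     /* reused for RCU callbacks */
                };
                bool free_after_mult_rcu_gp;
                /* ... */
        };

        static void bpf_map_free_deferred(struct work_struct *work);

        static void bpf_map_free_in_work(struct bpf_map *map)
        {
                INIT_WORK(&map->work, bpf_map_free_deferred);
                queue_work(system_unbound_wq, &map->work);
        }

        static void bpf_map_free_rcu_gp(struct rcu_head *rcu)
        {
                bpf_map_free_in_work(container_of(rcu, struct bpf_map, rcu));
        }

        static void bpf_map_free_mult_rcu_gp(struct rcu_head *rcu)
        {
                /* One tasks trace RCU GP has elapsed; now wait for a
                 * regular RCU GP before handing off to the kworker. */
                call_rcu(rcu, bpf_map_free_rcu_gp);
        }

        void bpf_map_put(struct bpf_map *map)
        {
                if (atomic64_dec_and_test(&map->refcnt)) {
                        /* ... uncharge, btf_put(), etc. elided ... */
                        if (READ_ONCE(map->free_after_mult_rcu_gp))
                                call_rcu_tasks_trace(&map->rcu,
                                                     bpf_map_free_mult_rcu_gp);
                        else
                                bpf_map_free_in_work(map);
                }
        }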
      
      Fixes: bba1dc0b ("bpf: Remove redundant synchronize_rcu.")
      Fixes: 638e4b82 ("bpf: Allows per-cpu maps and map-in-map in sleepable programs")
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-5-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Set need_defer as false when clearing fd array during map free · 79d93b3c
      Hou Tao authored
      The map deletion, map release and map free operations all use
      fd_array_map_delete_elem() to remove an element from the fd array,
      and need_defer is always true in fd_array_map_delete_elem(). For the
      map deletion and map release operations, need_defer=true is
      necessary, because a bpf program which accesses the element in the
      fd array may still be alive. However, for the map free operation, it
      is certain that any bpf program which owns the fd array has already
      exited, so setting need_defer to false is appropriate for the map
      free operation.

      Fix it by adding a need_defer parameter to bpf_fd_array_map_clear()
      and adding a new helper __fd_array_map_delete_elem() to handle the
      map deletion, map release and map free operations accordingly.
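
      A simplified sketch of the resulting split; the poke-descriptor
      handling and other details of the real fd-array code are elided.

        static long __fd_array_map_delete_elem(struct bpf_map *map, void *key,
                                               bool need_defer)
        {
                struct bpf_array *array = container_of(map, struct bpf_array, map);
                u32 index = *(u32 *)key;
                void *old_ptr;

                if (index >= array->map.max_entries)
                        return -E2BIG;

                old_ptr = xchg(array->ptrs + index, NULL);
                if (!old_ptr)
                        return -ENOENT;

                map->ops->map_fd_put_ptr(map, old_ptr, need_defer);
                return 0;
        }

        /* Map deletion: a bpf program may still use the old element. */
        static long fd_array_map_delete_elem(struct bpf_map *map, void *key)
        {
                return __fd_array_map_delete_elem(map, key, true);
        }

        /* Called with need_defer=true from the map release path and with
         * need_defer=false from the map free path, where no program which
         * owns the fd array can still be running. */
        static void bpf_fd_array_map_clear(struct bpf_map *map, bool need_defer)
        {
                struct bpf_array *array = container_of(map, struct bpf_array, map);
                int i;

                for (i = 0; i < array->map.max_entries; i++)
                        __fd_array_map_delete_elem(map, &i, need_defer);
        }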
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-4-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Add map and need_defer parameters to .map_fd_put_ptr() · 20c20bd1
      Hou Tao authored
      map is the pointer to the outer map, and need_defer needs some
      explanation. need_defer tells the implementation to defer the
      release of the passed element's reference and to ensure that the
      element stays alive until every bpf program which may manipulate it
      has exited.

      The following three cases invoke map_fd_put_ptr(), and different
      need_defer values are passed by these callers:
      
      1) releasing the reference of the old element in the map during map
         update or map deletion. The release must be deferred, otherwise
         the bpf program may run into a use-after-free problem, so
         need_defer needs to be true.
      2) releasing the reference of the to-be-added element in the error
         path of map update. The to-be-added element is not visible to any
         bpf program, so it is OK to pass false for the need_defer
         parameter.
      3) releasing the references of all elements in the map during map
         release. Any bpf program which had access to the map must already
         have exited and been released, so need_defer=false is OK.
      
      These two parameters will be used by the following patches to fix the
      potential use-after-free problem for map-in-map.
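
      A sketch of the updated callback signature, with one illustrative
      wrapper per case above (the wrapper names are invented; the real call
      sites live in kernel/bpf/arraymap.c and kernel/bpf/hashtab.c).

        struct bpf_map_ops {
                /* ... other callbacks elided ... */
                void (*map_fd_put_ptr)(struct bpf_map *map, void *ptr,
                                       bool need_defer);
        };

        /* Case 1: the old element is replaced or deleted while programs
         * may still hold a pointer into it, so its release must defer. */
        static void put_old_element(struct bpf_map *map, void *old_ptr)
        {
                map->ops->map_fd_put_ptr(map, old_ptr, true);
        }

        /* Case 2: undo a failed update; the new element never became
         * visible to any bpf program. */
        static void put_unpublished_element(struct bpf_map *map, void *new_ptr)
        {
                map->ops->map_fd_put_ptr(map, new_ptr, false);
        }

        /* Case 3: drop every element while the map itself is being
         * released; no program with access to the map is left. */
        static void put_element_on_map_release(struct bpf_map *map, void *ptr)
        {
                map->ops->map_fd_put_ptr(map, ptr, false);
        }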
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-3-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: Check rcu_read_lock_trace_held() before calling bpf map helpers · 169410eb
      Hou Tao authored
      These three bpf_map_{lookup,update,delete}_elem() helpers are also
      available to sleepable bpf programs, so add the corresponding lock
      assertion for sleepable bpf programs as well; otherwise the
      following warning is reported when a sleepable bpf program
      manipulates a bpf map under interpreter mode (aka bpf_jit_enable=0):
      
        WARNING: CPU: 3 PID: 4985 at kernel/bpf/helpers.c:40 ......
        CPU: 3 PID: 4985 Comm: test_progs Not tainted 6.6.0+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
        RIP: 0010:bpf_map_lookup_elem+0x54/0x60
        ......
        Call Trace:
         <TASK>
         ? __warn+0xa5/0x240
         ? bpf_map_lookup_elem+0x54/0x60
         ? report_bug+0x1ba/0x1f0
         ? handle_bug+0x40/0x80
         ? exc_invalid_op+0x18/0x50
         ? asm_exc_invalid_op+0x1b/0x20
         ? __pfx_bpf_map_lookup_elem+0x10/0x10
         ? rcu_lockdep_current_cpu_online+0x65/0xb0
         ? rcu_is_watching+0x23/0x50
         ? bpf_map_lookup_elem+0x54/0x60
         ? __pfx_bpf_map_lookup_elem+0x10/0x10
         ___bpf_prog_run+0x513/0x3b70
         __bpf_prog_run32+0x9d/0xd0
         ? __bpf_prog_enter_sleepable_recur+0xad/0x120
         ? __bpf_prog_enter_sleepable_recur+0x3e/0x120
         bpf_trampoline_6442580665+0x4d/0x1000
         __x64_sys_getpgid+0x5/0x30
         ? do_syscall_64+0x36/0xb0
         entry_SYSCALL_64_after_hwframe+0x6e/0x76
         </TASK>
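
      A minimal sketch of the added check, following the assertion pattern
      already used in kernel/bpf/helpers.c (shown here for the lookup helper
      only; the update and delete helpers get the same treatment).

        BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
        {
                /* Sleepable programs run under rcu_read_lock_trace(), so
                 * accept that as a valid protection context too. */
                WARN_ON_ONCE(!rcu_read_lock_held() &&
                             !rcu_read_lock_trace_held() &&
                             !rcu_read_lock_bh_held());
                return (unsigned long)map->ops->map_lookup_elem(map, key);
        }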
      Signed-off-by: Hou Tao <houtao1@huawei.com>
      Link: https://lore.kernel.org/r/20231204140425.1480317-2-houtao@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  2. 04 Dec, 2023 2 commits
  3. 02 Dec, 2023 19 commits
  4. 01 Dec, 2023 10 commits