1. 05 Sep, 2024 1 commit
  2. 04 Sep, 2024 9 commits
    • selftests/bpf: Add a selftest for x86 jit convergence issues · eff5b5ff
      Yonghong Song authored
      The core part of the selftest, i.e., the je <-> jmp cycle, mimics the
      original sched-ext bpf program. The test will fail without the
      previous patch.
      
      I tried to create some cases for the other potential cycles
      (je <-> je, jmp <-> je and jmp <-> jmp) with a similar pattern
      to the test in this patch, but failed. So this patch
      only contains one test for the je <-> jmp cycle.
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240904221256.37389-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      eff5b5ff
    • bpf, x64: Fix a jit convergence issue · c8831bdb
      Yonghong Song authored
      Daniel Hodges reported a jit error when playing with a sched-ext program.
      The error message is:
        unexpected jmp_cond padding: -4 bytes
      
      But further investigation shows the error is actually due to failed
      convergence. The following is some analysis:
      
        ...
        pass4, final_proglen=4391:
          ...
          20e:    48 85 ff                test   rdi,rdi
          211:    74 7d                   je     0x290
          213:    48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
          ...
          289:    48 85 ff                test   rdi,rdi
          28c:    74 17                   je     0x2a5
          28e:    e9 7f ff ff ff          jmp    0x212
          293:    bf 03 00 00 00          mov    edi,0x3
      
      Note that insn at 0x211 is a 2-byte cond jump insn for offset 0x7d (125)
      and insn at 0x28e is a 5-byte jmp insn with offset -129.
      
        pass5, final_proglen=4392:
          ...
          20e:    48 85 ff                test   rdi,rdi
          211:    0f 84 80 00 00 00       je     0x297
          217:    48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
          ...
          28d:    48 85 ff                test   rdi,rdi
          290:    74 1a                   je     0x2ac
          292:    eb 84                   jmp    0x218
          294:    bf 03 00 00 00          mov    edi,0x3
      
      Note that insn at 0x211 is a 6-byte cond jump insn now since its offset
      becomes 0x80 based on the previous round (0x293 - 0x213 = 0x80). At the same
      time, insn at 0x292 is a 2-byte insn since its offset is -124.
      
      pass6 will repeat the same code as in pass4. pass7 will repeat the same
      code as in pass5, and so on. This will prevent eventual convergence.
      
      Passes 1-14 are with padding = 0. At pass15, padding is 1 and the related
      insns look like:
      
          211:    0f 84 80 00 00 00       je     0x297
          217:    48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
          ...
          24d:    48 85 d2                test   rdx,rdx
      
      The corresponding code in pass14:
          211:    74 7d                   je     0x290
          213:    48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
          ...
          249:    48 85 d2                test   rdx,rdx
          24c:    74 21                   je     0x26f
          24e:    48 01 f7                add    rdi,rsi
          ...
      
      Before generating the following insn,
        250:    74 21                   je     0x273
      "padding = 1" enables some checking to ensure nops is either 0 or 4
      where
        #define INSN_SZ_DIFF (((addrs[i] - addrs[i - 1]) - (prog - temp)))
        nops = INSN_SZ_DIFF - 2
      
      In this specific case,
        addrs[i] = 0x24e // from pass14
        addrs[i-1] = 0x24d // from pass15
        prog - temp = 3 // from 'test rdx,rdx' in pass15
      so
        INSN_SZ_DIFF = (0x24e - 0x24d) - 3 = -2
        nops = INSN_SZ_DIFF - 2 = -4
      and this triggers the failure.
      
      To fix the issue, we need to break cycles of je <-> jmp. For example,
      in the above case, we have
        211:    74 7d                   je     0x290
      where the offset is 0x7d. If the 2-byte je insn is generated only when
      the offset is less than 0x7d (i.e. <= 0x7c), the cycle can be
      broken and we can achieve convergence.
      
      I did some study on other cases like je <-> je, jmp <-> je and
      jmp <-> jmp which may cause cycles. Those cases are not from actual
      reproducible cases since it is pretty hard to construct a test case
      for them. The results show that an offset <= 0x7b (0x7b = 123) should
      be enough to cover all cases. This patch adds a new helper to generate 8-bit
      cond/uncond jmp insns only if the offset is in the range [-128, 123].
      Reported-by: Daniel Hodges <hodgesd@meta.com>
      Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
      Link: https://lore.kernel.org/r/20240904221251.37109-1-yonghong.song@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c8831bdb
    • selftests: bpf: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE · 23457b37
      Feng Yang authored
      The ARRAY_SIZE macro is more compact and is the idiomatic form in the
      Linux source tree.
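      For illustration, the kind of replacement this makes (ARRAY_SIZE shown
      here in its simplest form; the kernel/selftests headers may add an
      array-type check on top):

        #include <stddef.h>

        #define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

        static int vals[8];

        static size_t count_vals(void)
        {
                /* before: return sizeof(vals) / sizeof(vals[0]); */
                return ARRAY_SIZE(vals);
        }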
      Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
      Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/bpf/20240903072559.292607-1-yangfeng59949@163.com
      23457b37
    • Merge branch 'bpf-follow-up-on-gen_epilogue' · 6fee7a7e
      Alexei Starovoitov authored
      Martin KaFai Lau says:
      
      ====================
      bpf: Follow up on gen_epilogue
      
      From: Martin KaFai Lau <martin.lau@kernel.org>
      
      The set addresses some follow ups on the earlier gen_epilogue
      patch set.
      ====================
      Acked-by: Eduard Zingerman <eddyz87@gmail.com>
      Link: https://lore.kernel.org/r/20240904180847.56947-1-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6fee7a7e
    • bpf: Fix indentation issue in epilogue_idx · 00750788
      Martin KaFai Lau authored
      There is a report of a new indentation issue in epilogue_idx.
      This patch fixes it.
      
      Fixes: 169c3176 ("bpf: Add gen_epilogue to bpf_verifier_ops")
      Reported-by: kernel test robot <lkp@intel.com>
      Closes: https://lore.kernel.org/oe-kbuild-all/202408311622.4GzlzN33-lkp@intel.com/
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240904180847.56947-3-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      00750788
    • bpf: Remove the insn_buf array stack usage from the inline_bpf_loop() · 940ce73b
      Martin KaFai Lau authored
      This patch removes the insn_buf array stack usage from
      inline_bpf_loop(). Instead, env->insn_buf is used. The
      usage in inline_bpf_loop() needs more than 16 insns, so
      INSN_BUF_SIZE needs to be increased from 16 to 32.
      The compiler stack size warning on the verifier is gone
      after this change.
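      A rough sketch of the shape of the change (the env struct below is a
      toy stand-in; the real struct bpf_verifier_env carries far more state):

        #include <linux/bpf.h>          /* struct bpf_insn */

        #define INSN_BUF_SIZE 32        /* increased from 16 */

        /* Toy stand-in for struct bpf_verifier_env; the relevant part here
         * is the embedded instruction buffer. */
        struct bpf_verifier_env {
                struct bpf_insn insn_buf[INSN_BUF_SIZE];
        };

        static struct bpf_insn *inline_bpf_loop(struct bpf_verifier_env *env)
        {
                /* was: a 16-entry struct bpf_insn array on the kernel stack */
                struct bpf_insn *insn_buf = env->insn_buf;

                /* ... emit the inlined bpf_loop body into insn_buf ... */
                return insn_buf;
        }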
      
      Cc: Eduard Zingerman <eddyz87@gmail.com>
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Link: https://lore.kernel.org/r/20240904180847.56947-2-martin.lau@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      940ce73b
    • samples/bpf: Remove sample tracex2 · 46f4ea04
      Rong Tao authored
      In commit ba8de796 ("net: introduce sk_skb_reason_drop function"),
      kfree_skb_reason() became an inline function and can no longer be traced.
      
      samples/bpf is abandonware by now, and we should slowly but surely
      convert whatever makes sense into BPF selftests under
      tools/testing/selftests/bpf and just get rid of the rest.
      
      Link: https://github.com/torvalds/linux/commit/ba8de796baf4bdc03530774fb284fe3c97875566
      Signed-off-by: Rong Tao <rongtao@cestc.cn>
      Link: https://lore.kernel.org/r/tencent_30ADAC88CB2915CA57E9512D4460035BA107@qq.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      46f4ea04
    • selftests/bpf: Fix procmap_query()'s params mismatch and compilation warning · 02baa0a2
      Yuan Chen authored
      When PROCMAP_QUERY is not defined, a compilation error occurs due to a
      mismatch in procmap_query()'s parameters. Since procmap_query() is only
      called in the file where it is defined, modify the parameters so they match.
      
      We get a warning when building samples/bpf:
          trace_helpers.c:252:5: warning: no previous prototype for ‘procmap_query’ [-Wmissing-prototypes]
            252 | int procmap_query(int fd, const void *addr, __u32 query_flags, size_t *start, size_t *offset, int *flags)
                |     ^~~~~~~~~~~~~
      As this function is only used in this file, mark it as 'static'.
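      For illustration, the shape of that part of the fix (the signature is
      taken from the warning above; the body is elided):

        #include <stddef.h>
        #include <linux/types.h>

        /* Marking the helper static keeps the symbol file-local and silences
         * the -Wmissing-prototypes warning. */
        static int procmap_query(int fd, const void *addr, __u32 query_flags,
                                 size_t *start, size_t *offset, int *flags)
        {
                /* ... PROCMAP_QUERY ioctl when available ... (body elided) */
                return 0;
        }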
      
      Fixes: 4e9e0760 ("selftests/bpf: make use of PROCMAP_QUERY ioctl if available")
      Signed-off-by: Yuan Chen <chenyuan@kylinos.cn>
      Link: https://lore.kernel.org/r/20240903012839.3178-1-chenyuan_fl@163.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      02baa0a2
    • bpf, arm64: Jit BPF_CALL to direct call when possible · ddbe9ec5
      Xu Kuohai authored
      Currently, BPF_CALL is always jited to an indirect call. When the target
      is within the range of a direct call, BPF_CALL can be jited to a direct call.
      
      For example, the following BPF_CALL
      
          call __htab_map_lookup_elem
      
      is always jited to an indirect call:
      
          mov     x10, #0xffffffffffff18f4
          movk    x10, #0x821, lsl #16
          movk    x10, #0x8000, lsl #32
          blr     x10
      
      When the address of the target __htab_map_lookup_elem is within the range
      of a direct call, the BPF_CALL can be jited to:
      
          bl      0xfffffffffd33bc98
      
      This patch does such jit optimization by emitting arm64 direct calls for
      BPF_CALL when possible, indirect calls otherwise.
      
      Without this patch, the jit works as follows.
      
      1. First pass
         A. Determine jited position and size for each bpf instruction.
         B. Compute the jited image size.
      
      2. Allocate jited image with size computed in step 1.
      
      3. Second pass
         A. Adjust jump offsets for jump instructions.
         B. Write the final image.
      
      This works because, for a given bpf prog, regardless of where the jited
      image is allocated, the jited result for each instruction is fixed. The
      second pass differs from the first only in adjusting the jump offsets,
      like changing "jmp imm1" to "jmp imm2", while the position and size of
      the "jmp" instruction remain unchanged.
      
      Now consider whether to jit BPF_CALL to an arm64 direct or indirect call
      instruction. The choice depends solely on the jump offset: direct call
      if the jump offset is within 128MB, indirect call otherwise.
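      A rough sketch of that decision (names are illustrative, not the jit's
      exact code; arm64 BL encodes a signed 26-bit word offset, i.e. roughly
      +-128MB around the call site):

        #include <stdbool.h>
        #include <stdint.h>

        /* Byte offset from the BL instruction to the call target. */
        static bool in_bl_range(uint64_t call_site, uint64_t target)
        {
                int64_t off = (int64_t)(target - call_site);

                /* BL reach: signed 26-bit immediate scaled by 4 bytes;
                 * upper bound simplified (precise limit is 2^27 - 4). */
                return off >= -(1LL << 27) && off < (1LL << 27);
        }

        /* if (in_bl_range(pc, target))  emit "bl  target"            (direct)
         * else                          emit "mov x10, target; blr x10" (indirect) */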
      
      For a given BPF_CALL, the target address is known, so the jump offset is
      decided by the jited address of the BPF_CALL instruction. In other words,
      for a given bpf prog, the jited result for each BPF_CALL is determined
      by its jited address.
      
      The jited address for a BPF_CALL is the jited image address plus the
      total jited size of all preceding instructions. For a given bpf prog,
      there are clearly no BPF_CALL instructions before the first BPF_CALL
      instruction. Since the jited results for all instructions other
      than BPF_CALL are fixed, the total jited size preceding the first
      BPF_CALL is also fixed. Therefore, once the jited image is allocated,
      the jited address for the first BPF_CALL is fixed.
      
      Now that the jited result for the first BPF_CALL is fixed, the jited
      results for all instructions preceding the second BPF_CALL are fixed.
      So the jited address and result for the second BPF_CALL are also fixed.
      
      Similarly, we can conclude that the jited addresses and results for all
      subsequent BPF_CALL instructions are fixed.
      
      This means that, for a given bpf prog, once the jited image is allocated,
      the jited address and result for all instructions, including all BPF_CALL
      instructions, are fixed.
      
      Based on this observation, with this patch, the jit works as follows.
      
      1. First pass
         Estimate the maximum jited image size. In this pass, all BPF_CALLs
         are jited to arm64 indirect calls since the jump offsets are unknown
         because the jited image is not allocated.
      
      2. Allocate jited image with size estimated in step 1.
      
      3. Second pass
         A. Determine the jited result for each BPF_CALL.
         B. Determine jited address and size for each bpf instruction.
      
      4. Third pass
         A. Adjust jump offset for jump instructions.
         B. Write the final image.
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Reviewed-by: Puranjay Mohan <puranjay@kernel.org>
      Link: https://lore.kernel.org/r/20240903094407.601107-1-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ddbe9ec5
  3. 02 Sep, 2024 2 commits
  4. 30 Aug, 2024 18 commits
  5. 29 Aug, 2024 6 commits
  6. 28 Aug, 2024 4 commits
    • selftests/bpf: Fix incorrect parameters in NULL pointer checking · c264487e
      Hao Ge authored
      Smatch reported the following warning:
          ./tools/testing/selftests/bpf/testing_helpers.c:455 get_xlated_program()
          warn: variable dereferenced before check 'buf' (see line 454)
      
      It seems correct, so let's modify it based on its suggestion.

      Actually, commit b23ed4d7 ("selftests/bpf: Fix invalid pointer
      check in get_xlated_program()") fixed an issue in test_verifier.c
      once, but it was reverted this time.

      Let's solve this issue with the minimal changes possible.
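      For illustration only (not the selftest's exact code), the pattern the
      warning is about and the shape of the fix:

        #include <stdlib.h>

        static unsigned int *alloc_buf(size_t cnt)
        {
                unsigned int *buf = calloc(cnt, sizeof(*buf));

                if (!buf)          /* check for NULL first ... */
                        return NULL;
                buf[0] = 0;        /* ... and only then dereference */
                return buf;
        }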
      Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
      Closes: https://lore.kernel.org/all/1eb3732f-605a-479d-ba64-cd14250cbf91@stanley.mountain/
      Fixes: b4b7a409 ("selftests/bpf: Factor out get_xlated_program() helper")
      Signed-off-by: Hao Ge <gehao@kylinos.cn>
      Link: https://lore.kernel.org/r/20240820023622.29190-1-hao.ge@linux.dev
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c264487e
    • Merge branch 'bpf-arm64-simplify-jited-prologue-epilogue' · 4961d8f4
      Alexei Starovoitov authored
      Xu Kuohai says:
      
      ====================
      bpf, arm64: Simplify jited prologue/epilogue
      
      From: Xu Kuohai <xukuohai@huawei.com>
      
      The arm64 jit blindly saves/restores all callee-saved registers, making
      the jited result look a bit too complicated. For example, for an empty
      prog, the jited result is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     x19, x20, [sp, #-16]!
        1c:   stp     x21, x22, [sp, #-16]!
        20:   stp     x26, x25, [sp, #-16]!
        24:   mov     x26, #0
        28:   stp     x26, x25, [sp, #-16]!
        2c:   mov     x26, sp
        30:   stp     x27, x28, [sp, #-16]!
        34:   mov     x25, sp
        38:   bti j 		// tailcall target
        3c:   sub     sp, sp, #0
        40:   mov     x7, #0
        44:   add     sp, sp, #0
        48:   ldp     x27, x28, [sp], #16
        4c:   ldp     x26, x25, [sp], #16
        50:   ldp     x26, x25, [sp], #16
        54:   ldp     x21, x22, [sp], #16
        58:   ldp     x19, x20, [sp], #16
        5c:   ldp     fp, lr, [sp], #16
        60:   mov     x0, x7
        64:   autiasp
        68:   ret
      
      Clearly, there is no need to save/restore unused callee-saved registers.
      This patch changes that, making the jited image save/restore only
      the callee-saved registers it uses.
      
      Now the jited result of empty prog is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     xzr, x26, [sp, #-16]!
        1c:   mov     x26, sp
        20:   bti j		// tailcall target
        24:   mov     x7, #0
        28:   ldp     xzr, x26, [sp], #16
        2c:   ldp     fp, lr, [sp], #16
        30:   mov     x0, x7
        34:   autiasp
        38:   ret
      ====================
      Acked-by: Puranjay Mohan <puranjay@kernel.org>
      Link: https://lore.kernel.org/r/20240826071624.350108-1-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4961d8f4
    • bpf, arm64: Avoid blindly saving/restoring all callee-saved registers · 5d4fa9ec
      Xu Kuohai authored
      The arm64 jit blindly saves/restores all callee-saved registers, making
      the jited result look a bit too complicated. For example, for an empty
      prog, the jited result is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     x19, x20, [sp, #-16]!
        1c:   stp     x21, x22, [sp, #-16]!
        20:   stp     x26, x25, [sp, #-16]!
        24:   mov     x26, #0
        28:   stp     x26, x25, [sp, #-16]!
        2c:   mov     x26, sp
        30:   stp     x27, x28, [sp, #-16]!
        34:   mov     x25, sp
        38:   bti j 		// tailcall target
        3c:   sub     sp, sp, #0
        40:   mov     x7, #0
        44:   add     sp, sp, #0
        48:   ldp     x27, x28, [sp], #16
        4c:   ldp     x26, x25, [sp], #16
        50:   ldp     x26, x25, [sp], #16
        54:   ldp     x21, x22, [sp], #16
        58:   ldp     x19, x20, [sp], #16
        5c:   ldp     fp, lr, [sp], #16
        60:   mov     x0, x7
        64:   autiasp
        68:   ret
      
      Clearly, there is no need to save/restore unused callee-saved registers.
      This patch changes that, making the jited image save/restore only
      the callee-saved registers it uses.
      
      Now the jited result of empty prog is:
      
         0:   bti jc
         4:   mov     x9, lr
         8:   nop
         c:   paciasp
        10:   stp     fp, lr, [sp, #-16]!
        14:   mov     fp, sp
        18:   stp     xzr, x26, [sp, #-16]!
        1c:   mov     x26, sp
        20:   bti j		// tailcall target
        24:   mov     x7, #0
        28:   ldp     xzr, x26, [sp], #16
        2c:   ldp     fp, lr, [sp], #16
        30:   mov     x0, x7
        34:   autiasp
        38:   ret
      
      Since the bpf prog now saves/restores its own callee-saved registers as
      needed, to make tailcall work correctly, the caller needs to restore its
      saved registers before the tailcall, and the callee needs to save its
      callee-saved registers after the tailcall. These extra restoring/saving
      instructions increase performance overhead.
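      A rough sketch of the idea (purely illustrative, not the jit's actual
      code): track which callee-saved registers the prog touches in a bitmask
      while scanning its instructions, then emit stp/ldp pairs in the
      prologue/epilogue only for registers that are actually marked:

        #include <stdint.h>

        /* Bit i set means arm64 callee-saved register x(19 + i) is used. */
        static uint32_t used_callee_regs;

        static void mark_reg_used(int arm64_reg)
        {
                if (arm64_reg >= 19 && arm64_reg <= 28)
                        used_callee_regs |= 1u << (arm64_reg - 19);
        }

        static void emit_prologue_saves(void)
        {
                for (int r = 19; r <= 28; r += 2) {
                        /* save the pair if either register of it is used */
                        if (used_callee_regs & (3u << (r - 19))) {
                                /* emit: stp x<r>, x<r+1>, [sp, #-16]! */
                        }
                }
        }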
      
      [1] provides 2 benchmarks for tailcall scenarios. Below are the perf
      numbers measured in an arm64 KVM guest. The results indicate that the
      performance difference before and after the patch in typical tailcall
      scenarios is negligible.
      
      - Before:
      
       Performance counter stats for './test_progs -t tailcalls' (5 runs):
      
                 4313.43 msec task-clock                       #    0.874 CPUs utilized               ( +-  0.16% )
                     574      context-switches                 #  133.073 /sec                        ( +-  1.14% )
                       0      cpu-migrations                   #    0.000 /sec
                     538      page-faults                      #  124.727 /sec                        ( +-  0.57% )
             10697772784      cycles                           #    2.480 GHz                         ( +-  0.22% )  (61.19%)
             25511241955      instructions                     #    2.38  insn per cycle              ( +-  0.08% )  (66.70%)
              5108910557      branches                         #    1.184 G/sec                       ( +-  0.08% )  (72.38%)
                 2800459      branch-misses                    #    0.05% of all branches             ( +-  0.51% )  (72.36%)
                              TopDownL1                 #     0.60 retiring                    ( +-  0.09% )  (66.84%)
                                                        #     0.21 frontend_bound              ( +-  0.15% )  (61.31%)
                                                        #     0.12 bad_speculation             ( +-  0.08% )  (50.11%)
                                                        #     0.07 backend_bound               ( +-  0.16% )  (33.30%)
              8274201819      L1-dcache-loads                  #    1.918 G/sec                       ( +-  0.18% )  (33.15%)
                  468268      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +-  4.69% )  (33.16%)
                  385383      LLC-loads                        #   89.345 K/sec                       ( +-  5.22% )  (33.16%)
                   38296      LLC-load-misses                  #    9.94% of all LL-cache accesses    ( +- 42.52% )  (38.69%)
              6886576501      L1-icache-loads                  #    1.597 G/sec                       ( +-  0.35% )  (38.69%)
                 1848585      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  4.52% )  (44.23%)
              9043645883      dTLB-loads                       #    2.097 G/sec                       ( +-  0.10% )  (44.33%)
                  416672      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  5.15% )  (49.89%)
              6925626111      iTLB-loads                       #    1.606 G/sec                       ( +-  0.35% )  (55.46%)
                   66220      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.88% )  (55.50%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                  4.9372 +- 0.0526 seconds time elapsed  ( +-  1.07% )
      
       Performance counter stats for './test_progs -t flow_dissector' (5 runs):
      
                10924.50 msec task-clock                       #    0.945 CPUs utilized               ( +-  0.08% )
                     603      context-switches                 #   55.197 /sec                        ( +-  1.13% )
                       0      cpu-migrations                   #    0.000 /sec
                     566      page-faults                      #   51.810 /sec                        ( +-  0.42% )
             27381270695      cycles                           #    2.506 GHz                         ( +-  0.18% )  (60.46%)
             56996583922      instructions                     #    2.08  insn per cycle              ( +-  0.21% )  (66.11%)
             10321647567      branches                         #  944.816 M/sec                       ( +-  0.17% )  (71.79%)
                 3347735      branch-misses                    #    0.03% of all branches             ( +-  3.72% )  (72.15%)
                              TopDownL1                 #     0.52 retiring                    ( +-  0.13% )  (66.74%)
                                                        #     0.27 frontend_bound              ( +-  0.14% )  (61.27%)
                                                        #     0.14 bad_speculation             ( +-  0.19% )  (50.36%)
                                                        #     0.07 backend_bound               ( +-  0.42% )  (33.89%)
             18740797617      L1-dcache-loads                  #    1.715 G/sec                       ( +-  0.43% )  (33.71%)
                13715669      L1-dcache-load-misses            #    0.07% of all L1-dcache accesses   ( +- 32.85% )  (33.34%)
                 4087551      LLC-loads                        #  374.164 K/sec                       ( +- 29.53% )  (33.26%)
                  267906      LLC-load-misses                  #    6.55% of all LL-cache accesses    ( +- 23.90% )  (38.76%)
             15811864229      L1-icache-loads                  #    1.447 G/sec                       ( +-  0.12% )  (38.73%)
                 2976833      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  9.73% )  (44.22%)
             20138907471      dTLB-loads                       #    1.843 G/sec                       ( +-  0.18% )  (44.15%)
                  732850      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +- 11.18% )  (49.64%)
             15895726702      iTLB-loads                       #    1.455 G/sec                       ( +-  0.15% )  (55.13%)
                  152075      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  4.71% )  (54.98%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                 11.5613 +- 0.0317 seconds time elapsed  ( +-  0.27% )
      
      - After:
      
       Performance counter stats for './test_progs -t tailcalls' (5 runs):
      
                 4278.78 msec task-clock                       #    0.871 CPUs utilized               ( +-  0.15% )
                     569      context-switches                 #  132.982 /sec                        ( +-  0.58% )
                       0      cpu-migrations                   #    0.000 /sec
                     539      page-faults                      #  125.970 /sec                        ( +-  0.43% )
             10588986432      cycles                           #    2.475 GHz                         ( +-  0.20% )  (60.91%)
             25303825043      instructions                     #    2.39  insn per cycle              ( +-  0.08% )  (66.48%)
              5110756256      branches                         #    1.194 G/sec                       ( +-  0.07% )  (72.03%)
                 2719569      branch-misses                    #    0.05% of all branches             ( +-  2.42% )  (72.03%)
                              TopDownL1                 #     0.60 retiring                    ( +-  0.22% )  (66.31%)
                                                        #     0.22 frontend_bound              ( +-  0.21% )  (60.83%)
                                                        #     0.12 bad_speculation             ( +-  0.26% )  (50.25%)
                                                        #     0.06 backend_bound               ( +-  0.17% )  (33.52%)
              8163648527      L1-dcache-loads                  #    1.908 G/sec                       ( +-  0.33% )  (33.52%)
                  694979      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +- 30.53% )  (33.52%)
                 1902347      LLC-loads                        #  444.600 K/sec                       ( +- 48.84% )  (33.69%)
                   96677      LLC-load-misses                  #    5.08% of all LL-cache accesses    ( +- 43.48% )  (39.30%)
              6863517589      L1-icache-loads                  #    1.604 G/sec                       ( +-  0.37% )  (39.17%)
                 1871519      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  6.78% )  (44.56%)
              8927782813      dTLB-loads                       #    2.087 G/sec                       ( +-  0.14% )  (44.37%)
                  438237      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  6.00% )  (49.75%)
              6886906831      iTLB-loads                       #    1.610 G/sec                       ( +-  0.36% )  (55.08%)
                   67568      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  3.27% )  (54.86%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                  4.9114 +- 0.0309 seconds time elapsed  ( +-  0.63% )
      
       Performance counter stats for './test_progs -t flow_dissector' (5 runs):
      
                10948.40 msec task-clock                       #    0.942 CPUs utilized               ( +-  0.05% )
                     615      context-switches                 #   56.173 /sec                        ( +-  1.65% )
                       1      cpu-migrations                   #    0.091 /sec                        ( +- 31.62% )
                     567      page-faults                      #   51.788 /sec                        ( +-  0.44% )
             27334194328      cycles                           #    2.497 GHz                         ( +-  0.08% )  (61.05%)
             56656528828      instructions                     #    2.07  insn per cycle              ( +-  0.08% )  (66.67%)
             10270389422      branches                         #  938.072 M/sec                       ( +-  0.10% )  (72.21%)
                 3453837      branch-misses                    #    0.03% of all branches             ( +-  3.75% )  (72.27%)
                              TopDownL1                 #     0.52 retiring                    ( +-  0.16% )  (66.55%)
                                                        #     0.27 frontend_bound              ( +-  0.09% )  (60.91%)
                                                        #     0.14 bad_speculation             ( +-  0.08% )  (49.85%)
                                                        #     0.07 backend_bound               ( +-  0.16% )  (33.33%)
             18982866028      L1-dcache-loads                  #    1.734 G/sec                       ( +-  0.24% )  (33.34%)
                 8802454      L1-dcache-load-misses            #    0.05% of all L1-dcache accesses   ( +- 52.30% )  (33.31%)
                 2612962      LLC-loads                        #  238.661 K/sec                       ( +- 29.78% )  (33.45%)
                  264107      LLC-load-misses                  #   10.11% of all LL-cache accesses    ( +- 18.34% )  (39.07%)
             15793205997      L1-icache-loads                  #    1.443 G/sec                       ( +-  0.15% )  (39.09%)
                 3930802      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  3.72% )  (44.66%)
             20097828496      dTLB-loads                       #    1.836 G/sec                       ( +-  0.09% )  (44.68%)
                  961757      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  3.32% )  (50.15%)
             15838728506      iTLB-loads                       #    1.447 G/sec                       ( +-  0.09% )  (55.62%)
                  167652      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.28% )  (55.52%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
                 11.6173 +- 0.0268 seconds time elapsed  ( +-  0.23% )
      
      [1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5d4fa9ec
    • bpf, arm64: Get rid of fpb · bd737fcb
      Xu Kuohai authored
      A bpf prog accesses the stack using BPF_FP as the base address and a
      negative immediate number as the offset. But arm64 ldr/str instructions
      only support a non-negative immediate number as the offset. To simplify
      the jited result, commit 5b3d19b9 ("bpf, arm64: Adjust the offset of
      str/ldr(immediate) to positive number") introduced FPB to represent the
      lowest stack address that the bpf prog being jited may access, and with
      this address as the baseline, it converts BPF_FP plus a negative
      immediate offset to FPB plus a non-negative immediate offset.

      For a given bpf prog, the jited stack space is fixed, with A64_SP as the
      lowest address and BPF_FP as the highest address. So we can get rid of
      FPB and convert BPF_FP plus a negative immediate offset to A64_SP plus a
      non-negative immediate offset.
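      The conversion is a simple re-basing (illustrative arithmetic only, not
      the jit's exact code): A64_SP sits stack_size bytes below BPF_FP, so an
      access at BPF_FP + off with a negative off becomes A64_SP +
      (stack_size + off), which is non-negative as long as the prog stays
      inside its stack:

        /* fp_off is in [-stack_size, 0); the returned SP-relative offset is
         * in [0, stack_size), which the arm64 ldr/str immediate forms can
         * encode (subject to the usual alignment/size limits). */
        static int fp_off_to_sp_off(int fp_off, int stack_size)
        {
                return stack_size + fp_off;
        }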
      Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
      Link: https://lore.kernel.org/r/20240826071624.350108-2-xukuohai@huaweicloud.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      bd737fcb