1. 13 Dec, 2019 22 commits
    • Merge branch 'bpf-dispatcher' · 02620d9e
      Alexei Starovoitov authored
      Björn Töpel says:
      
      ====================
      Overview
      ========
      
      This is the 6th iteration of the series that introduces the BPF
      dispatcher, which is a mechanism to avoid indirect calls.
      
      The BPF dispatcher is a multi-way branch code generator, targeted for
      BPF programs. E.g., when an XDP program is executed via
      bpf_prog_run_xdp(), it is invoked via an indirect call. With
      retpolines enabled, the indirect call has a substantial performance
      impact. The dispatcher is a mechanism that transforms indirect calls
      to direct calls, and therefore avoids the retpoline. The dispatcher is
      generated using the BPF JIT, and relies on text poking provided by
      bpf_arch_text_poke().
      
      The dispatcher hijacks a trampoline function via the trampoline's
      __fentry__ nop. One dispatcher instance currently supports up to 48
      dispatch points. This can be extended in the future.
      
      In this series, only one dispatcher instance is supported, and the
      only user is XDP. The dispatcher is updated when an XDP program is
      attached/detached to/from a netdev. An alternative would have been to
      update the dispatcher at program load time, but since there are
      usually more XDP programs loaded than attached, the attach-time
      update was picked.
      
      The XDP dispatcher is always enabled, if available, because it helps
      even when retpolines are disabled. Please refer to the "Performance"
      section below.
      
      The first patch refactors the image allocation from the BPF trampoline
      code. Patch two introduces the dispatcher, and patch three adds a
      dispatcher for XDP and wires up the XDP control and fast paths. Patch
      four adds the dispatcher to BPF_TEST_RUN. Patch five adds a simple
      selftest, and the last adds alignment to jump targets.
      
      I have rebased the series on commit 679152d3 ("libbpf: Fix printf
      compilation warnings on ppc64le arch").
      
      Generated code, x86-64
      ======================
      
      The dispatcher currently has a maximum of 48 entries, where one entry
      is a unique BPF program. Multiple users of a dispatcher instance using
      the same BPF program will share that entry.
      
      The program/slot lookup is performed by a binary search, O(log n).
      Let's have a look at the generated code.
      
      The trampoline function has the following signature:
      
        unsigned int tramp(const void *ctx,
                           const struct bpf_insn *insnsi,
                           unsigned int (*bpf_func)(const void *,
                                                    const struct bpf_insn *))
      
      On Intel x86-64 this means that rdx will contain the bpf_func. To
      make it easier to read, I've let the BPF programs have the following
      range: 0xffffffffffffffff (-1) to 0xfffffffffffffff0
      (-16). 0xffffffff81c00f10 is the retpoline thunk, in this case
      __x86_indirect_thunk_rdx. If retpolines are disabled the thunk will be
      a regular indirect call.
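
      As a reading aid, here is a rough C analogue of what the generated
      code does. This is a sketch for illustration only: the real
      dispatcher is emitted as machine code by the JIT, its table is
      sorted at generation time, and the names (slots, used, dispatch)
      are made up:

        typedef unsigned int (*bpf_func_t)(const void *ctx,
                                           const void *insnsi);

        static bpf_func_t slots[48]; /* sorted program entry points */
        static int used;             /* number of populated slots */

        static unsigned int dispatch(const void *ctx, const void *insnsi,
                                     bpf_func_t bpf_func)
        {
                int lo = 0, hi = used - 1;

                /* the emitted cmp/jg tree is this binary search,
                 * unrolled; pointer ordering shown loosely */
                while (lo <= hi) {
                        int mid = (lo + hi) / 2;

                        if (slots[mid] == bpf_func)
                                /* in the JITed image: a direct jump */
                                return bpf_func(ctx, insnsi);
                        if ((unsigned long)slots[mid] <
                            (unsigned long)bpf_func)
                                lo = mid + 1;
                        else
                                hi = mid - 1;
                }
                /* miss: fall back to the (retpolined) indirect call */
                return bpf_func(ctx, insnsi);
        }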
      
      The minimal dispatcher will then look like this:
      
      ffffffffc0002000: cmp    rdx,0xffffffffffffffff
      ffffffffc0002007: je     0xffffffffffffffff ; -1
      ffffffffc000200d: jmp    0xffffffff81c00f10
      
      A 16 entry dispatcher looks like this:
      
      ffffffffc0020000: cmp    rdx,0xfffffffffffffff7 ; -9
      ffffffffc0020007: jg     0xffffffffc0020130
      ffffffffc002000d: cmp    rdx,0xfffffffffffffff3 ; -13
      ffffffffc0020014: jg     0xffffffffc00200a0
      ffffffffc002001a: cmp    rdx,0xfffffffffffffff1 ; -15
      ffffffffc0020021: jg     0xffffffffc0020060
      ffffffffc0020023: cmp    rdx,0xfffffffffffffff0 ; -16
      ffffffffc002002a: jg     0xffffffffc0020040
      ffffffffc002002c: cmp    rdx,0xfffffffffffffff0 ; -16
      ffffffffc0020033: je     0xfffffffffffffff0 ; -16
      ffffffffc0020039: jmp    0xffffffff81c00f10
      ffffffffc002003e: xchg   ax,ax
      ffffffffc0020040: cmp    rdx,0xfffffffffffffff1 ; -15
      ffffffffc0020047: je     0xfffffffffffffff1 ; -15
      ffffffffc002004d: jmp    0xffffffff81c00f10
      ffffffffc0020052: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002005a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020060: cmp    rdx,0xfffffffffffffff2 ; -14
      ffffffffc0020067: jg     0xffffffffc0020080
      ffffffffc0020069: cmp    rdx,0xfffffffffffffff2 ; -14
      ffffffffc0020070: je     0xfffffffffffffff2 ; -14
      ffffffffc0020076: jmp    0xffffffff81c00f10
      ffffffffc002007b: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020080: cmp    rdx,0xfffffffffffffff3 ; -13
      ffffffffc0020087: je     0xfffffffffffffff3 ; -13
      ffffffffc002008d: jmp    0xffffffff81c00f10
      ffffffffc0020092: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002009a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc00200a0: cmp    rdx,0xfffffffffffffff5 ; -11
      ffffffffc00200a7: jg     0xffffffffc00200f0
      ffffffffc00200a9: cmp    rdx,0xfffffffffffffff4 ; -12
      ffffffffc00200b0: jg     0xffffffffc00200d0
      ffffffffc00200b2: cmp    rdx,0xfffffffffffffff4 ; -12
      ffffffffc00200b9: je     0xfffffffffffffff4 ; -12
      ffffffffc00200bf: jmp    0xffffffff81c00f10
      ffffffffc00200c4: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00200cc: nop    DWORD PTR [rax+0x0]
      ffffffffc00200d0: cmp    rdx,0xfffffffffffffff5 ; -11
      ffffffffc00200d7: je     0xfffffffffffffff5 ; -11
      ffffffffc00200dd: jmp    0xffffffff81c00f10
      ffffffffc00200e2: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00200ea: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc00200f0: cmp    rdx,0xfffffffffffffff6 ; -10
      ffffffffc00200f7: jg     0xffffffffc0020110
      ffffffffc00200f9: cmp    rdx,0xfffffffffffffff6 ; -10
      ffffffffc0020100: je     0xfffffffffffffff6 ; -10
      ffffffffc0020106: jmp    0xffffffff81c00f10
      ffffffffc002010b: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020110: cmp    rdx,0xfffffffffffffff7 ; -9
      ffffffffc0020117: je     0xfffffffffffffff7 ; -9
      ffffffffc002011d: jmp    0xffffffff81c00f10
      ffffffffc0020122: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002012a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020130: cmp    rdx,0xfffffffffffffffb ; -5
      ffffffffc0020137: jg     0xffffffffc00201d0
      ffffffffc002013d: cmp    rdx,0xfffffffffffffff9 ; -7
      ffffffffc0020144: jg     0xffffffffc0020190
      ffffffffc0020146: cmp    rdx,0xfffffffffffffff8 ; -8
      ffffffffc002014d: jg     0xffffffffc0020170
      ffffffffc002014f: cmp    rdx,0xfffffffffffffff8 ; -8
      ffffffffc0020156: je     0xfffffffffffffff8 ; -8
      ffffffffc002015c: jmp    0xffffffff81c00f10
      ffffffffc0020161: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020169: nop    DWORD PTR [rax+0x0]
      ffffffffc0020170: cmp    rdx,0xfffffffffffffff9 ; -7
      ffffffffc0020177: je     0xfffffffffffffff9 ; -7
      ffffffffc002017d: jmp    0xffffffff81c00f10
      ffffffffc0020182: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002018a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020190: cmp    rdx,0xfffffffffffffffa ; -6
      ffffffffc0020197: jg     0xffffffffc00201b0
      ffffffffc0020199: cmp    rdx,0xfffffffffffffffa ; -6
      ffffffffc00201a0: je     0xfffffffffffffffa ; -6
      ffffffffc00201a6: jmp    0xffffffff81c00f10
      ffffffffc00201ab: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00201b0: cmp    rdx,0xfffffffffffffffb ; -5
      ffffffffc00201b7: je     0xfffffffffffffffb ; -5
      ffffffffc00201bd: jmp    0xffffffff81c00f10
      ffffffffc00201c2: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00201ca: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc00201d0: cmp    rdx,0xfffffffffffffffd ; -3
      ffffffffc00201d7: jg     0xffffffffc0020220
      ffffffffc00201d9: cmp    rdx,0xfffffffffffffffc ; -4
      ffffffffc00201e0: jg     0xffffffffc0020200
      ffffffffc00201e2: cmp    rdx,0xfffffffffffffffc ; -4
      ffffffffc00201e9: je     0xfffffffffffffffc ; -4
      ffffffffc00201ef: jmp    0xffffffff81c00f10
      ffffffffc00201f4: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00201fc: nop    DWORD PTR [rax+0x0]
      ffffffffc0020200: cmp    rdx,0xfffffffffffffffd ; -3
      ffffffffc0020207: je     0xfffffffffffffffd ; -3
      ffffffffc002020d: jmp    0xffffffff81c00f10
      ffffffffc0020212: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002021a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020220: cmp    rdx,0xfffffffffffffffe ; -2
      ffffffffc0020227: jg     0xffffffffc0020240
      ffffffffc0020229: cmp    rdx,0xfffffffffffffffe ; -2
      ffffffffc0020230: je     0xfffffffffffffffe ; -2
      ffffffffc0020236: jmp    0xffffffff81c00f10
      ffffffffc002023b: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020240: cmp    rdx,0xffffffffffffffff ; -1
      ffffffffc0020247: je     0xffffffffffffffff ; -1
      ffffffffc002024d: jmp    0xffffffff81c00f10
      
      The nops are there to align jump targets to 16 B.
      
      Performance
      ===========
      
      The tests were performed using the xdp_rxq_info sample program and
      the xdp_perf selftest, with the following command lines:
      
      1. XDP_DRV:
        # xdp_rxq_info --dev eth0 --action XDP_DROP
      2. XDP_SKB:
        # xdp_rxq_info --dev eth0 -S --action XDP_DROP
      3. xdp-perf, from selftests/bpf:
        # test_progs -v -t xdp_perf
      
      Run with mitigations=auto
      -------------------------
      
      Baseline:
      1. 21.7 Mpps (21736190)
      2. 3.8 Mpps   (3837582)
      3. 15 ns
      
      Dispatcher:
      1. 30.2 Mpps (30176320)
      2. 4.0 Mpps   (4015579)
      3. 5 ns
      
      Dispatcher (full; walk all entries, and fallback):
      1. 22.0 Mpps (21986704)
      2. 3.8 Mpps   (3831298)
      3. 17 ns
      
      Run with mitigations=off
      ------------------------
      
      Baseline:
      1. 29.9 Mpps (29875135)
      2. 4.1 Mpps   (4100179)
      3. 4 ns
      
      Dispatcher:
      1. 30.4 Mpps (30439241)
      2. 4.1 Mpps   (4109350)
      3. 4 ns
      
      Dispatcher (full; walk all entries, and fallback):
      1. 28.9 Mpps (28903269)
      2. 4.1 Mpps   (4080078)
      3. 5 ns
      
      xdp-perf runs, aligned vs non-aligned jump targets
      --------------------------------------------------
      
      In this test dispatchers of different sizes, with and without jump
      target alignment, were exercised. As outlined above the function
      lookup is performed via binary search. This means that depending on
      the pointer value of the function, it can reside in the upper or lower
      part of the search table. The performed tests were:
      
      1. aligned, mitigations=auto, function entry < other entries
      2. aligned, mitigations=auto, function entry > other entries
      3. non-aligned, mitigations=auto, function entry < other entries
      4. non-aligned, mitigations=auto, function entry > other entries
      5. aligned, mitigations=off, function entry < other entries
      6. aligned, mitigations=off, function entry > other entries
      7. non-aligned, mitigations=off, function entry < other entries
      8. non-aligned, mitigations=off, function entry > other entries
      
      The micro benchmarks showed that alignment of jump targets has some
      positive impact.
      
      A reply to this cover letter will contain complete data for all runs.
      
      Multiple xdp-perf baseline with mitigations=auto
      ------------------------------------------------
      
       Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):
      
                   16.69 msec task-clock                #    0.984 CPUs utilized            ( +-  0.08% )
                       2      context-switches          #    0.123 K/sec                    ( +-  1.11% )
                       0      cpu-migrations            #    0.000 K/sec                    ( +- 70.68% )
                      97      page-faults               #    0.006 M/sec                    ( +-  0.05% )
              49,254,635      cycles                    #    2.951 GHz                      ( +-  0.09% )  (12.28%)
              42,138,558      instructions              #    0.86  insn per cycle           ( +-  0.02% )  (36.15%)
               7,315,291      branches                  #  438.300 M/sec                    ( +-  0.01% )  (59.43%)
               1,011,201      branch-misses             #   13.82% of all branches          ( +-  0.01% )  (83.31%)
              15,440,788      L1-dcache-loads           #  925.143 M/sec                    ( +-  0.00% )  (99.40%)
                  39,067      L1-dcache-load-misses     #    0.25% of all L1-dcache hits    ( +-  0.04% )
                   6,531      LLC-loads                 #    0.391 M/sec                    ( +-  0.05% )
                     442      LLC-load-misses           #    6.76% of all LL-cache hits     ( +-  0.77% )
         <not supported>      L1-icache-loads
                  57,964      L1-icache-load-misses                                         ( +-  0.06% )
              15,442,496      dTLB-loads                #  925.246 M/sec                    ( +-  0.00% )
                     514      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  0.73% )  (40.57%)
                     130      iTLB-loads                #    0.008 M/sec                    ( +-  2.75% )  (16.69%)
           <not counted>      iTLB-load-misses                                              ( +-  8.71% )  (0.60%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
               0.0169558 +- 0.0000127 seconds time elapsed  ( +-  0.07% )
      
      Multiple xdp-perf dispatcher with mitigations=auto
      --------------------------------------------------
      
      Note that this includes generating the dispatcher.
      
       Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):
      
                    4.80 msec task-clock                #    0.953 CPUs utilized            ( +-  0.06% )
                       1      context-switches          #    0.258 K/sec                    ( +-  1.57% )
                       0      cpu-migrations            #    0.000 K/sec
                      97      page-faults               #    0.020 M/sec                    ( +-  0.05% )
              14,185,861      cycles                    #    2.955 GHz                      ( +-  0.17% )  (50.49%)
              45,691,935      instructions              #    3.22  insn per cycle           ( +-  0.01% )  (99.19%)
               8,346,008      branches                  # 1738.709 M/sec                    ( +-  0.00% )
                  13,046      branch-misses             #    0.16% of all branches          ( +-  0.10% )
              15,443,735      L1-dcache-loads           # 3217.365 M/sec                    ( +-  0.00% )
                  39,585      L1-dcache-load-misses     #    0.26% of all L1-dcache hits    ( +-  0.05% )
                   7,138      LLC-loads                 #    1.487 M/sec                    ( +-  0.06% )
                     671      LLC-load-misses           #    9.40% of all LL-cache hits     ( +-  0.73% )
         <not supported>      L1-icache-loads
                  56,213      L1-icache-load-misses                                         ( +-  0.08% )
              15,443,735      dTLB-loads                # 3217.365 M/sec                    ( +-  0.00% )
           <not counted>      dTLB-load-misses                                              (0.00%)
           <not counted>      iTLB-loads                                                    (0.00%)
           <not counted>      iTLB-load-misses                                              (0.00%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
              0.00503705 +- 0.00000546 seconds time elapsed  ( +-  0.11% )
      
      Revisions
      =========
      
      v4->v5: [1]
        * Fixed s/xdp_ctx/ctx/ typo (Toke)
        * Marked dispatcher trampoline with noinline attribute (Alexei)
      
      v3->v4: [2]
        * Moved away from doing dispatcher lookup based on the trampoline
          function, to a model where the dispatcher instance is explicitly
          passed to the bpf_dispatcher_change_prog() (Alexei)
      
      v2->v3: [3]
        * Removed xdp_call, and instead make the dispatcher available to all
          XDP users via bpf_prog_run_xdp() and dev_xdp_install(). (Toke)
        * Always enable the dispatcher, if available (Alexei)
        * Reuse BPF trampoline image allocator (Alexei)
        * Make sure the dispatcher is exercised in selftests (Alexei)
        * Only allow one dispatcher, and wire it to XDP
      
      v1->v2: [4]
        * Fixed i386 build warning (kbuild robot)
        * Made bpf_dispatcher_lookup() static (kbuild robot)
        * Make sure xdp_call.h is only enabled for builtins
        * Add xdp_call() to ixgbe, mlx4, and mlx5
      
      RFC->v1: [5]
        * Improved error handling (Edward and Andrii)
        * Explicit cleanup (Andrii)
        * Use 32B with sext cmp (Alexei)
        * Align jump targets to 16B (Alexei)
        * 4 to 16 entries (Toke)
        * Added stats to xdp_call_run()
      
      [1] https://lore.kernel.org/bpf/20191211123017.13212-1-bjorn.topel@gmail.com/
      [2] https://lore.kernel.org/bpf/20191209135522.16576-1-bjorn.topel@gmail.com/
      [3] https://lore.kernel.org/bpf/20191123071226.6501-1-bjorn.topel@gmail.com/
      [4] https://lore.kernel.org/bpf/20191119160757.27714-1-bjorn.topel@gmail.com/
      [5] https://lore.kernel.org/bpf/20191113204737.31623-1-bjorn.topel@gmail.com/
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf, x86: Align dispatcher branch targets to 16B · 116eb788
      Björn Töpel authored
      From Intel 64 and IA-32 Architectures Optimization Reference Manual,
      3.4.1.4 Code Alignment, Assembly/Compiler Coding Rule 11: All branch
      targets should be 16-byte aligned.
      
      This commit aligns branch targets according to the Intel manual.
      
      The nops used to align branch targets make the dispatcher larger, and
      therefore the number of supported dispatch points/programs is
      decreased from 64 to 48.
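
      For illustration, the number of padding bytes needed ahead of a
      branch target can be computed as below (a standalone sketch;
      pad_to_16() is a made-up helper name, not kernel code):

        /* bytes of nop padding so that addr + pad is 16-byte aligned */
        static unsigned int pad_to_16(unsigned long addr)
        {
                return (16 - (addr & 15)) & 15;
        }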
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-7-bjorn.topel@gmail.com
    • selftests: bpf: Add xdp_perf test · e754f5a6
      Björn Töpel authored
      The xdp_perf is a dummy XDP test, only used to measure the cost of
      jumping into a naive XDP program one million times.
      
      To build and run the program:
        $ cd tools/testing/selftests/bpf
        $ make
        $ ./test_progs -v -t xdp_perf
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-6-bjorn.topel@gmail.com
    • bpf: Start using the BPF dispatcher in BPF_TEST_RUN · f23c4b39
      Björn Töpel authored
      In order to properly exercise the BPF dispatcher, this commit adds BPF
      dispatcher usage to BPF_TEST_RUN when executing XDP programs.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-5-bjorn.topel@gmail.com
    • bpf, xdp: Start using the BPF dispatcher for XDP · 7e6897f9
      Björn Töpel authored
      This commit adds a BPF dispatcher for XDP. The dispatcher is updated
      from the XDP control-path, dev_xdp_install(), and used when an XDP
      program is run via bpf_prog_run_xdp().
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-4-bjorn.topel@gmail.com
    • bpf: Introduce BPF dispatcher · 75ccbef6
      Björn Töpel authored
      The BPF dispatcher is a multi-way branch code generator, mainly
      targeted for XDP programs. When an XDP program is executed via
      bpf_prog_run_xdp(), it is invoked via an indirect call. The indirect
      call has a substantial performance impact when retpolines are
      enabled. The dispatcher transforms indirect calls to direct calls, and
      therefore avoids the retpoline. The dispatcher is generated using the
      BPF JIT, and relies on text poking provided by bpf_arch_text_poke().
      
      The dispatcher hijacks a trampoline function via the trampoline's
      __fentry__ nop. One dispatcher instance currently supports up to 64
      dispatch points. A user creates a dispatcher with its corresponding
      trampoline with the DEFINE_BPF_DISPATCHER macro.
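
      Expanded, the macro's trampoline is essentially the fallback stub
      below (the name is illustrative; the signature follows the cover
      letter, and the dispatcher later redirects the stub's __fentry__
      entry via bpf_arch_text_poke()):

        noinline unsigned int bpf_dispatcher_xdp_func(const void *ctx,
                        const struct bpf_insn *insnsi,
                        unsigned int (*bpf_func)(const void *,
                                                 const struct bpf_insn *))
        {
                /* until patched: plain (retpolined) indirect call */
                return bpf_func(ctx, insnsi);
        }

      Updates to the dispatch table go through bpf_dispatcher_change_prog()
      on the control path.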
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-3-bjorn.topel@gmail.com
    • bpf: Move trampoline JIT image allocation to a function · 98e8627e
      Björn Töpel authored
      Refactor the image allocation in the BPF trampoline code into a
      separate function, so it can be shared with the BPF dispatcher in
      upcoming commits.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-2-bjorn.topel@gmail.com
    • selftests/bpf: Fix perf_buffer test on systems w/ offline CPUs · 91cbdf74
      Andrii Nakryiko authored
      Fix up perf_buffer.c selftest to take into account offline/missing CPUs.
      
      Fixes: ee5cf82c ("selftests/bpf: test perf buffer API")
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212013621.1691858-1-andriin@fb.com
    • libbpf: Don't attach perf_buffer to offline/missing CPUs · 783b8f01
      Andrii Nakryiko authored
      It's quite common on some systems to have more CPUs enlisted as
      "possible" than there are (and could ever be) present/online CPUs. In
      such cases, perf_buffer creation will fail due to the inability to
      create a perf event on a missing CPU, with an error like this:
      
      libbpf: failed to open perf buffer event on cpu #16: No such device
      
      This patch fixes the logic of perf_buffer__new() to ignore CPUs that are
      missing or currently offline. In the rare case where the user has
      explicitly listed specific CPUs to connect to, the behavior is
      unchanged: libbpf will try to open the perf event buffer on the
      specified CPU(s) anyway.
      
      Fixes: fb84b822 ("libbpf: add perf buffer API")
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212013609.1691168-1-andriin@fb.com
    • selftests/bpf: Add CPU mask parsing tests · 65bc4c40
      Andrii Nakryiko authored
      Add a bunch of tests validating CPU mask parsing logic and error handling.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212013559.1690898-1-andriin@fb.com
    • libbpf: Extract and generalize CPU mask parsing logic · 6803ee25
      Andrii Nakryiko authored
      This logic is re-used for parsing a set of online CPUs. Having it as an
      isolated piece of code working on an input string makes it convenient
      to test this logic as well. While refactoring, also improve the
      robustness of the original implementation.
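
      For reference, here is a minimal standalone sketch of the kind of
      parsing involved (not libbpf's actual implementation): turn a CPU
      list string such as "0-5,8,10-11" into a boolean mask.

        #include <stdbool.h>
        #include <stdlib.h>

        /* parse "0-5,8,10-11" into mask[0..n-1]; 0 on success */
        static int parse_cpu_list(const char *s, bool *mask, int n)
        {
                while (*s) {
                        char *end;
                        long lo = strtol(s, &end, 10), hi = lo;

                        if (end == s)
                                return -1;      /* not a number */
                        if (*end == '-') {
                                s = end + 1;
                                hi = strtol(s, &end, 10);
                                if (end == s)
                                        return -1;
                        }
                        if (lo < 0 || hi >= n || lo > hi)
                                return -1;      /* out of range */
                        for (long i = lo; i <= hi; i++)
                                mask[i] = true;
                        s = (*end == ',') ? end + 1 : end;
                }
                return 0;
        }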
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212013548.1690564-1-andriin@fb.com
    • Merge branch 'reuseport_to_test_progs' · 7708bd43
      Alexei Starovoitov authored
      Jakub Sitnicki says:
      
      ====================
      This change has been suggested by Martin Lau [0] during a review of a
      related patch set that extends reuseport tests [1].
      
      Patches 1 & 2 address a warning due to unrecognized section name from
      libbpf when running reuseport tests. We don't want to carry this warning
      into test_progs.
      
      Patches 3-8 massage the reuseport tests to ease the switch to test_progs
      framework. The intention here is to show the work. Happy to squash these,
      if needed.
      
      Patches 9-10 do the actual move and conversion to test_progs.
      
      Output from a test_progs run after changes pasted below.
      
      Thanks,
      Jakub
      
      [0] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/T/#m607d822caeb1eb5db101172821a78cc3896ff1c3
      [1] https://lore.kernel.org/bpf/20191123110751.6729-1-jakub@cloudflare.com/T/#m55881bae9fb6e34837d07a0c0a7ffbc138f8d06f
      ====================
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • selftests/bpf: Switch reuseport tests for test_progs framework · 7ee0d4e9
      Jakub Sitnicki authored
      The tests were originally written in abort-on-error style. With the switch
      to test_progs we can no longer do that. So at the risk of not cleaning up
      some resource on failure, we now return to the caller on error.
      
      That said, failure inside one test should not affect others because we run
      setup/cleanup before/after every test.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-11-jakub@cloudflare.com
    • selftests/bpf: Move reuseport tests under prog_tests/ · 415bb4e1
      Jakub Sitnicki authored
      Do a pure move to show the actual work needed to adapt the tests in the
      subsequent patch, at the cost of breaking the test_progs build for the
      moment.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-10-jakub@cloudflare.com
    • selftests/bpf: Pull up printing the test name into test runner · 250a91d4
      Jakub Sitnicki authored
      Again, prepare for switching reuseport tests to test_progs framework.
      test_progs framework will print the subtest name for us if we set it.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-9-jakub@cloudflare.com
    • selftests/bpf: Propagate errors during setup for reuseport tests · 9af6c844
      Jakub Sitnicki authored
      Prepare for switching reuseport tests to test_progs framework, where we
      don't have the luxury to terminate the process on failure.
      
      Modify setup helpers to signal failure via the return value with the help
      of a macro similar to the one currently in use by the tests.
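
      For flavor, such a macro can look roughly like this (an illustrative
      sketch, not the exact macro from the tests):

        #include <stdio.h>

        /* log and propagate failure instead of exit()-ing */
        #define RET_IF(cond, fmt, ...) do {                           \
                if (cond) {                                           \
                        fprintf(stderr, fmt "\n", ##__VA_ARGS__);     \
                        return -1;                                    \
                }                                                     \
        } while (0)

        /* usage: RET_IF(fd < 0, "socket: %s", strerror(errno)); */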
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-8-jakub@cloudflare.com
    • selftests/bpf: Run reuseport tests in a loop · ce7cb5f3
      Jakub Sitnicki authored
      Prepare for switching reuseport tests to test_progs framework. Loop over
      the tests and perform setup/cleanup for each test separately, remembering
      that with test_progs we can select tests to run.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-7-jakub@cloudflare.com
    • selftests/bpf: Unroll the main loop in reuseport test · 99363382
      Jakub Sitnicki authored
      Prepare for iterating over individual tests without introducing another
      nested loop in the main test function.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-6-jakub@cloudflare.com
    • selftests/bpf: Add helpers for getting socket family & type name · a9ce4cf4
      Jakub Sitnicki authored
      Having string arrays to map socket family & type to a name prevents us from
      unrolling the test runner loop in the subsequent patch. Introduce helpers
      that do the same thing.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-5-jakub@cloudflare.com
    • selftests/bpf: Use sa_family_t everywhere in reuseport tests · 11f80355
      Jakub Sitnicki authored
      Update the only function that is not using sa_family_t in this source file.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-4-jakub@cloudflare.com
    • selftests/bpf: Let libbpf determine program type from section name · 1fbcef92
      Jakub Sitnicki authored
      Now that libbpf can recognize SK_REUSEPORT programs, we no longer have to
      pass a prog_type hint before loading the object file.
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-3-jakub@cloudflare.com
    • libbpf: Recognize SK_REUSEPORT programs from section name · 67d69ccd
      Jakub Sitnicki authored
      Allow loading BPF object files that contain SK_REUSEPORT programs without
      having to manually set the program type before load if the section
      name is set to "sk_reuseport".
      
      Makes the user-space code needed to load an SK_REUSEPORT BPF program
      more concise.
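
      With this, loading such a program needs no prog_type hint; the
      section name alone is enough. A minimal illustrative program
      (assuming the usual libbpf helper header):

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("sk_reuseport")
        int select_sock(struct sk_reuseport_md *md)
        {
                return SK_PASS; /* keep the kernel's socket choice */
        }

        char _license[] SEC("license") = "GPL";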
      Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191212102259.418536-2-jakub@cloudflare.com
  3. 11 Dec, 2019 13 commits
    • bpf: Emit audit messages upon successful prog load and unload · bae141f5
      Daniel Borkmann authored
      Allow for audit messages to be emitted upon BPF program load and
      unload for having a timeline of events. The load itself is in
      syscall context, so additional info about the process initiating
      the BPF prog creation can be logged and later directly correlated
      to the unload event.
      
      The only info really needed from the BPF side is the globally unique
      prog ID, from which audit user space tooling can then query / dump
      all info needed about the specific BPF program right upon the load
      event and enrich the record; thus, the changes needed here can be
      kept small and non-intrusive to the core.
      
      Raw example output:
      
        # auditctl -D
        # auditctl -a always,exit -F arch=x86_64 -S bpf
        # ausearch --start recent -m 1334
        ...
        ----
        time->Wed Nov 27 16:04:13 2019
        type=PROCTITLE msg=audit(1574867053.120:84664): proctitle="./bpf"
        type=SYSCALL msg=audit(1574867053.120:84664): arch=c000003e syscall=321   \
          success=yes exit=3 a0=5 a1=7ffea484fbe0 a2=70 a3=0 items=0 ppid=7477    \
          pid=12698 auid=1001 uid=1001 gid=1001 euid=1001 suid=1001 fsuid=1001    \
          egid=1001 sgid=1001 fsgid=1001 tty=pts2 ses=4 comm="bpf"                \
          exe="/home/jolsa/auditd/audit-testsuite/tests/bpf/bpf"                  \
          subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key=(null)
        type=UNKNOWN[1334] msg=audit(1574867053.120:84664): prog-id=76 op=LOAD
        ----
        time->Wed Nov 27 16:04:13 2019
        type=UNKNOWN[1334] msg=audit(1574867053.120:84665): prog-id=76 op=UNLOAD
        ...
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Co-developed-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Paul Moore <paul@paul-moore.com>
      Link: https://lore.kernel.org/bpf/20191206214934.11319-1-jolsa@kernel.org
    • bpf: Switch to offsetofend in BPF_PROG_TEST_RUN · b590cb5f
      Stanislav Fomichev authored
      Switch the existing pattern of "offsetof(..., member) +
      FIELD_SIZEOF(..., member)" to "offsetofend(..., member)", which does
      exactly what we need without all the copy-paste.
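
      In plain C terms the equivalence is as follows (offsetofend() is a
      kernel helper; a local definition is shown here for illustration):

        #include <stddef.h>

        /* the kernel helper, reproduced for illustration */
        #define offsetofend(TYPE, MEMBER) \
                (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

        struct example { int a; short b; };

        /* old pattern: offsetof(struct example, b) + sizeof(short)
         * new pattern: offsetofend(struct example, b), i.e. 6 here */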
      Suggested-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191210191933.105321-1-sdf@google.com
    • libbpf: Bump libbpf current version to v0.0.7 · 09c4708d
      Andrii Nakryiko authored
      A new development cycle starts; bump to v0.0.7 proactively.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191209224022.3544519-1-andriin@fb.com
    • ARM: net: bpf: Improve prologue code sequence · c4533128
      Russell King authored
      Improve the prologue code sequence to be able to take advantage of
      64-bit stores, changing the code from:
      
        push    {r4, r5, r6, r7, r8, r9, fp, lr}
        mov     fp, sp
        sub     ip, sp, #80     ; 0x50
        sub     sp, sp, #600    ; 0x258
        str     ip, [fp, #-100] ; 0xffffff9c
        mov     r6, #0
        str     r6, [fp, #-96]  ; 0xffffffa0
        mov     r4, #0
        mov     r3, r4
        mov     r2, r0
        str     r4, [fp, #-104] ; 0xffffff98
        str     r4, [fp, #-108] ; 0xffffff94
      
      to the tighter:
      
        push    {r4, r5, r6, r7, r8, r9, fp, lr}
        mov     fp, sp
        mov     r3, #0
        sub     r2, sp, #80     ; 0x50
        sub     sp, sp, #600    ; 0x258
        strd    r2, [fp, #-100] ; 0xffffff9c
        mov     r2, #0
        strd    r2, [fp, #-108] ; 0xffffff94
        mov     r2, r0
      
      resulting in a saving of three instructions.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/E1ieH2g-0004ih-Rb@rmk-PC.armlinux.org.uk
    • cxgb4: add support for high priority filters · c2193999
      Shahjada Abul Husain authored
      T6 has a separate region known as the high priority filter region
      that allows classifying packets going through the ULD path. So,
      query firmware for HPFILTER resources and enable the high
      priority offload filter support when it is available.
      Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • enetc: remove variable 'tc_max_sized_frame' set but not used · 6525b5ef
      Chen Wandun authored
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      drivers/net/ethernet/freescale/enetc/enetc_qos.c: In function enetc_setup_tc_cbs:
      drivers/net/ethernet/freescale/enetc/enetc_qos.c:195:6: warning: variable tc_max_sized_frame set but not used [-Wunused-but-set-variable]
      
      Fixes: c431047c ("enetc: add support Credit Based Shaper(CBS) for hardware offload")
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfp: add support for TLV device stats · ca866ee8
      Jakub Kicinski authored
      Device stats are currently hard coded in the PCI BAR0 layout.
      Add the ability to read them from the TLV area instead.
      Names for the stats are maintained by the driver, and their
      meaning is documented. This allows us to more easily add and
      remove device stats.
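
      A generic TLV walk, for illustration only (the real NFP TLV layout
      and field widths are defined by the device ABI, not by this sketch):

        #include <stddef.h>
        #include <stdint.h>
        #include <string.h>

        struct tlv_hdr { uint16_t type; uint16_t len; /* value follows */ };

        /* return a pointer to the next TLV's value, or NULL at the
         * end of the area or on a truncated entry */
        static const uint8_t *next_tlv(const uint8_t *area, size_t area_len,
                                       size_t *off, struct tlv_hdr *h)
        {
                const uint8_t *p = area + *off;

                if (*off + sizeof(*h) > area_len)
                        return NULL;
                memcpy(h, p, sizeof(*h));          /* unaligned-safe */
                if (*off + sizeof(*h) + h->len > area_len)
                        return NULL;               /* truncated value */
                *off += sizeof(*h) + h->len;
                return p + sizeof(*h);             /* pointer to value */
        }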
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Cleanup duplicate initialization of sk->sk_state. · 5000b28b
      Kuniyuki Iwashima authored
      When a TCP socket is created, sk->sk_state is initialized twice as
      TCP_CLOSE in sock_init_data() and tcp_init_sock(). tcp_init_sock() is
      always called after sock_init_data(), so it is not necessary to update
      sk->sk_state in tcp_init_sock().

      Before v2.1.8, the code of the two functions was in inet_create(). In
      the v2.1.8 patch, tcp_v4/v6_init_sock() were added and the
      initialization of sk->sk_state was duplicated.
      Signed-off-by: Kuniyuki Iwashima <kuni1840@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • enetc: add software timestamping · 4caefbce
      Michael Walle authored
      Provide a software TX timestamp and add it to the ethtool query
      interface.
      
      skb_tx_timestamp() is also needed if one would like to use PHY
      timestamping.
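
      The call sits in the driver's transmit path, just before the frame
      is handed to hardware; a schematic sketch, not the actual enetc
      code:

        static netdev_tx_t sketch_xmit(struct sk_buff *skb,
                                       struct net_device *ndev)
        {
                /* ... map buffers and fill TX descriptors ... */
                skb_tx_timestamp(skb); /* SW (and PHY) TX timestamp hook */
                /* ... ring the doorbell ... */
                return NETDEV_TX_OK;
        }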
      Signed-off-by: Michael Walle <michael@walle.cc>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tipc-introduce-variable-window-congestion-control' · bb9d8454
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: introduce variable window congestion control
      
      We improve throughput greatly by introducing a variant of the Reno
      congestion control algorithm at the link level.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: introduce variable window congestion control · 16ad3f40
      Jon Maloy authored
      We introduce a simple variable window congestion control for links.
      The algorithm is inspired by the Reno algorithm, covering the 'slow
      start', 'congestion avoidance', and 'fast recovery' modes.
      
      - We introduce hard lower and upper window limits per link, still
        different and configurable per bearer type.
      
      - We introduce a 'slow start threshold' variable, initially set to
        the maximum window size.
      
      - We let a link start at the minimum congestion window, i.e. in slow
        start mode, and then let it grow rapidly (+1 per received ACK) until
        it reaches the slow start threshold and enters congestion avoidance
        mode (see the sketch after this list).
      
      - In congestion avoidance mode we increment the congestion window for
        each window-size number of acked packets, up to a possible maximum
        equal to the configured maximum window.
      
      - For each non-duplicate NACK received, we drop back to fast recovery
        mode, by setting both the slow start threshold and the congestion
        window to (current_congestion_window / 2).
      
      - If the timeout handler finds that the transmit queue has not moved
        since the previous timeout, it drops the link back to slow start
        and forces a probe containing the last sent sequence number to be
        sent to the peer, so that this can discover the stale situation.
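
      As promised above, here is a compact standalone sketch of the window
      update (made-up names; the real implementation lives in the TIPC
      link state machine):

        struct cc { unsigned int cwin, ssthresh, min_win, max_win, acked; };

        static void on_ack(struct cc *l, unsigned int acked_pkts)
        {
                if (l->cwin < l->ssthresh) {
                        l->cwin++;            /* slow start: +1 per ACK */
                } else {
                        /* avoidance: +1 per cwin ACKed packets */
                        l->acked += acked_pkts;
                        if (l->acked >= l->cwin) {
                                l->acked -= l->cwin;
                                l->cwin++;
                        }
                }
                if (l->cwin > l->max_win)
                        l->cwin = l->max_win;
        }

        static void on_nack(struct cc *l)     /* fast recovery */
        {
                l->ssthresh = l->cwin / 2;
                l->cwin = l->ssthresh > l->min_win ? l->ssthresh
                                                   : l->min_win;
        }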
      
      This change does in reality have effect only on unicast ethernet
      transport, as we have seen that there is no room whatsoever for
      increasing the window max size for the UDP bearer.
      For now, we also choose to keep the limits for the broadcast link
      unchanged and equal.
      
      This algorithm seems to give a 50-100% throughput improvement for
      messages larger than MTU.
      Suggested-by: Xin Long <lucien.xin@gmail.com>
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: eliminate more unnecessary nacks and retransmissions · d3b09995
      Jon Maloy authored
      When we increase the link transmit window we often observe the following
      scenario:
      
      1) A STATE message bypasses a sequence of traffic packets and arrives
         far ahead of those at the receiver. STATE messages contain a
         'peers_snd_nxt' field to indicate which was the last packet sent
         from the peer. This mechanism is intended as a last resort for the
         receiver to detect missing packets, e.g., during very low traffic
         when there is no packet flow to help early loss detection.
      2) The receiving link compares the 'peers_snd_nxt' field to its own
         'rcv_nxt', finds that there is a gap, and immediately sends a
         NACK message back to the peer.
      3) When this NACK arrives at the sender, all the requested
         retransmissions are performed, since it is a first-time request.
      
      Just like in the scenario described in the previous commit, this leads
      to many redundant retransmissions, with decreased throughput as a
      consequence.
      
      We fix this by adding two more conditions before we send a NACK in
      this situation. First, the deferred queue must be empty, so we cannot
      assume that the potential packet loss has already been detected by
      other means. Second, we check the 'peers_snd_nxt' field only in
      probe/probe_reply messages, thus turning this into a true mechanism
      of last resort as it was really meant to be.
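
      In rough pseudo-C, the added guard reads as follows (illustrative
      names, not the actual TIPC identifiers):

        /* only treat a peers_snd_nxt gap as loss as a last resort */
        if (gap_to_peers_snd_nxt &&
            deferred_queue_is_empty &&        /* no loss seen already */
            (msg_is_probe || msg_is_probe_reply))
                send_nack();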
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: eliminate gap indicator from ACK messages · 02288248
      Jon Maloy authored
      When we increase the link send window we sometimes observe the
      following scenario:
      
      1) A packet #N arrives out of order far ahead of a sequence of older
         packets which are still under way. The packet is added to the
         deferred queue.
      2) The missing packets arrive in sequence, and for each 16th of them
         an ACK is sent back to the sender, as it should be.
      3) When building those ACK messages, it is checked if there is a gap
         between the link's 'rcv_nxt' and the first packet in the deferred
         queue. This is always the case until packet number #N-1 arrives, and
         a 'gap' indicator is added, effectively turning them into NACK
         messages.
      4) When those NACKs arrive at the sender, all the requested
         retransmissions are done, since it is a first-time request.
      
      This sometimes leads to a huge amount of redundant retransmissions,
      causing a drop in max throughput. This problem gets worse when we,
      in a later commit, introduce variable window congestion control,
      since it drops the link back to 'fast recovery' much more often
      than necessary.
      
      We now fix this by not sending any 'gap' indicator in regular ACK
      messages. We already have a mechanism for sending explicit NACKs
      in place, and this is sufficient to keep up the packet flow.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>