1. 25 Jun, 2020 4 commits
  2. 24 Jun, 2020 7 commits
  3. 23 Jun, 2020 15 commits
    • tools, bpftool: Correctly evaluate $(BUILD_BPF_SKELS) in Makefile · 9d9d8cc2
      Tobias Klauser authored
      Currently, if the clang-bpf-co-re feature is not available, the build
      fails with e.g.
      
        CC       prog.o
      prog.c:1462:10: fatal error: profiler.skel.h: No such file or directory
       1462 | #include "profiler.skel.h"
            |          ^~~~~~~~~~~~~~~~~
      
      This is due to the fact that the BPFTOOL_WITHOUT_SKELETONS macro is not
      defined, despite BUILD_BPF_SKELS not being set. Fix this by correctly
      evaluating $(BUILD_BPF_SKELS) when deciding on whether to add
      -DBPFTOOL_WITHOUT_SKELETONS to CFLAGS.
      
      Fixes: 05aca6da ("tools/bpftool: Generalize BPF skeleton support and generate vmlinux.h")
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200623103710.10370-1-tklauser@distanz.ch
      9d9d8cc2
    • selftests/bpf: Add variable-length data concat pattern less than test · 2fde1747
      John Fastabend authored
      Extend the original variable-length tests with a case that catches a common
      existing pattern: testing for < 0 to detect errors. Note that because the
      verifier also tracks upper bounds, and we know the length cannot be greater
      than MAX_LEN here, we can skip the upper bound check.
      
      In ALU64-enabled compilation, converting the probe helpers' return types
      from long to int results in an extra instruction pattern: <<= 32, s>>= 32.
      The trade-off is that the non-ALU64 case works. If you really care about
      every extra insn (XDP case?), then you should probably be using the original
      int type.
      
      In addition, adding a sext insn to BPF might help the verifier in the
      general case and avoid these kinds of tricks.
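      A minimal userspace sketch of the `< 0` check pattern described above.
      fake_read_str() is a hypothetical stand-in for a
      bpf_probe_read_kernel_str()-like helper, and MAX_LEN = 256 is an assumed
      value. In a real BPF program it is the verifier, not this code, that proves
      the bounds.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_LEN 256	/* assumed buffer limit, for illustration only */

/* Hypothetical stand-in for a bpf_probe_read_kernel_str()-like helper:
 * copies a NUL-terminated string and returns the number of bytes written
 * (including the NUL), or a negative errno on failure.
 */
static long fake_read_str(char *dst, unsigned int size, const char *src)
{
	unsigned int n = 0;

	if (!src || size == 0)
		return -EFAULT;
	while (n < size - 1 && src[n]) {
		dst[n] = src[n];
		n++;
	}
	dst[n] = '\0';
	return (long)n + 1;
}

/* Concatenate two strings into buf (at least 2 * MAX_LEN bytes).
 * Only the signed "< 0" error check is needed: a successful read never
 * returns more than MAX_LEN, so no upper bound check follows.
 */
static long concat_two(char *buf, const char *a, const char *b)
{
	char *payload = buf;
	long len;

	len = fake_read_str(payload, MAX_LEN, a);
	if (len >= 0)
		payload += len;

	len = fake_read_str(payload, MAX_LEN, b);
	if (len >= 0)
		payload += len;

	return payload - buf;	/* total bytes actually used */
}
```

      Calling concat_two() with one unreadable (NULL) source simply skips that
      piece, so the total only counts data that was actually read.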
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200623032224.4020118-3-andriin@fb.com
      2fde1747
    • selftests/bpf: Add variable-length data concatenation pattern test · 5e85c6bb
      Andrii Nakryiko authored
      Add a selftest that validates variable-length data reading and concatenation
      into one big shared data array. This is a common pattern in production
      monitoring and tracing applications, which can potentially read a lot of data
      but usually read much less. This pattern makes it possible to determine
      precisely how much data needs to be sent over perfbuf/ringbuf, maximizing
      efficiency.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200623032224.4020118-2-andriin@fb.com
      5e85c6bb
    • bpf: Switch most helper return values from 32-bit int to 64-bit long · bdb7b79b
      Andrii Nakryiko authored
      Switch most of BPF helper definitions from returning int to long. These
      definitions are coming from comments in BPF UAPI header and are used to
      generate bpf_helper_defs.h (under libbpf) to be later included and used from
      BPF programs.
      
      In actual in-kernel implementation, all the helpers are defined as returning
      u64, but due to some historical reasons, most of them are actually defined as
      returning int in UAPI (usually, to return 0 on success, and negative value on
      error).
      
      This quite often causes Clang to generate sub-optimal code, because the
      compiler believes that the return value is 32-bit, so in a lot of cases it
      has to be up-converted to a 64-bit value (usually with a pair of 32-bit
      shifts) before it can be used further in BPF code.
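      As a rough illustration of what such a shift pair does to a 64-bit
      register, here is a userspace C model. The function names are made up for
      this sketch; zext32() mirrors `r <<= 32; r >>= 32` and sext32() mirrors
      `r <<= 32; r s>>= 32`.

```c
#include <stdint.h>

/* Models the zero-extending pair `r <<= 32; r >>= 32`:
 * the upper 32 bits are cleared, the lower 32 bits are kept.
 */
static uint64_t zext32(uint64_t r)
{
	r <<= 32;
	r >>= 32;	/* logical shift on an unsigned value */
	return r;
}

/* Models the sign-extending pair `r <<= 32; r s>>= 32`.
 * Written with casts instead of a signed shift; this relies on the usual
 * two's complement conversion that GCC and Clang implement.
 */
static int64_t sext32(uint64_t r)
{
	return (int64_t)(int32_t)(uint32_t)r;
}
```

      For a register holding 0xfffffff2 in its low half, the first form yields a
      large positive number while the second yields -14, which is why the
      compiler cannot drop the extension when it believes the value is an int.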
      
      Besides just "polluting" the code, these 32-bit shifts quite often cause
      problems for cases in which return value matters. This is especially the case
      for the family of bpf_probe_read_str() functions. There are few other similar
      helpers (e.g., bpf_read_branch_records()), in which return value is used by
      BPF program logic to record variable-length data and process it. For such
      cases, BPF program logic carefully manages offsets within some array or map to
      read variable-length data. For such uses, it's crucial for BPF verifier to
      track possible range of register values to prove that all the accesses happen
      within given memory bounds. Those extraneous zero-extending bit shifts,
      inserted by Clang (and quite often interleaved with other code, which makes
      the issues even more challenging and sometimes requires employing extra
      per-variable compiler barriers), throw off the verifier logic and make it
      mark registers as having unknown variable offset. We'll study this pattern
      a bit later below.
      
      Another common pattern is to check return of BPF helper for non-zero state to
      detect error conditions and attempt alternative actions in such case. Even in
      this simple and straightforward case, this 32-bit vs BPF's native 64-bit mode
      quite often leads to sub-optimal and unnecessary extra code. We'll look at
      this pattern as well.
      
      Clang's BPF target supports two modes of code generation: ALU32, in which it
      is capable of using lower 32-bit parts of registers, and no-ALU32, in which
      only full 64-bit registers are being used. ALU32 mode somewhat mitigates the
      above described problems, but not in all cases.
      
      This patch switches all the cases in which BPF helpers return 0 or negative
      error from returning int to returning long. It is shown below that such a
      change in definition leads to equivalent or better code. No-ALU32 mode
      benefits more, but ALU32 mode either doesn't degrade or also gets improved
      code generation.
      
      Another class of cases switched from int to long are bpf_probe_read_str()-like
      helpers, which encode successful case as non-negative values, while still
      returning negative value for errors.
      
      In all such cases, correctness is preserved due to two's complement
      encoding of negative values and the fact that all helpers return values
      whose absolute value fits in 32 bits. Two's complement ensures that for
      negative values the upper 32 bits are all ones, so truncation leaves a
      valid negative 32-bit number with the same value. Non-negative values have
      their upper 32 bits set to zero and similarly preserve their value when the
      high 32 bits are truncated. This means that just casting to int/u32 is
      correct and efficient (and in ALU32 mode doesn't require any extra shifts).
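      The truncation argument can be checked with a few lines of userspace C. This
      is only a sketch; truncate_ret() and upper32() are illustrative names, not
      kernel or libbpf functions.

```c
#include <stdint.h>

/* Truncate a 64-bit helper return value to 32 bits, as storing it in an
 * int would. Well-defined for any value that fits in a signed 32-bit int.
 */
static int32_t truncate_ret(int64_t ret)
{
	return (int32_t)ret;
}

/* Upper 32 bits of the 64-bit value, to inspect its two's complement
 * encoding: all ones for small negative values, all zeros for small
 * non-negative ones.
 */
static uint32_t upper32(int64_t ret)
{
	return (uint32_t)((uint64_t)ret >> 32);
}
```

      For example, -14 (a typical negative errno) has upper32() equal to
      0xffffffff and truncates back to -14, while 256 has upper32() equal to 0 and
      truncates back to 256.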
      
      To minimize the chances of regressions, two code patterns were investigated,
      as mentioned above. For both patterns, BPF assembly was analyzed in
      ALU32/NO-ALU32 compiler modes, both with current 32-bit int return type and
      new 64-bit long return type.
      
      Case 1. Variable-length data reading and concatenation. This is a quite
      ubiquitous pattern in tracing/monitoring applications that read data like a
      process's environment variables, file paths, etc. In such a case, many pieces
      of string-like variable-length data are read into a single big buffer, and at
      the end of the process, only the part of the array containing actual data is
      sent to user-space for further processing. This case is tested in the
      test_varlen.c selftest (in the next patch). Code flow is roughly as follows:
      
        void *payload = &sample->payload;
        u64 len;
      
        len = bpf_probe_read_kernel_str(payload, MAX_SZ1, &source_data1);
        if (len <= MAX_SZ1) {
            payload += len;
            sample->len1 = len;
        }
        len = bpf_probe_read_kernel_str(payload, MAX_SZ2, &source_data2);
        if (len <= MAX_SZ2) {
            payload += len;
            sample->len2 = len;
        }
        /* and so on */
        sample->total_len = payload - &sample->payload;
        /* send over, e.g., perf buffer */
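      
      For illustration only, the flow above can be mimicked in plain userspace C.
      In this sketch, struct sample's layout, the MAX_SZ1/MAX_SZ2 values and
      fake_read_str() (a stand-in for bpf_probe_read_kernel_str()) are all
      assumptions; the point is how the unsigned `len <= MAX_SZ` check rejects
      negative error codes and bounds the offset at the same time.

```c
#include <errno.h>
#include <stdint.h>

#define MAX_SZ1 256	/* assumed limits, matching the 256 seen in the asm */
#define MAX_SZ2 256

/* Assumed layout: per-piece lengths plus one shared payload buffer. */
struct sample {
	uint64_t len1, len2, total_len;
	char payload[MAX_SZ1 + MAX_SZ2];
};

/* Hypothetical stand-in for bpf_probe_read_kernel_str(): returns the
 * number of bytes written (including the NUL) or a negative errno.
 */
static long fake_read_str(void *dst, uint32_t size, const char *src)
{
	char *d = dst;
	uint32_t n = 0;

	if (!src || size == 0)
		return -EFAULT;
	while (n < size - 1 && src[n]) {
		d[n] = src[n];
		n++;
	}
	d[n] = '\0';
	return (long)n + 1;
}

static uint64_t fill_sample(struct sample *s, const char *src1, const char *src2)
{
	char *payload = s->payload;
	uint64_t len;

	s->len1 = 0;
	s->len2 = 0;

	/* A negative return becomes a huge u64 and fails the check below. */
	len = (uint64_t)fake_read_str(payload, MAX_SZ1, src1);
	if (len <= MAX_SZ1) {
		payload += len;
		s->len1 = len;
	}
	len = (uint64_t)fake_read_str(payload, MAX_SZ2, src2);
	if (len <= MAX_SZ2) {
		payload += len;
		s->len2 = len;
	}
	/* Only this many bytes of payload would need to be sent out. */
	s->total_len = (uint64_t)(payload - s->payload);
	return s->total_len;
}
```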
      
      There could be two variations with slightly different code generated: when len
      is 64-bit integer and when it is 32-bit integer. Both variations were analysed.
      BPF assembly instructions between two successive invocations of
      bpf_probe_read_kernel_str() were used to check code regressions. Results are
      below, followed by short analysis. Left side is using helpers with int return
      type, the right one is after the switch to long.
      
      ALU32 + INT                                ALU32 + LONG
      ===========                                ============
      
      64-BIT (13 insns):                         64-BIT (10 insns):
      ------------------------------------       ------------------------------------
        17:   call 115                             17:   call 115
        18:   if w0 > 256 goto +9 <LBB0_4>         18:   if r0 > 256 goto +6 <LBB0_4>
        19:   w1 = w0                              19:   r1 = 0 ll
        20:   r1 <<= 32                            21:   *(u64 *)(r1 + 0) = r0
        21:   r1 s>>= 32                           22:   r6 = 0 ll
        22:   r2 = 0 ll                            24:   r6 += r0
        24:   *(u64 *)(r2 + 0) = r1              00000000000000c8 <LBB0_4>:
        25:   r6 = 0 ll                            25:   r1 = r6
        27:   r6 += r1                             26:   w2 = 256
      00000000000000e0 <LBB0_4>:                   27:   r3 = 0 ll
        28:   r1 = r6                              29:   call 115
        29:   w2 = 256
        30:   r3 = 0 ll
        32:   call 115
      
      32-BIT (11 insns):                         32-BIT (12 insns):
      ------------------------------------       ------------------------------------
        17:   call 115                             17:   call 115
        18:   if w0 > 256 goto +7 <LBB1_4>         18:   if w0 > 256 goto +8 <LBB1_4>
        19:   r1 = 0 ll                            19:   r1 = 0 ll
        21:   *(u32 *)(r1 + 0) = r0                21:   *(u32 *)(r1 + 0) = r0
        22:   w1 = w0                              22:   r0 <<= 32
        23:   r6 = 0 ll                            23:   r0 >>= 32
        25:   r6 += r1                             24:   r6 = 0 ll
      00000000000000d0 <LBB1_4>:                   26:   r6 += r0
        26:   r1 = r6                            00000000000000d8 <LBB1_4>:
        27:   w2 = 256                             27:   r1 = r6
        28:   r3 = 0 ll                            28:   w2 = 256
        30:   call 115                             29:   r3 = 0 ll
                                                   31:   call 115
      
      In ALU32 mode, the variant using a 64-bit length variable clearly wins and
      avoids unnecessary zero-extension bit shifts. In practice, this matters even
      more, because BPF code won't need to do extra checks to "prove" that
      payload/len are within good bounds.
      
      With a 32-bit len, the long variant is one instruction longer. Clang decided
      to do the 64-to-32 cast with two bit shifts, instead of the equivalent
      `w1 = w0` assignment used in the int variant. The `w1 = w0` form uses an
      extra register. The shift form might potentially lose some range
      information, but not for a 32-bit value. So in this case, the verifier
      infers that r0 is in [0, 256] after the check at 18:, and shifting 32 bits
      left/right keeps that range intact. We should probably look into Clang's
      logic and see why it chooses bit shifts over sub-register assignments here.
      
      NO-ALU32 + INT                             NO-ALU32 + LONG
      ==============                             ===============
      
      64-BIT (14 insns):                         64-BIT (10 insns):
      ------------------------------------       ------------------------------------
        17:   call 115                             17:   call 115
        18:   r0 <<= 32                            18:   if r0 > 256 goto +6 <LBB0_4>
        19:   r1 = r0                              19:   r1 = 0 ll
        20:   r1 >>= 32                            21:   *(u64 *)(r1 + 0) = r0
        21:   if r1 > 256 goto +7 <LBB0_4>         22:   r6 = 0 ll
        22:   r0 s>>= 32                           24:   r6 += r0
        23:   r1 = 0 ll                          00000000000000c8 <LBB0_4>:
        25:   *(u64 *)(r1 + 0) = r0                25:   r1 = r6
        26:   r6 = 0 ll                            26:   r2 = 256
        28:   r6 += r0                             27:   r3 = 0 ll
      00000000000000e8 <LBB0_4>:                   29:   call 115
        29:   r1 = r6
        30:   r2 = 256
        31:   r3 = 0 ll
        33:   call 115
      
      32-BIT (13 insns):                         32-BIT (13 insns):
      ------------------------------------       ------------------------------------
        17:   call 115                             17:   call 115
        18:   r1 = r0                              18:   r1 = r0
        19:   r1 <<= 32                            19:   r1 <<= 32
        20:   r1 >>= 32                            20:   r1 >>= 32
        21:   if r1 > 256 goto +6 <LBB1_4>         21:   if r1 > 256 goto +6 <LBB1_4>
        22:   r2 = 0 ll                            22:   r2 = 0 ll
        24:   *(u32 *)(r2 + 0) = r0                24:   *(u32 *)(r2 + 0) = r0
        25:   r6 = 0 ll                            25:   r6 = 0 ll
        27:   r6 += r1                             27:   r6 += r1
      00000000000000e0 <LBB1_4>:                 00000000000000e0 <LBB1_4>:
        28:   r1 = r6                              28:   r1 = r6
        29:   r2 = 256                             29:   r2 = 256
        30:   r3 = 0 ll                            30:   r3 = 0 ll
        32:   call 115                             32:   call 115
      
      In NO-ALU32 mode, for the 64-bit len variable, Clang generates far
      superior code, as expected, eliminating unnecessary bit shifts. For 32-bit
      len, the code is identical.
      
      So overall, only the ALU32 32-bit len case is more-or-less equivalent, and
      the difference stems from an internal Clang decision rather than from the
      compiler lacking enough information about types.
      
      Case 2. Let's look at the simpler case of checking return result of BPF helper
      for errors. The code is very simple:
      
        long bla;
        if (bpf_probe_read_kernel(&bla, sizeof(bla), 0))
            return 1;
        else
            return 0;
      
      ALU32 + CHECK (9 insns)                    ALU32 + CHECK (9 insns)
      ====================================       ====================================
        0:    r1 = r10                             0:    r1 = r10
        1:    r1 += -8                             1:    r1 += -8
        2:    w2 = 8                               2:    w2 = 8
        3:    r3 = 0                               3:    r3 = 0
        4:    call 113                             4:    call 113
        5:    w1 = w0                              5:    r1 = r0
        6:    w0 = 1                               6:    w0 = 1
        7:    if w1 != 0 goto +1 <LBB2_2>          7:    if r1 != 0 goto +1 <LBB2_2>
        8:    w0 = 0                               8:    w0 = 0
      0000000000000048 <LBB2_2>:                 0000000000000048 <LBB2_2>:
        9:    exit                                 9:    exit
      
      Almost identical code, the only difference is the use of full register
      assignment (r1 = r0) vs half-registers (w1 = w0) in instruction #5. On 32-bit
      architectures, new BPF assembly might be slightly less optimal, in theory. But
      one can argue that's not a big issue, given that use of full registers is
      still prevalent (e.g., for parameter passing).
      
      NO-ALU32 + CHECK (11 insns)                NO-ALU32 + CHECK (9 insns)
      ====================================       ====================================
        0:    r1 = r10                             0:    r1 = r10
        1:    r1 += -8                             1:    r1 += -8
        2:    r2 = 8                               2:    r2 = 8
        3:    r3 = 0                               3:    r3 = 0
        4:    call 113                             4:    call 113
        5:    r1 = r0                              5:    r1 = r0
        6:    r1 <<= 32                            6:    r0 = 1
        7:    r1 >>= 32                            7:    if r1 != 0 goto +1 <LBB2_2>
        8:    r0 = 1                               8:    r0 = 0
        9:    if r1 != 0 goto +1 <LBB2_2>        0000000000000048 <LBB2_2>:
       10:    r0 = 0                               9:    exit
      0000000000000058 <LBB2_2>:
       11:    exit
      
      NO-ALU32 is a clear improvement, getting rid of unnecessary zero-extension bit
      shifts.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20200623032224.4020118-1-andriin@fb.com
      bdb7b79b
    • Merge branch 'bpftool-show-pid' · b3eece09
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      This patch set implements libbpf support for a second kind of special externs,
      kernel symbols, in addition to existing Kconfig externs.
      
      Right now, only untyped (const void) externs are supported, which, in C,
      only allow taking their address. In the future, with kernel BTF getting type
      info about its own global and per-cpu variables, libbpf will extend this
      support with BTF type info, which will also allow directly accessing a
      variable's contents and following its internal pointers, similarly to how
      it's possible today in fentry/fexit programs.
      
      As a first practical use of this functionality, bpftool gained ability to show
      PIDs of processes that have open file descriptors for BPF map/program/link/BTF
      object. It relies on iter/task_file BPF iterator program to extract this
      information efficiently.
      
      There was a bunch of bpftool refactoring (especially Makefile) necessary to
      generalize bpftool's internal BPF program use. This includes generalization of
      BPF skeletons support, addition of a vmlinux.h generation, extracting and
      building minimal subset of bpftool for bootstrapping.
      
      v2->v3:
      - fix sec_btf_id check (Hao);
      
      v1->v2:
      - docs fixes (Quentin);
      - dual GPL/BSD license for pid_iter.bpf.c (Quentin);
      - NULL-init kcfg_data (Hao Luo);
      
      rfc->v1:
      - show pids, if supported by kernel, always (Alexei);
      - switched iter output to binary to support showing process names;
      - update man pages;
      - fix few minor bugs in libbpf w.r.t. extern iteration.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b3eece09
    • tools/bpftool: Add documentation and sample output for process info · 075c7766
      Andrii Nakryiko authored
      Document bpftool's ability to discover information about processes holding
      a reference to a BPF map, prog, link, or BTF object. Show example output as
      well.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-10-andriin@fb.com
      075c7766
    • tools/bpftool: Show info for processes holding BPF map/prog/link/btf FDs · d53dee3f
      Andrii Nakryiko authored
      Add bpf_iter-based way to find all the processes that hold open FDs against
      BPF object (map, prog, link, btf). bpftool always attempts to discover this,
      but will silently give up if kernel doesn't yet support bpf_iter BPF programs.
      Process name and PID are emitted for each process (task group).
      
      Sample output for each of 4 BPF objects:
      
      $ sudo ./bpftool prog show
      2694: cgroup_device  tag 8c42dee26e8cd4c2  gpl
              loaded_at 2020-06-16T15:34:32-0700  uid 0
              xlated 648B  jited 409B  memlock 4096B
              pids systemd(1)
      2907: cgroup_skb  name egress  tag 9ad187367cf2b9e8  gpl
              loaded_at 2020-06-16T18:06:54-0700  uid 0
              xlated 48B  jited 59B  memlock 4096B  map_ids 2436
              btf_id 1202
              pids test_progs(2238417), test_progs(22384459)
      
      $ sudo ./bpftool map show
      2436: array  name test_cgr.bss  flags 0x400
              key 4B  value 8B  max_entries 1  memlock 8192B
              btf_id 1202
              pids test_progs(2238417), test_progs(22384459)
      2445: array  name pid_iter.rodata  flags 0x480
              key 4B  value 4B  max_entries 1  memlock 8192B
              btf_id 1214  frozen
              pids bpftool(2239612)
      
      $ sudo ./bpftool link show
      61: cgroup  prog 2908
              cgroup_id 375301  attach_type egress
              pids test_progs(2238417), test_progs(22384459)
      62: cgroup  prog 2908
              cgroup_id 375344  attach_type egress
              pids test_progs(2238417), test_progs(22384459)
      
      $ sudo ./bpftool btf show
      1202: size 1527B  prog_ids 2908,2907  map_ids 2436
              pids test_progs(2238417), test_progs(22384459)
      1242: size 34684B
              pids bpftool(2258892)
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-9-andriin@fb.com
      d53dee3f
    • libbpf: Wrap source argument of BPF_CORE_READ macro in parentheses · bd9bedf8
      Andrii Nakryiko authored
      Wrap source argument of BPF_CORE_READ family of macros into parentheses to
      allow uses like this:
      
      BPF_CORE_READ((struct cast_struct *)src, a, b, c);
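      
      Why the parentheses matter can be shown with a toy macro. TOY_READ_ABC below
      is NOT libbpf's BPF_CORE_READ, and the struct definitions are invented for
      this sketch; it only demonstrates the operator-precedence issue with a cast
      expression as the source argument.

```c
/* Invented types for illustration: cast_struct -> a -> b -> c. */
struct lvl_b { int c; };
struct lvl_a { struct lvl_b *b; };
struct cast_struct { struct lvl_a *a; };

/* Toy macro that follows src->a->b->c.
 *
 * If it were written as `src->a->b->c` (no parentheses around src), then
 * TOY_READ_ABC((struct cast_struct *)p) would expand to
 *     (struct cast_struct *)p->a->b->c
 * where -> binds tighter than the cast, so the cast would apply to the
 * final result instead of to p, and a void *p would not even compile.
 */
#define TOY_READ_ABC(src) ((src)->a->b->c)

static int read_c(const void *p)
{
	/* The cast expression is passed directly as the macro argument. */
	return TOY_READ_ABC((const struct cast_struct *)p);
}
```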
      
      Fixes: 7db3822a ("libbpf: Add BPF_CORE_READ/BPF_CORE_READ_INTO helpers")
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-8-andriin@fb.com
      bd9bedf8
    • tools/bpftool: Generalize BPF skeleton support and generate vmlinux.h · 05aca6da
      Andrii Nakryiko authored
      Adapt Makefile to support BPF skeleton generation beyond single profiler.bpf.c
      case. Also add vmlinux.h generation and switch profiler.bpf.c to use it.
      
      clang-bpf-global-var feature is extended and renamed to clang-bpf-co-re to
      check for support of preserve_access_index attribute, which, together with BTF
      for global variables, is the minimum requirement for modern BPF programs.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-7-andriin@fb.com
      05aca6da
    • tools/bpftool: Minimize bootstrap bpftool · 16e9b187
      Andrii Nakryiko authored
      Build a minimal "bootstrap mode" bpftool to enable skeleton (and, later,
      vmlinux.h) generation, instead of building an almost complete, but slightly
      different (w/o skeletons, etc.) bpftool to bootstrap the complete bpftool
      build.
      
      The current approach doesn't scale well (engineering-wise) when adding more
      BPF programs to bpftool and other complicated functionality, as it requires
      constantly adjusting the code to work in both bootstrapped mode and normal
      mode.
      
      So it's better to build only a minimal bpftool version that supports only BPF
      skeleton code generation and BTF-to-C conversion. Thankfully, this is quite
      easy to accomplish due to the internal modularity of bpftool commands. This
      will also allow adding new functionality to bpftool in general, without the
      need to care about bootstrap mode for those new parts of bpftool.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-6-andriin@fb.com
      16e9b187
    • tools/bpftool: Move map/prog parsing logic into common · a479b8ce
      Andrii Nakryiko authored
      Move functions that parse map and prog by id/tag/name/etc outside of
      map.c/prog.c, respectively. These functions are used outside of those files
      and are generic enough to be in common. This also makes heavy-weight map.c and
      prog.c more decoupled from the rest of bpftool files and facilitates more
      lightweight bootstrap bpftool variant.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Quentin Monnet <quentin@isovalent.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-5-andriin@fb.com
      a479b8ce
    • selftests/bpf: Add __ksym extern selftest · b7ddfab2
      Andrii Nakryiko authored
      Validate libbpf is able to handle weak and strong kernel symbol externs in BPF
      code correctly.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-4-andriin@fb.com
      b7ddfab2
    • libbpf: Add support for extracting kernel symbol addresses · 1c0c7074
      Andrii Nakryiko authored
      Add support for another special kind of externs in BPF code (in addition to
      the existing Kconfig externs): kernel symbol externs. Such externs allow BPF
      code to "know" a kernel symbol's address and either use it for comparisons
      with kernel data structures (e.g., struct file's f_op pointer, to
      distinguish different kinds of files), or, with the help of
      bpf_probe_read_kernel(), to follow pointers and read data from global
      variables. Kernel symbol addresses are found through /proc/kallsyms, which
      should be present in the system.
      
      Currently, such kernel symbol variables are typeless: they have to be defined
      as `extern const void <symbol>` and the only operation you can do (in C code)
      with them is to take their address. Such externs should reside in a special
      section '.ksyms'. The bpf_helpers.h header provides the __ksym macro for this.
      Strong vs weak semantics stay the same as with Kconfig externs. If a symbol is
      not found in /proc/kallsyms, this is a failure for a strong (non-weak) extern,
      but the address defaults to 0 for weak externs.
      
      If the same symbol is defined multiple times in /proc/kallsyms, it is an
      error if any of the associated addresses differ. In that case, the address is
      ambiguous, so libbpf errs on the side of caution rather than confusing the
      user with a randomly chosen address.
      
      In the future, once the kernel is extended with BTF information for
      variables, such ksym externs will be supported in a typed version, which will
      allow a BPF program to read a variable's contents directly, similarly to how
      it's done for fentry/fexit input arguments.
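      
      The resolution rules described above (strong vs weak, ambiguous duplicates)
      can be sketched in userspace C. This is not libbpf's code: resolve_ksym(),
      its signature, and the specific error codes (-ESRCH, -EINVAL) are
      assumptions made for this example, and it parses kallsyms-formatted text
      from a string rather than reading /proc/kallsyms.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Look up sym in text formatted like /proc/kallsyms, one
 * "<hex address> <type> <name>" entry per line.
 *
 * Returns 0 and stores the address on success. A missing weak symbol
 * resolves to address 0; a missing strong symbol fails with -ESRCH;
 * a symbol listed with differing addresses fails with -EINVAL.
 */
static int resolve_ksym(const char *text, const char *sym, int is_weak,
			unsigned long long *addr)
{
	unsigned long long cur, found_addr = 0;
	const char *line = text;
	char type, name[256];
	int found = 0;

	while (line && *line) {
		if (sscanf(line, "%llx %c %255s", &cur, &type, name) == 3 &&
		    strcmp(name, sym) == 0) {
			if (found && cur != found_addr)
				return -EINVAL;	/* ambiguous address */
			found = 1;
			found_addr = cur;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}

	if (!found && !is_weak)
		return -ESRCH;	/* strong extern must resolve */

	*addr = found ? found_addr : 0;
	return 0;
}
```

      A duplicate entry with the same address is accepted, since it is not
      ambiguous; only differing addresses are treated as an error.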
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-3-andriin@fb.com
      1c0c7074
    • libbpf: Generalize libbpf externs support · 2e33efe3
      Andrii Nakryiko authored
      Switch existing Kconfig externs to be just one of few possible kinds of more
      generic externs. This refactoring is in preparation for ksymbol extern
      support, added in the follow up patch. There are no functional changes
      intended.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Hao Luo <haoluo@google.com>
      Link: https://lore.kernel.org/bpf/20200619231703.738941-2-andriin@fb.com
      2e33efe3
  4. 22 Jun, 2020 6 commits
    • libbpf: Add a bunch of attribute getters/setters for map definitions · 1bdb6c9a
      Andrii Nakryiko authored
      Add a bunch of getters for various aspects of a BPF map. Some of these
      attributes (e.g., key_size, value_size, type, etc.) are available right now
      in struct bpf_map_def, but this patch adds getters that allow fetching them
      individually. The bpf_map_def approach isn't very scalable when ABI stability
      requirements are taken into account. It's much easier to extend libbpf and
      add support for new features when each aspect of a BPF map has a separate
      getter/setter.
      
      Getters follow the common naming convention of not explicitly having "get" in
      their names: bpf_map__type() returns the map type, bpf_map__key_size() returns
      key_size. Setters, though, explicitly have "set" in their names:
      bpf_map__set_type(), bpf_map__set_key_size().
      
      This patch ensures we now have a getter and a setter for the following
      map attributes:
        - type;
        - max_entries;
        - map_flags;
        - numa_node;
        - key_size;
        - value_size;
        - ifindex.
      
      bpf_map__resize() enforces unnecessary restriction of max_entries > 0. It is
      unnecessary, because libbpf actually supports zero max_entries for some cases
      (e.g., for PERF_EVENT_ARRAY map) and treats it specially during map creation
      time. To allow setting max_entries=0, new bpf_map__set_max_entries() setter is
      added. bpf_map__resize()'s behavior is preserved for backwards compatibility
      reasons.
      
      A map ifindex getter is added as well. There is a setter already, but no
      corresponding getter; fix this asymmetry. bpf_map__set_ifindex() itself is
      converted from a void function into an error-returning one, similar to other
      setters. The only error returned right now is -EBUSY, if the BPF map is
      already loaded and has a corresponding FD.
      
      One attribute with no ability to get/set it or even specify it declaratively
      is numa_node. This patch fixes this gap and adds both a programmatic
      getter/setter and support for the numa_node field in BTF-defined maps.
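      
      The getter/setter convention and the -EBUSY rule can be modelled with a toy
      type. Everything below (struct toy_map and the toy_map__* functions) is
      hypothetical and only mirrors the behaviour described in this message; it is
      not libbpf's struct bpf_map or its implementation.

```c
#include <errno.h>

/* Toy model of a map object: fd < 0 means "not loaded yet". */
struct toy_map {
	int fd;
	unsigned int max_entries;
};

/* Getter: no "get" in the name, mirroring the convention above. */
static unsigned int toy_map__max_entries(const struct toy_map *map)
{
	return map->max_entries;
}

/* Setter: has "set" in the name and returns an error code.
 * Zero is accepted, unlike the max_entries > 0 restriction that
 * bpf_map__resize() is described as enforcing.
 */
static int toy_map__set_max_entries(struct toy_map *map, unsigned int n)
{
	if (map->fd >= 0)
		return -EBUSY;	/* already loaded, attributes are frozen */
	map->max_entries = n;
	return 0;
}
```

      Keeping one small accessor per attribute means new attributes can be added
      later without changing the layout of a struct that callers depend on.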
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20200621062112.3006313-1-andriin@fb.com
      1bdb6c9a
    • selftests/bpf: Test access to bpf map pointer · b1b53d41
      Andrey Ignatov authored
      Add selftests to test access to map pointers from bpf program for all
      map types except struct_ops (that one would need additional work).
      
      The verifier test focuses mostly on scenarios that must be rejected.
      
      The prog_tests test focuses on accessing multiple fields, both scalar and
      nested-struct, from a bpf program and verifies that those fields have the
      expected values.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/139a6a17f8016491e39347849b951525335c6eb4.1592600985.git.rdna@fb.com
      b1b53d41
    • Andrey Ignatov's avatar
      bpf: Set map_btf_{name, id} for all map types · 2872e9ac
      Andrey Ignatov authored
      Set map_btf_name and map_btf_id for all map types so that map fields can
      be accessed by bpf programs.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/a825f808f22af52b018dbe82f1c7d29dab5fc978.1592600985.git.rdna@fb.com
      2872e9ac
    • Andrey Ignatov's avatar
      bpf: Support access to bpf map fields · 41c48f3a
      Andrey Ignatov authored
      There are multiple use-cases when it's convenient to have access to bpf
      map fields, both `struct bpf_map` and map type specific struct-s such as
      `struct bpf_array`, `struct bpf_htab`, etc.
      
      For example, while working with sock arrays it can be necessary to
      calculate the key based on map->max_entries (some_hash % max_entries).
      Currently this is solved by communicating max_entries via an "out-of-band"
      channel, e.g. via an additional map with a known key to get info about the
      target map. That works, but it is inconvenient and error-prone when
      working with many maps.
      
      In other cases the necessary data is dynamic (i.e. unknown at loading time)
      and it's impossible to get it at all. For example, while working with a
      hash table it can be convenient to know how much capacity is already
      used (bpf_htab.count.counter for the BPF_F_NO_PREALLOC case).
      
      At the same time, the kernel knows this info and can provide it to the bpf
      program.
      
      Fill this gap by adding support to access bpf map fields from bpf
      program for both `struct bpf_map` and map type specific fields.
      
      Support is implemented via btf_struct_access() so that a user can define
      their own `struct bpf_map` or map type specific struct in their program
      with only necessary fields and preserve_access_index attribute, cast a
      map to this struct and use a field.
      
      For example:
      
      	struct bpf_map {
      		__u32 max_entries;
      	} __attribute__((preserve_access_index));
      
      	struct bpf_array {
      		struct bpf_map map;
      		__u32 elem_size;
      	} __attribute__((preserve_access_index));
      
      	struct {
      		__uint(type, BPF_MAP_TYPE_ARRAY);
      		__uint(max_entries, 4);
      		__type(key, __u32);
      		__type(value, __u32);
      	} m_array SEC(".maps");
      
      	SEC("cgroup_skb/egress")
      	int cg_skb(void *ctx)
      	{
      		struct bpf_array *array = (struct bpf_array *)&m_array;
      		struct bpf_map *map = (struct bpf_map *)&m_array;
      
      		/* .. use map->max_entries or array->map.max_entries .. */
      		return 1;	/* placeholder verdict so the example has a return value */
      	}
      
      Similarly to other btf_struct_access() use-cases (e.g. struct tcp_sock
      in net/ipv4/bpf_tcp_ca.c) the patch allows access to any fields of
      corresponding struct. Only reading from map fields is supported.
      
      For btf_struct_access() to work there should be a way to know btf id of
      a struct that corresponds to a map type. To get btf id there should be a
      way to get a stringified name of map-specific struct, such as
      "bpf_array", "bpf_htab", etc for a map type. Two new fields are added to
      `struct bpf_map_ops` to handle it:
      * .map_btf_name keeps a btf name of a struct returned by map_alloc();
      * .map_btf_id is used to cache btf id of that struct.
      
      To make btf id calculation cheaper, the ids are calculated once while
      preparing btf_vmlinux and cached the same way as is done for the btf_id field
      of `struct bpf_func_proto`.
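
The resolve-once-and-cache idea can be sketched in user-space C. Everything here is hypothetical and simplified: `toy_types` stands in for vmlinux BTF, `toy_map_ops` for the two new `struct bpf_map_ops` fields, and `toy_resolve_ids()` for the one-time pass at btf_vmlinux preparation. It is not the kernel implementation.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified type table standing in for vmlinux BTF. */
struct toy_type {
	const char *name;
};

static const struct toy_type toy_types[] = {
	{ "void" },       /* id 0 is treated as reserved in this sketch */
	{ "bpf_map" },    /* id 1 */
	{ "bpf_array" },  /* id 2 */
	{ "bpf_htab" },   /* id 3 */
};

/* Hypothetical counterpart of the two new bpf_map_ops fields. */
struct toy_map_ops {
	const char *map_btf_name; /* name of the struct returned by map_alloc() */
	int map_btf_id;           /* cached id; 0 means "not resolved" */
};

static int toy_lookups;           /* counts how many name searches ran */

/* Linear search by name; returns the id, or -1 if the name is unknown. */
static int toy_find_by_name(const char *name)
{
	size_t i;

	toy_lookups++;
	for (i = 1; i < sizeof(toy_types) / sizeof(toy_types[0]); i++)
		if (strcmp(toy_types[i].name, name) == 0)
			return (int)i;
	return -1;
}

/* Resolve every named entry once; later users read map_btf_id directly. */
static void toy_resolve_ids(struct toy_map_ops *ops, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ops[i].map_btf_name)
			ops[i].map_btf_id = toy_find_by_name(ops[i].map_btf_name);
}
```

After a single call to `toy_resolve_ids()`, each entry with a name carries its id, and no further string comparisons are needed on the access path. Entries without a name are skipped and keep id 0.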
      
      While calculating btf ids, struct names are NOT checked for collisions.
      Collisions will be checked as part of the work to prepare btf ids used
      in the verifier at compile time, which should land soon. The only known
      collision, for `struct bpf_htab` (kernel/bpf/hashtab.c vs
      net/core/sock_map.c), was fixed earlier.
      
      Both new fields, .map_btf_name and .map_btf_id, must be set for a map type
      for the feature to work. If neither is set for a map type, the verifier will
      return ENOTSUPP on an attempt to access a map_ptr of the corresponding type. If
      just one of them is set, it's a verifier misconfiguration.
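
The three cases above (both set, neither set, exactly one set) can be written as a small classifier. This is a sketch under stated assumptions: `toy_map_ops`, `toy_access` and `toy_check_map_ops()` are hypothetical names, and the enum values replace real error codes because the message only names ENOTSUPP for the "neither set" case.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counterpart of the two new bpf_map_ops fields. */
struct toy_map_ops {
	const char *map_btf_name;
	const int *map_btf_id;
};

enum toy_access {
	TOY_ACCESS_OK,             /* both fields set: feature works */
	TOY_ACCESS_NOT_SUPPORTED,  /* neither set: the ENOTSUPP case */
	TOY_ACCESS_MISCONFIGURED,  /* exactly one set: misconfiguration */
};

/* Classify a map type by which of the two fields are populated. */
static enum toy_access toy_check_map_ops(const struct toy_map_ops *ops)
{
	bool has_name = ops->map_btf_name != NULL;
	bool has_id = ops->map_btf_id != NULL;

	if (has_name && has_id)
		return TOY_ACCESS_OK;
	if (!has_name && !has_id)
		return TOY_ACCESS_NOT_SUPPORTED;
	return TOY_ACCESS_MISCONFIGURED;
}
```

Keeping "not supported" separate from "misconfigured" lets a caller report an unsupported map type to the user while treating a half-populated entry as a kernel-side bug.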
      
      Only `struct bpf_array` for BPF_MAP_TYPE_ARRAY and `struct bpf_htab` for
      BPF_MAP_TYPE_HASH are supported by this patch. Other map types will be
      supported separately.
      
      The feature is available only for CONFIG_DEBUG_INFO_BTF=y and gated by
      perfmon_capable() so that unpriv programs won't have access to bpf map
      fields.
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/6479686a0cd1e9067993df57b4c3eef0e276fec9.1592600985.git.rdna@fb.com
      41c48f3a
    • Andrey Ignatov's avatar
      bpf: Rename bpf_htab to bpf_shtab in sock_map · 032a6b35
      Andrey Ignatov authored
      There are two different `struct bpf_htab` in bpf code in the following
      files:
      - kernel/bpf/hashtab.c
      - net/core/sock_map.c
      
      This makes it impossible to find the proper btf_id by name = "bpf_htab" and
      kind = BTF_KIND_STRUCT, which is needed to support access to the map ptr so
      that a bpf program can access `struct bpf_htab` fields.
      
      To make it possible, one of the structs should be renamed. sock_map.c
      looks like the better candidate for the rename since it's a specialized version
      of hashtab.
      
      Rename it to bpf_shtab ("sh" stands for Sock Hash).
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/c006a639e03c64ca50fc87c4bb627e0bfba90f4e.1592600985.git.rdna@fb.com
      032a6b35
    • Andrey Ignatov's avatar
      bpf: Switch btf_parse_vmlinux to btf_find_by_name_kind · a2d0d62f
      Andrey Ignatov authored
      btf_parse_vmlinux() implements a manual search for struct bpf_ctx_convert
      since btf_find_by_name_kind() was not available at the time it was
      written.
      
      Later btf_find_by_name_kind() was introduced in 27ae7997 ("bpf:
      Introduce BPF_PROG_TYPE_STRUCT_OPS"). It provides similar search
      functionality and can be leveraged in btf_parse_vmlinux(). Do it.
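
A lookup keyed on both name and kind can be sketched as a linear scan. The table and names below (`toy_btf_types`, `toy_kind`, `toy_find_by_name_kind()`) are invented for illustration and do not reproduce the kernel's BTF structures. The sketch returns a negative errno when nothing matches, which is an assumption about the convention rather than a quote of the kernel code.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

enum toy_kind { TOY_KIND_INT, TOY_KIND_TYPEDEF, TOY_KIND_STRUCT };

/* Hypothetical, simplified type record. */
struct toy_btf_type {
	const char *name;
	enum toy_kind kind;
};

/* Invented table; "sock" appears twice with different kinds on purpose. */
static const struct toy_btf_type toy_btf_types[] = {
	{ "void", TOY_KIND_INT },                /* id 0: reserved, never matched */
	{ "u32", TOY_KIND_INT },                 /* id 1 */
	{ "sock", TOY_KIND_TYPEDEF },            /* id 2 */
	{ "sock", TOY_KIND_STRUCT },             /* id 3 */
	{ "bpf_ctx_convert", TOY_KIND_STRUCT },  /* id 4 */
};

/* Return the first id whose kind and name both match, else -ENOENT. */
static int toy_find_by_name_kind(const char *name, enum toy_kind kind)
{
	size_t i, n = sizeof(toy_btf_types) / sizeof(toy_btf_types[0]);

	for (i = 1; i < n; i++) {
		if (toy_btf_types[i].kind != kind)
			continue;
		if (strcmp(toy_btf_types[i].name, name) == 0)
			return (int)i;
	}
	return -ENOENT;
}
```

Matching on the kind as well as the name is what lets a struct and a typedef share a name without ambiguity. Two structs with the same name would still collide, since the scan simply returns the first match.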
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/6e12d5c3e8a3d552925913ef73a695dd1bb27800.1592600985.git.rdna@fb.com
      a2d0d62f
  5. 19 Jun, 2020 3 commits
  6. 18 Jun, 2020 1 commit
  7. 17 Jun, 2020 4 commits
    • Andrii Nakryiko's avatar
      libbpf: Bump version to 0.1.0 · 7bd3a33a
      Andrii Nakryiko authored
      Bump libbpf version to 0.1.0, as new development cycle starts.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200617183132.1970836-1-andriin@fb.com
      7bd3a33a
    • Andrii Nakryiko's avatar
      bpf: bpf_probe_read_kernel_str() has to return amount of data read on success · 02553b91
      Andrii Nakryiko authored
      During recent refactorings, bpf_probe_read_kernel_str() started returning 0 on
      success, instead of the amount of data successfully read. This breaks
      applications relying on the results of bpf_probe_read_kernel_str() and
      bpf_probe_read_str(). Fix this by returning the actual number of bytes read.
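
Why callers depend on the returned length can be shown with a user-space model. `toy_read_str()` and `toy_concat()` are hypothetical stand-ins, not the kernel helper: the model copies a NUL-terminated string into a bounded buffer and returns the number of bytes written including the trailing NUL. The -EINVAL for a zero-sized buffer is an assumption of the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a *_str() probe-read helper. */
static long toy_read_str(char *dst, size_t size, const char *src)
{
	size_t i;

	if (size == 0)
		return -EINVAL;   /* assumed behavior for this sketch */
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return (long)(i + 1);     /* bytes written, including the NUL */
}

/* Join two strings in buf; the second write offset comes from the first result. */
static long toy_concat(char *buf, size_t size, const char *a, const char *b)
{
	long n1, n2;

	n1 = toy_read_str(buf, size, a);
	if (n1 < 0)
		return n1;
	/* Start at the first string's NUL so that it gets overwritten. */
	n2 = toy_read_str(buf + (n1 - 1), size - (size_t)(n1 - 1), b);
	if (n2 < 0)
		return n2;
	return (n1 - 1) + n2;     /* total bytes used, including the final NUL */
}
```

If the first call returned 0 on success, `toy_concat()` would have no offset for the second write, which is the kind of breakage the commit describes.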
      
      Fixes: 8d92db5c ("bpf: rework the compat kernel probe handling")
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200616050432.1902042-1-andriin@fb.com
      02553b91
    • Linus Torvalds's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 69119673
      Linus Torvalds authored
      Pull networking fixes from David Miller:
      
       1) Don't get per-cpu pointer with preemption enabled in nft_set_pipapo,
          fix from Stefano Brivio.
      
       2) Fix memory leak in ctnetlink, from Pablo Neira Ayuso.
      
       3) Multiple definitions of MPTCP_PM_MAX_ADDR, from Geliang Tang.
      
       4) Accidentally disabling NAPI in non-error paths of macb_open(), from
          Charles Keepax.
      
       5) Fix races between alx_stop and alx_remove, from Zekun Shen.
      
       6) We forget to re-enable SRIOV during resume in bnxt_en driver, from
          Michael Chan.
      
       7) Fix memory leak in ipv6_mc_destroy_dev(), from Wang Hai.
      
       8) rxtx stats use wrong index in mvpp2 driver, from Sven Auhagen.
      
       9) Fix memory leak in mptcp_subflow_create_socket error path, from Wei
          Yongjun.
      
      10) We should not adjust the TCP window advertised when sending dup acks
          in non-SACK mode, because it won't be counted as a dup by the sender
          if the window size changes. From Eric Dumazet.
      
      11) Destroy the right number of queues during remove in mvpp2 driver,
          from Sven Auhagen.
      
      12) Various WOL and PM fixes to e1000 driver, from Chen Yu, Vaibhav
          Gupta, and Arnd Bergmann.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (35 commits)
        e1000e: fix unused-function warning
        e1000: use generic power management
        e1000e: Do not wake up the system via WOL if device wakeup is disabled
        lan743x: add MODULE_DEVICE_TABLE for module loading alias
        mlxsw: spectrum: Adjust headroom buffers for 8x ports
        bareudp: Fixed configuration to avoid having garbage values
        mvpp2: remove module bugfix
        tcp: grow window for OOO packets only for SACK flows
        mptcp: fix memory leak in mptcp_subflow_create_socket()
        netfilter: flowtable: Make nf_flow_table_offload_add/del_cb inline
        net/sched: act_ct: Make tcf_ct_flow_table_restore_skb inline
        net: dsa: sja1105: fix PTP timestamping with large tc-taprio cycles
        mvpp2: ethtool rxtx stats fix
        MAINTAINERS: switch to my private email for Renesas Ethernet drivers
        rocker: fix incorrect error handling in dma_rings_init
        test_objagg: Fix potential memory leak in error handling
        net: ethernet: mtk-star-emac: simplify interrupt handling
        mld: fix memory leak in ipv6_mc_destroy_dev()
        bnxt_en: Return from timer if interface is not in open state.
        bnxt_en: Fix AER reset logic on 57500 chips.
        ...
      69119673
    • Linus Torvalds's avatar
      Merge tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs · 26c20ffc
      Linus Torvalds authored
      Pull AFS fixes from David Howells:
       "I've managed to get xfstests kind of working with afs. Here are a set
        of patches that fix most of the bugs found.
      
        There are a number of primary issues:
      
         - Incorrect handling of mtime and non-handling of ctime. It might be
           argued that the latter isn't a bug since the AFS protocol doesn't
           support ctime, but I should probably still update it locally.
      
         - Shared-write mmap, truncate and writeback bugs. This includes not
           changing i_size under the callback lock, overwriting local i_size
           with the reply from the server after a partial writeback, not
           limiting the writeback from an mmapped page to EOF.
      
         - Checks for an abort code indicating that the primary vnode in an
           operation was deleted by a third-party are done in the wrong place.
      
         - Silly rename bugs. This includes an incomplete conversion to the
           new operation handling, duplicate nlink handling, nlink changing
           not being done inside the callback lock and insufficient handling
           of third-party conflicting directory changes.
      
        And some secondary ones:
      
         - The UAEOVERFLOW abort code should map to EOVERFLOW not EREMOTEIO.
      
         - Remove a couple of unused or incompletely used bits.
      
         - Remove a couple of redundant success checks.
      
        These seem to fix all the data-corruption bugs found by
      
      	./check -afs -g quick
      
        along with the obvious silly rename bugs and time bugs.
      
        There are still some test failures, but they seem to fall into two
        classes: firstly, the authentication/security model is different to
        the standard UNIX model and permission is arbitrated by the server and
        cached locally; and secondly, there are a number of features that AFS
        does not support (such as mknod). But in these cases, the tests
        themselves need to be adapted or skipped.
      
        Using the in-kernel afs client with xfstests also found a bug in the
        AuriStor AFS server that has been fixed for a future release"
      
      * tag 'afs-fixes-20200616' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
        afs: Fix silly rename
        afs: afs_vnode_commit_status() doesn't need to check the RPC error
        afs: Fix use of afs_check_for_remote_deletion()
        afs: Remove afs_operation::abort_code
        afs: Fix yfs_fs_fetch_status() to honour vnode selector
        afs: Remove yfs_fs_fetch_file_status() as it's not used
        afs: Fix the mapping of the UAEOVERFLOW abort code
        afs: Fix truncation issues and mmap writeback size
        afs: Concoct ctimes
        afs: Fix EOF corruption
        afs: afs_write_end() should change i_size under the right lock
        afs: Fix non-setting of mtime when writing into mmap
      26c20ffc