1. 19 Jul, 2022 7 commits
    • Stanislav Fomichev's avatar
      bpf: fix lsm_cgroup build errors on esoteric configs · 3908fcdd
      Stanislav Fomichev authored
      This particular ones is about having the following:
       CONFIG_BPF_LSM=y
       # CONFIG_CGROUP_BPF is not set
      
      Also, add __maybe_unused to the args for the !CONFIG_NET cases.
      Reported-by: default avatarkernel test robot <lkp@intel.com>
      Signed-off-by: default avatarStanislav Fomichev <sdf@google.com>
      Acked-by: default avatarYonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/r/20220714185404.3647772-1-sdf@google.comSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      3908fcdd
    • Alexei Starovoitov's avatar
      Merge branch 'Add SEC("ksyscall") support' · ab850abb
      Alexei Starovoitov authored
      Andrii Nakryiko says:
      
      ====================
      
      Add SEC("ksyscall")/SEC("kretsyscall") sections and corresponding
      bpf_program__attach_ksyscall() API that simplifies tracing kernel syscalls
      through kprobe mechanism. Kprobing syscalls isn't trivial due to varying
      syscall handler names in the kernel and various ways syscall argument are
      passed, depending on kernel architecture and configuration. SEC("ksyscall")
      allows user to not care about such details and just get access to syscall
      input arguments, while libbpf takes care of necessary feature detection logic.
      
      There are still more quirks that are not straightforward to hide completely
      (see comments about mmap(), clone() and compat syscalls), so in such more
      advanced scenarios user might need to fall back to plain SEC("kprobe")
      approach, but for absolute majority of users SEC("ksyscall") is a big
      improvement.
      
      As part of this patch set libbpf adds two more virtual __kconfig externs, in
      addition to existing LINUX_KERNEL_VERSION: LINUX_HAS_BPF_COOKIE and
      LINUX_HAS_SYSCALL_WRAPPER, which let's libbpf-provided BPF-side code minimize
      external dependencies and assumptions and let's user-space part of libbpf to
      perform all the feature detection logic. This benefits USDT support code,
      which now doesn't depend on BPF CO-RE for its functionality.
      
      v1->v2:
        - normalize extern variable-related warn and debug message formats (Alan);
      rfc->v1:
        - drop dependency on kallsyms and speed up SYSCALL_WRAPPER detection (Alexei);
        - drop dependency on /proc/config.gz in bpf_tracing.h (Yaniv);
        - add doc comment and ephasize mmap(), clone() and compat quirks that are
          not supported (Ilya);
        - use mechanism similar to LINUX_KERNEL_VERSION to also improve USDT code.
      ====================
      Reviewed-by: default avatarStanislav Fomichev <sdf@google.com>
      Signed-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ab850abb
    • Andrii Nakryiko's avatar
      selftests/bpf: use BPF_KSYSCALL and SEC("ksyscall") in selftests · d814ed62
      Andrii Nakryiko authored
      Convert few selftest that used plain SEC("kprobe") with arch-specific
      syscall wrapper prefix to ksyscall/kretsyscall and corresponding
      BPF_KSYSCALL macro. test_probe_user.c is especially benefiting from this
      simplification.
      Tested-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220714070755.3235561-6-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      d814ed62
    • Andrii Nakryiko's avatar
      libbpf: add ksyscall/kretsyscall sections support for syscall kprobes · 708ac5be
      Andrii Nakryiko authored
      Add SEC("ksyscall")/SEC("ksyscall/<syscall_name>") and corresponding
      kretsyscall variants (for return kprobes) to allow users to kprobe
      syscall functions in kernel. These special sections allow to ignore
      complexities and differences between kernel versions and host
      architectures when it comes to syscall wrapper and corresponding
      __<arch>_sys_<syscall> vs __se_sys_<syscall> differences, depending on
      whether host kernel has CONFIG_ARCH_HAS_SYSCALL_WRAPPER (though libbpf
      itself doesn't rely on /proc/config.gz for detecting this, see
      BPF_KSYSCALL patch for how it's done internally).
      
      Combined with the use of BPF_KSYSCALL() macro, this allows to just
      specify intended syscall name and expected input arguments and leave
      dealing with all the variations to libbpf.
      
      In addition to SEC("ksyscall+") and SEC("kretsyscall+") add
      bpf_program__attach_ksyscall() API which allows to specify syscall name
      at runtime and provide associated BPF cookie value.
      
      At the moment SEC("ksyscall") and bpf_program__attach_ksyscall() do not
      handle all the calling convention quirks for mmap(), clone() and compat
      syscalls. It also only attaches to "native" syscall interfaces. If host
      system supports compat syscalls or defines 32-bit syscalls in 64-bit
      kernel, such syscall interfaces won't be attached to by libbpf.
      
      These limitations may or may not change in the future. Therefore it is
      recommended to use SEC("kprobe") for these syscalls or if working with
      compat and 32-bit interfaces is required.
      Tested-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220714070755.3235561-5-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      708ac5be
    • Andrii Nakryiko's avatar
      libbpf: improve BPF_KPROBE_SYSCALL macro and rename it to BPF_KSYSCALL · 6f5d467d
      Andrii Nakryiko authored
      Improve BPF_KPROBE_SYSCALL (and rename it to shorter BPF_KSYSCALL to
      match libbpf's SEC("ksyscall") section name, added in next patch) to use
      __kconfig variable to determine how to properly fetch syscall arguments.
      
      Instead of relying on hard-coded knowledge of whether kernel's
      architecture uses syscall wrapper or not (which only reflects the latest
      kernel versions, but is not necessarily true for older kernels and won't
      necessarily hold for later kernel versions on some particular host
      architecture), determine this at runtime by attempting to create
      perf_event (with fallback to kprobe event creation through tracefs on
      legacy kernels, just like kprobe attachment code is doing) for kernel
      function that would correspond to bpf() syscall on a system that has
      CONFIG_ARCH_HAS_SYSCALL_WRAPPER set (e.g., for x86-64 it would try
      '__x64_sys_bpf').
      
      If host kernel uses syscall wrapper, syscall kernel function's first
      argument is a pointer to struct pt_regs that then contains syscall
      arguments. In such case we need to use bpf_probe_read_kernel() to fetch
      actual arguments (which we do through BPF_CORE_READ() macro) from inner
      pt_regs.
      
      But if the kernel doesn't use syscall wrapper approach, input
      arguments can be read from struct pt_regs directly with no probe reading.
      
      All this feature detection is done without requiring /proc/config.gz
      existence and parsing, and BPF-side helper code uses newly added
      LINUX_HAS_SYSCALL_WRAPPER virtual __kconfig extern to keep in sync with
      user-side feature detection of libbpf.
      
      BPF_KSYSCALL() macro can be used both with SEC("kprobe") programs that
      define syscall function explicitly (e.g., SEC("kprobe/__x64_sys_bpf"))
      and SEC("ksyscall") program added in the next patch (which are the same
      kprobe program with added benefit of libbpf determining correct kernel
      function name automatically).
      
      Kretprobe and kretsyscall (added in next patch) programs don't need
      BPF_KSYSCALL as they don't provide access to input arguments. Normal
      BPF_KRETPROBE is completely sufficient and is recommended.
      Tested-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220714070755.3235561-4-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      6f5d467d
    • Andrii Nakryiko's avatar
      selftests/bpf: add test of __weak unknown virtual __kconfig extern · ce6dc74a
      Andrii Nakryiko authored
      Exercise libbpf's logic for unknown __weak virtual __kconfig externs.
      USDT selftests are already excercising non-weak known virtual extern
      already (LINUX_HAS_BPF_COOKIE), so no need to add explicit tests for it.
      Tested-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220714070755.3235561-3-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      ce6dc74a
    • Andrii Nakryiko's avatar
      libbpf: generalize virtual __kconfig externs and use it for USDT · 55d00c37
      Andrii Nakryiko authored
      Libbpf supports single virtual __kconfig extern currently: LINUX_KERNEL_VERSION.
      LINUX_KERNEL_VERSION isn't coming from /proc/kconfig.gz and is intead
      customly filled out by libbpf.
      
      This patch generalizes this approach to support more such virtual
      __kconfig externs. One such extern added in this patch is
      LINUX_HAS_BPF_COOKIE which is used for BPF-side USDT supporting code in
      usdt.bpf.h instead of using CO-RE-based enum detection approach for
      detecting bpf_get_attach_cookie() BPF helper. This allows to remove
      otherwise not needed CO-RE dependency and keeps user-space and BPF-side
      parts of libbpf's USDT support strictly in sync in terms of their
      feature detection.
      
      We'll use similar approach for syscall wrapper detection for
      BPF_KSYSCALL() BPF-side macro in follow up patch.
      
      Generally, currently libbpf reserves CONFIG_ prefix for Kconfig values
      and LINUX_ for virtual libbpf-backed externs. In the future we might
      extend the set of prefixes that are supported. This can be done without
      any breaking changes, as currently any __kconfig extern with
      unrecognized name is rejected.
      
      For LINUX_xxx externs we support the normal "weak rule": if libbpf
      doesn't recognize given LINUX_xxx extern but such extern is marked as
      __weak, it is not rejected and defaults to zero.  This follows
      CONFIG_xxx handling logic and will allow BPF applications to
      opportunistically use newer libbpf virtual externs without breaking on
      older libbpf versions unnecessarily.
      Tested-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Reviewed-by: default avatarAlan Maguire <alan.maguire@oracle.com>
      Signed-off-by: default avatarAndrii Nakryiko <andrii@kernel.org>
      Link: https://lore.kernel.org/r/20220714070755.3235561-2-andrii@kernel.orgSigned-off-by: default avatarAlexei Starovoitov <ast@kernel.org>
      55d00c37
  2. 15 Jul, 2022 8 commits
  3. 14 Jul, 2022 7 commits
  4. 13 Jul, 2022 4 commits
  5. 12 Jul, 2022 6 commits
  6. 11 Jul, 2022 8 commits
    • Xu Kuohai's avatar
      bpf, arm64: Add bpf trampoline for arm64 · efc9909f
      Xu Kuohai authored
      This is arm64 version of commit fec56f58 ("bpf: Introduce BPF
      trampoline"). A bpf trampoline converts native calling convention to bpf
      calling convention and is used to implement various bpf features, such
      as fentry, fexit, fmod_ret and struct_ops.
      
      This patch does essentially the same thing that bpf trampoline does on x86.
      
      Tested on Raspberry Pi 4B and qemu:
      
       #18 /1     bpf_tcp_ca/dctcp:OK
       #18 /2     bpf_tcp_ca/cubic:OK
       #18 /3     bpf_tcp_ca/invalid_license:OK
       #18 /4     bpf_tcp_ca/dctcp_fallback:OK
       #18 /5     bpf_tcp_ca/rel_setsockopt:OK
       #18        bpf_tcp_ca:OK
       #51 /1     dummy_st_ops/dummy_st_ops_attach:OK
       #51 /2     dummy_st_ops/dummy_init_ret_value:OK
       #51 /3     dummy_st_ops/dummy_init_ptr_arg:OK
       #51 /4     dummy_st_ops/dummy_multiple_args:OK
       #51        dummy_st_ops:OK
       #57 /1     fexit_bpf2bpf/target_no_callees:OK
       #57 /2     fexit_bpf2bpf/target_yes_callees:OK
       #57 /3     fexit_bpf2bpf/func_replace:OK
       #57 /4     fexit_bpf2bpf/func_replace_verify:OK
       #57 /5     fexit_bpf2bpf/func_sockmap_update:OK
       #57 /6     fexit_bpf2bpf/func_replace_return_code:OK
       #57 /7     fexit_bpf2bpf/func_map_prog_compatibility:OK
       #57 /8     fexit_bpf2bpf/func_replace_multi:OK
       #57 /9     fexit_bpf2bpf/fmod_ret_freplace:OK
       #57        fexit_bpf2bpf:OK
       #237       xdp_bpf2bpf:OK
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Acked-by: default avatarKP Singh <kpsingh@kernel.org>
      Link: https://lore.kernel.org/bpf/20220711150823.2128542-5-xukuohai@huawei.com
      efc9909f
    • Xu Kuohai's avatar
      bpf, arm64: Implement bpf_arch_text_poke() for arm64 · b2ad54e1
      Xu Kuohai authored
      Implement bpf_arch_text_poke() for arm64, so bpf prog or bpf trampoline
      can be patched with it.
      
      When the target address is NULL, the original instruction is patched to
      a NOP.
      
      When the target address and the source address are within the branch
      range, the original instruction is patched to a bl instruction to the
      target address directly.
      
      To support attaching bpf trampoline to both regular kernel function and
      bpf prog, we follow the ftrace patchsite way for bpf prog. That is, two
      instructions are inserted at the beginning of bpf prog, the first one
      saves the return address to x9, and the second is a nop which will be
      patched to a bl instruction when a bpf trampoline is attached.
      
      However, when a bpf trampoline is attached to bpf prog, the distance
      between target address and source address may exceed 128MB, the maximum
      branch range, because bpf trampoline and bpf prog are allocated
      separately with vmalloc. So long jump should be handled.
      
      When a bpf prog is constructed, a plt pointing to empty trampoline
      dummy_tramp is placed at the end:
      
              bpf_prog:
                      mov x9, lr
                      nop // patchsite
                      ...
                      ret
      
              plt:
                      ldr x10, target
                      br x10
              target:
                      .quad dummy_tramp // plt target
      
      This is also the state when no trampoline is attached.
      
      When a short-jump bpf trampoline is attached, the patchsite is patched to
      a bl instruction to the trampoline directly:
      
              bpf_prog:
                      mov x9, lr
                      bl <short-jump bpf trampoline address> // patchsite
                      ...
                      ret
      
              plt:
                      ldr x10, target
                      br x10
              target:
                      .quad dummy_tramp // plt target
      
      When a long-jump bpf trampoline is attached, the plt target is filled with
      the trampoline address and the patchsite is patched to a bl instruction to
      the plt:
      
              bpf_prog:
                      mov x9, lr
                      bl plt // patchsite
                      ...
                      ret
      
              plt:
                      ldr x10, target
                      br x10
              target:
                      .quad <long-jump bpf trampoline address>
      
      dummy_tramp is used to prevent another CPU from jumping to an unknown
      location during the patching process, making the patching process easier.
      
      The patching process is as follows:
      
      1. when neither the old address or the new address is a long jump, the
         patchsite is replaced with a bl to the new address, or nop if the new
         address is NULL;
      
      2. when the old address is not long jump but the new one is, the
         branch target address is written to plt first, then the patchsite
         is replaced with a bl instruction to the plt;
      
      3. when the old address is long jump but the new one is not, the address
         of dummy_tramp is written to plt first, then the patchsite is replaced
         with a bl to the new address, or a nop if the new address is NULL;
      
      4. when both the old address and the new address are long jump, the
         new address is written to plt and the patchsite is not changed.
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJakub Sitnicki <jakub@cloudflare.com>
      Reviewed-by: default avatarKP Singh <kpsingh@kernel.org>
      Reviewed-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20220711150823.2128542-4-xukuohai@huawei.com
      b2ad54e1
    • Xu Kuohai's avatar
      arm64: Add LDR (literal) instruction · f1e8a24e
      Xu Kuohai authored
      Add LDR (literal) instruction to load data from address relative to PC.
      This instruction will be used to implement long jump from bpf prog to
      bpf trampoline in the follow-up patch.
      
      The instruction encoding:
      
          3       2   2     2                                     0        0
          0       7   6     4                                     5        0
      +-----+-------+---+-----+-------------------------------------+--------+
      | 0 x | 0 1 1 | 0 | 0 0 |                imm19                |   Rt   |
      +-----+-------+---+-----+-------------------------------------+--------+
      
      for 32-bit, variant x == 0; for 64-bit, x == 1.
      
      branch_imm_common() is used to check the distance between pc and target
      address, since it's reused by this patch and LDR (literal) is not a branch
      instruction, rename it to label_imm_common().
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Acked-by: default avatarWill Deacon <will@kernel.org>
      Link: https://lore.kernel.org/bpf/20220711150823.2128542-3-xukuohai@huawei.com
      f1e8a24e
    • Xu Kuohai's avatar
      bpf: Remove is_valid_bpf_tramp_flags() · 535a57a7
      Xu Kuohai authored
      Before generating bpf trampoline, x86 calls is_valid_bpf_tramp_flags()
      to check the input flags. This check is architecture independent.
      So, to be consistent with x86, arm64 should also do this check
      before generating bpf trampoline.
      
      However, the BPF_TRAMP_F_XXX flags are not used by user code and the
      flags argument is almost constant at compile time, so this run time
      check is a bit redundant.
      
      Remove is_valid_bpf_tramp_flags() and add some comments to the usage of
      BPF_TRAMP_F_XXX flags, as suggested by Alexei.
      Signed-off-by: default avatarXu Kuohai <xukuohai@huawei.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: default avatarJean-Philippe Brucker <jean-philippe@linaro.org>
      Acked-by: default avatarSong Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20220711150823.2128542-2-xukuohai@huawei.com
      535a57a7
    • Liu Jian's avatar
      skmsg: Fix invalid last sg check in sk_msg_recvmsg() · 9974d37e
      Liu Jian authored
      In sk_psock_skb_ingress_enqueue function, if the linear area + nr_frags +
      frag_list of the SKB has NR_MSG_FRAG_IDS blocks in total, skb_to_sgvec
      will return NR_MSG_FRAG_IDS, then msg->sg.end will be set to
      NR_MSG_FRAG_IDS, and in addition, (NR_MSG_FRAG_IDS - 1) is set to the last
      SG of msg. Recv the msg in sk_msg_recvmsg, when i is (NR_MSG_FRAG_IDS - 1),
      the sk_msg_iter_var_next(i) will change i to 0 (not NR_MSG_FRAG_IDS), the
      judgment condition "msg_rx->sg.start==msg_rx->sg.end" and
      "i != msg_rx->sg.end" can not work.
      
      As a result, the processed msg cannot be deleted from ingress_msg list.
      But the length of all the sge of the msg has changed to 0. Then the next
      recvmsg syscall will process the msg repeatedly, because the length of sge
      is 0, the -EFAULT error is always returned.
      
      Fixes: 604326b4 ("bpf, sockmap: convert to generic sk_msg interface")
      Signed-off-by: default avatarLiu Jian <liujian56@huawei.com>
      Signed-off-by: default avatarDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: default avatarJohn Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20220628123616.186950-1-liujian56@huawei.com
      9974d37e
    • Jilin Yuan's avatar
      fddi/skfp: fix repeated words in comments · edb2c347
      Jilin Yuan authored
      Delete the redundant word 'test'.
      Signed-off-by: default avatarJilin Yuan <yuanjilin@cdjrlc.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      edb2c347
    • Jilin Yuan's avatar
      ethernet/via: fix repeated words in comments · 1377a5b2
      Jilin Yuan authored
      Delete the redundant word 'driver'.
      Signed-off-by: default avatarJilin Yuan <yuanjilin@cdjrlc.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      1377a5b2
    • sewookseo's avatar
      net: Find dst with sk's xfrm policy not ctl_sk · e22aa148
      sewookseo authored
      If we set XFRM security policy by calling setsockopt with option
      IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock'
      struct. However tcp_v6_send_response doesn't look up dst_entry with the
      actual socket but looks up with tcp control socket. This may cause a
      problem that a RST packet is sent without ESP encryption & peer's TCP
      socket can't receive it.
      This patch will make the function look up dest_entry with actual socket,
      if the socket has XFRM policy(sock_policy), so that the TCP response
      packet via this function can be encrypted, & aligned on the encrypted
      TCP socket.
      
      Tested: We encountered this problem when a TCP socket which is encrypted
      in ESP transport mode encryption, receives challenge ACK at SYN_SENT
      state. After receiving challenge ACK, TCP needs to send RST to
      establish the socket at next SYN try. But the RST was not encrypted &
      peer TCP socket still remains on ESTABLISHED state.
      So we verified this with test step as below.
      [Test step]
      1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED).
      2. Client tries a new connection on the same TCP ports(src & dst).
      3. Server will return challenge ACK instead of SYN,ACK.
      4. Client will send RST to server to clear the SOCKET.
      5. Client will retransmit SYN to server on the same TCP ports.
      [Expected result]
      The TCP connection should be established.
      
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Sehee Lee <seheele@google.com>
      Signed-off-by: default avatarSewook Seo <sewookseo@google.com>
      Reviewed-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e22aa148