1. 14 Aug, 2019 6 commits
  2. 13 Aug, 2019 10 commits
      net: devlink: remove redundant rtnl lock assert · 043b8413
      Vlad Buslov authored
      It is enough for caller of devlink_compat_switch_id_get() to hold the net
      device to guarantee that devlink port is not destroyed concurrently. Remove
      rtnl lock assertion and modify comment to warn user that they must hold
      either rtnl lock or reference to net device. This is necessary to
      accommodate future implementation of rtnl-unlocked TC offloads driver
      callbacks.
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      043b8413
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 708852dc
      Jakub Kicinski authored
      Daniel Borkmann says:
      
      ====================
      The following pull-request contains BPF updates for your *net-next* tree.
      
      There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
      as well):
      
              for (i = 1; i <= btf__get_nr_types(btf); i++) {
                      t = (struct btf_type *)btf__type_by_id(btf, i);
      
                      if (!has_datasec && btf_is_var(t)) {
                              /* replace VAR with INT */
                              t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
        <<<<<<< HEAD
                              /*
                               * using size = 1 is the safest choice, 4 will be too
                               * big and cause kernel BTF validation failure if
                               * original variable took less than 4 bytes
                               */
                              t->size = 1;
                              *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                      } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
        =======
                              t->size = sizeof(int);
                              *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                      } else if (!has_datasec && btf_is_datasec(t)) {
        >>>>>>> 72ef80b5
                              /* replace DATASEC with STRUCT */
      
      Conflict is between the two commits 1d4126c4 ("libbpf: sanitize VAR to
      conservative 1-byte INT") and b03bc685 ("libbpf: convert libbpf code to
      use new btf helpers"), so we need to pick the sanitation fixup as well as
      use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
      the following:
      
        [...]
                      if (!has_datasec && btf_is_var(t)) {
                              /* replace VAR with INT */
                              t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                              /*
                               * using size = 1 is the safest choice, 4 will be too
                               * big and cause kernel BTF validation failure if
                               * original variable took less than 4 bytes
                               */
                              t->size = 1;
                              *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
                      } else if (!has_datasec && btf_is_datasec(t)) {
                              /* replace DATASEC with STRUCT */
        [...]
      
      The main changes are:
      
      1) Addition of core parts of compile once - run everywhere (co-re) effort,
         that is, relocation of field offsets in libbpf as well as exposure of
         kernel's own BTF via sysfs and loading through libbpf, from Andrii.
      
         More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
         and http://vger.kernel.org/lpc-bpf2018.html#session-2
      
      2) Enable passing input flags to the BPF flow dissector to customize parsing
         and allowing it to stop early similar to the C based one, from Stanislav.
      
      3) Add a BPF helper function that allows generating SYN cookies from XDP and
         tc BPF, from Petar.
      
      4) Add devmap hash-based map type for more flexibility in device lookup for
         redirects, from Toke.
      
      5) Improvements to XDP forwarding sample code now utilizing recently enabled
         devmap lookups, from Jesper.
      
      6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
         and Takshak.
      
      7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.
      
      8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.
      
      9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.
      
      10) Add perf event output helper also for other skb-based program types, from Allan.
      
      11) Fix a co-re related compilation error in selftests, from Yonghong.
      ====================
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      708852dc
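The conflict resolution quoted above rewrites a VAR type as a 1-byte INT. A minimal sketch of what the two encoding macros produce is given below. The macros are local copies written for illustration, and `struct demo_int_type`, `demo_kind()` and `demo_sanitize_var()` are hypothetical names, not libbpf code. The kind numbers (INT = 1, VAR = 14) follow the kernel UAPI header linux/btf.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* BTF kind numbers as defined in the kernel UAPI header linux/btf.h. */
#define BTF_KIND_INT 1
#define BTF_KIND_VAR 14
#define BTF_MAX_VLEN 0xffff

/* Local copies of the encoding macros used in the quoted resolution. */
#define BTF_INFO_ENC(kind, kind_flag, vlen) \
	((uint32_t)(!!(kind_flag)) << 31 | (uint32_t)(kind) << 24 | \
	 ((vlen) & BTF_MAX_VLEN))
#define BTF_INT_ENC(encoding, bits_offset, nr_bits) \
	((uint32_t)(encoding) << 24 | (uint32_t)(bits_offset) << 16 | (nr_bits))

/*
 * Hypothetical stand-in for a struct btf_type followed by one 32-bit
 * payload word, i.e. the word the real code reaches through (t + 1).
 */
struct demo_int_type {
	uint32_t name_off;
	uint32_t info;
	uint32_t size;
	uint32_t payload;
};

/* Extract the kind field, which is stored from bit 24 upwards in info. */
static uint32_t demo_kind(const struct demo_int_type *t)
{
	return (t->info >> 24) & 0x1f;
}

/* Rewrite a VAR as a conservative 1-byte, 8-bit INT; leave other kinds. */
static void demo_sanitize_var(struct demo_int_type *t)
{
	assert(t != NULL);
	if (demo_kind(t) != BTF_KIND_VAR)
		return;
	t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
	t->size = 1;                       /* 4 could exceed the original size */
	t->payload = BTF_INT_ENC(0, 0, 8); /* no encoding, offset 0, 8 bits */
}
```

Calling `demo_sanitize_var()` on a record whose `info` was built with `BTF_INFO_ENC(BTF_KIND_VAR, 0, 0)` leaves `size` at 1 and `payload` at 8, which matches the HEAD side of the conflict rather than the `sizeof(int)` / 32-bit variant.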
      net: hns3: Make hclge_func_reset_sync_vf static · a9a96760
      YueHaibing authored
      Fix sparse warning:
      
      drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_main.c:3190:5:
       warning: symbol 'hclge_func_reset_sync_vf' was not declared. Should it be static?
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      a9a96760
      devlink: send notifications for deleted snapshots on region destroy · 92b49822
      Jiri Pirko authored
      Currently the notifications for deleted snapshots are sent only in case
      user deletes a snapshot manually. Send the notifications in case region
      is destroyed too.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      92b49822
      Merge branch 'bpf-libbpf-read-sysfs-btf' · 72ef80b5
      Daniel Borkmann authored
      Andrii Nakryiko says:
      
      ====================
      Now that kernel's BTF is exposed through sysfs at well-known location, attempt
      to load it first as a target BTF for the purpose of BPF CO-RE relocations.
      
      Patch #1 is a follow-up patch to rename /sys/kernel/btf/kernel into
      /sys/kernel/btf/vmlinux.
      
      Patch #2 adds ability to load raw BTF contents from sysfs and expands the list
      of locations libbpf attempts to load vmlinux BTF from.
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      72ef80b5
      libbpf: attempt to load kernel BTF from sysfs first · a1916a15
      Andrii Nakryiko authored
      Add support for loading kernel BTF from sysfs (/sys/kernel/btf/vmlinux)
      as a target BTF. Also extend the list of on disk search paths for
      vmlinux ELF image with entries that perf is searching for.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a1916a15
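As a rough illustration of the sysfs path described in the commit above, the sketch below reads a file into memory and checks that it starts with the BTF magic number (0xeB9F, from the kernel UAPI header linux/btf.h). This is not libbpf's implementation: `demo_read_all()`, `demo_has_btf_magic()` and `demo_load_raw_btf()` are hypothetical helpers, and the on-disk vmlinux ELF fallback mentioned in the commit is not shown because it needs ELF parsing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BTF_MAGIC 0xeB9F /* value from the kernel UAPI header linux/btf.h */

/* Read a whole file into a malloc'ed buffer; returns NULL on any error. */
static uint8_t *demo_read_all(const char *path, size_t *out_len)
{
	assert(path != NULL && out_len != NULL);
	FILE *f = fopen(path, "rb");
	if (!f)
		return NULL;

	size_t cap = 4096, len = 0;
	uint8_t *buf = malloc(cap);
	while (buf) {
		size_t n = fread(buf + len, 1, cap - len, f);
		len += n;
		if (n == 0)
			break; /* end of file or read error */
		if (len == cap) { /* buffer full: double it */
			uint8_t *tmp = realloc(buf, cap * 2);
			if (!tmp) {
				free(buf);
				buf = NULL;
				break;
			}
			buf = tmp;
			cap *= 2;
		}
	}
	if (buf && ferror(f)) {
		free(buf);
		buf = NULL;
	}
	fclose(f);
	if (buf)
		*out_len = len;
	return buf;
}

/* Raw BTF starts with a 16-bit magic stored in the producer's byte order. */
static int demo_has_btf_magic(const uint8_t *data, size_t len)
{
	uint16_t magic;

	if (!data || len < sizeof(magic))
		return 0;
	memcpy(&magic, data, sizeof(magic));
	return magic == BTF_MAGIC;
}

/* Load raw BTF from path; returns NULL if unreadable or not BTF. */
static uint8_t *demo_load_raw_btf(const char *path, size_t *out_len)
{
	uint8_t *buf = demo_read_all(path, out_len);

	if (buf && !demo_has_btf_magic(buf, *out_len)) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

On a kernel carrying these patches, a caller would pass "/sys/kernel/btf/vmlinux" to `demo_load_raw_btf()` and fall back to other locations when it returns NULL. The caller owns the returned buffer and must free() it.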
      btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinux · 7fd78568
      Andrii Nakryiko authored
      Expose kernel's BTF under the name vmlinux to be more uniform with using
      kernel module names as file names in the future.
      
      Fixes: 341dfcf8 ("btf: expose BTF info through sysfs")
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      7fd78568
      selftests/bpf: fix race in flow dissector tests · 9840a4ff
      Petar Penkov authored
      Since the "last_dissection" map holds only the flow keys for the most
      recent packet, there is a small race in the skb-less flow dissector
      tests if a new packet comes between transmitting the test packet, and
      reading its keys from the map. If this happens, the test packet keys
      will be overwritten and the test will fail.
      
      Changing the "last_dissection" map to a hash map, keyed on the
      source/dest port pair resolves this issue. Additionally, let's clear the
      last test results from the map between tests to prevent previous test
      cases from interfering with the following test cases.
      
      Fixes: 0905beec ("selftests/bpf: run flow dissector tests in skb-less mode")
      Signed-off-by: Petar Penkov <ppenkov@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      9840a4ff
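The fix described above replaces a single "most recent packet" slot with a map keyed on the source/destination port pair. A minimal user-space sketch of that idea follows. It is not the selftest code: the key layout (source port in the upper 16 bits), the `uint32_t` value standing in for the flow keys, and all `demo_*` names are assumptions made for illustration, and byte-order handling of the ports is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_SLOTS 64

/* Pack the port pair into one 32-bit key (assumed layout). */
static uint32_t demo_port_key(uint16_t sport, uint16_t dport)
{
	return ((uint32_t)sport << 16) | dport;
}

struct demo_entry {
	int used;
	uint32_t key;
	uint32_t value; /* stand-in for the dissected flow keys */
};

struct demo_map {
	struct demo_entry slots[DEMO_SLOTS];
};

/* Drop all entries, as done between test cases. */
static void demo_clear(struct demo_map *m)
{
	assert(m != NULL);
	memset(m, 0, sizeof(*m));
}

/* Insert or overwrite using linear probing; returns -1 if the map is full. */
static int demo_update(struct demo_map *m, uint32_t key, uint32_t value)
{
	assert(m != NULL);
	for (size_t i = 0; i < DEMO_SLOTS; i++) {
		struct demo_entry *e = &m->slots[(key + i) % DEMO_SLOTS];

		if (!e->used || e->key == key) {
			e->used = 1;
			e->key = key;
			e->value = value;
			return 0;
		}
	}
	return -1;
}

/* Return a pointer to the stored value, or NULL if the key is absent. */
static const uint32_t *demo_lookup(const struct demo_map *m, uint32_t key)
{
	assert(m != NULL);
	for (size_t i = 0; i < DEMO_SLOTS; i++) {
		const struct demo_entry *e = &m->slots[(key + i) % DEMO_SLOTS];

		if (!e->used)
			return NULL; /* entries are never deleted singly */
		if (e->key == key)
			return &e->value;
	}
	return NULL;
}
```

With this layout, an unrelated packet that arrives between sending the test packet and reading the result lands under a different key, so the test packet's entry survives. Calling `demo_clear()` between cases keeps a stale entry from an earlier case from satisfying a later lookup.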
      tools: bpftool: add feature check for zlib · d66fa3c7
      Peter Wu authored
      bpftool requires libelf, and zlib for decompressing /proc/config.gz.
      zlib is a transitive dependency via libelf, and became mandatory since
      elfutils 0.165 (Jan 2016). The feature check of libelf is already done
      in the elfdep target of tools/lib/bpf/Makefile, pulled in by bpftool via
      a dependency on libbpf.a. Add a similar feature check for zlib.
      Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Peter Wu <peter@lekensteyn.nl>
      Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      d66fa3c7
      btf: expose BTF info through sysfs · 341dfcf8
      Andrii Nakryiko authored
      Make .BTF section allocated and expose its contents through sysfs.
      
      /sys/kernel/btf directory is created to contain all the BTFs present
      inside kernel. Currently there is only kernel's main BTF, represented as
      /sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
      each module will expose its BTF as /sys/kernel/btf/<module-name> file.
      
      Current approach relies on a few pieces coming together:
      1. pahole is used to take almost final vmlinux image (modulo .BTF and
         kallsyms) and generate .BTF section by converting DWARF info into
         BTF. This section is not allocated and not mapped to any segment,
         though, so is not yet accessible from inside kernel at runtime.
      2. objcopy dumps .BTF contents into binary file and subsequently
         converts binary file into linkable object file with automatically
         generated symbols _binary__btf_kernel_bin_start and
         _binary__btf_kernel_bin_end, pointing to start and end, respectively,
         of BTF raw data.
      3. final vmlinux image is generated by linking this object file (and
         kallsyms, if necessary). sysfs_btf.c then creates
         /sys/kernel/btf/kernel file and exposes embedded BTF contents through
         it. This allows, e.g., libbpf and bpftool to access BTF info at
         well-known location, without resorting to searching for vmlinux image
         on disk (location of which is not standardized and vmlinux image
         might not be even available in some scenarios, e.g., inside qemu
         during testing).
      
      Alternative approach using .incbin assembler directive to embed BTF
      contents directly was attempted but didn't work, because sysfs_btf.o is
      not re-compiled during link-vmlinux.sh stage. This is required, though,
      to update embedded BTF data (initially empty data is embedded, then
      pahole generates BTF info and we need to regenerate sysfs_btf.o with
      updated contents, but it's too late at that point).
      
      If BTF couldn't be generated due to missing or too old pahole,
      sysfs_btf.c handles that gracefully by detecting that
      _binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
      /sys/kernel/btf at all.
      
      v2->v3:
      - added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H);
      - created proper kobject (btf_kobj) for btf directory (Greg K-H);
      - undo v2 change of reusing vmlinux, as it causes extra kallsyms pass
        due to initially missing _binary__btf_kernel_bin_{start/end} symbols;
      
      v1->v2:
      - allow kallsyms stage to re-use vmlinux generated by gen_btf();
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      341dfcf8
  3. 12 Aug, 2019 18 commits
  4. 11 Aug, 2019 6 commits
      Merge branch 'drop_monitor-Capture-dropped-packets-and-metadata' · 6e5ee483
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      drop_monitor: Capture dropped packets and metadata
      
      So far drop monitor supported only one mode of operation in which a
      summary of recent packet drops is periodically sent to user space as a
      netlink event. The event only includes the drop location (program
      counter) and number of drops in the last interval.
      
      While this mode of operation allows one to understand if the system is
      dropping packets, it is not sufficient if a more detailed analysis is
      required. Both the packet itself and related metadata are missing.
      
      This patchset extends drop monitor with another mode of operation where
      the packet - potentially truncated - and metadata (e.g., drop location,
      timestamp, netdev) are sent to user space as a netlink event. Thanks to
      the extensible nature of netlink, more metadata can be added in the
      future.
      
      To avoid performing expensive operations in the context in which
      kfree_skb() is called, the dropped skbs are cloned and queued on per-CPU
      skb drop list. The list is then processed in process context (using a
      workqueue), where the netlink messages are allocated, prepared and
      finally sent to user space.
      
      A follow-up patchset will integrate drop monitor with devlink and allow
      the latter to call into drop monitor to report hardware drops. In the
      future, XDP drops can be added as well, thereby making drop monitor the
      go-to netlink channel for diagnosing all packet drops.
      
      Example usage with patched dropwatch [1] can be found here [2]. Example
      dissection of drop monitor netlink events with patched wireshark [3] can
      be found here [4]. I will submit both changes upstream after the kernel
      changes are accepted. Another change worth making is adding a dropmon
      pseudo interface to libpcap, similar to the nflog interface [5]. This
      will allow users to specifically listen on dropmon traffic instead of
      capturing all netlink packets via the nlmon netdev.
      
      Patches #1-#5 prepare the code towards the actual changes in later
      patches.
      
      Patch #6 adds another mode of operation to drop monitor in which the
      dropped packet itself is notified to user space along with metadata.
      
      Patch #7 allows users to truncate reported packets to a specific length,
      in case only the headers are of interest. The original length of the
      packet is added as metadata to the netlink notification.
      
      Patch #8 allows users to query the current configuration of drop monitor
      (e.g., alert mode, truncation length).
      
      Patches #9-#10 allow users to tune the length of the per-CPU skb drop
      list according to their needs.
      
      Changes since v1 [6]:
      * Add skb protocol as metadata. This allows user space to correctly
        dissect the packet instead of blindly assuming it is an Ethernet
        packet
      
      Changes since RFC [7]:
      * Limit the length of the per-CPU skb drop list and make it configurable
      * Do not use the hysteresis timer in packet alert mode
      * Introduce alert mode operations in a separate patch and only then
        introduce the new alert mode
      * Use 'skb->skb_iif' instead of 'skb->dev' because the latter is inside
        a union with 'dev_scratch' and therefore not guaranteed to point to a
        valid netdev
      * Return '-EBUSY' instead of '-EOPNOTSUPP' when trying to configure drop
        monitor while it is monitoring
      * Did not change schedule_work() in favor of schedule_work_on() as I did
        not observe a change in number of tail drops
      
      [1] https://github.com/idosch/dropwatch/tree/packet-mode
      [2] https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile1-txt
      [3] https://github.com/idosch/wireshark/tree/drop-monitor-v2
      [4] https://gist.github.com/idosch/3d524b887e16bc11b4b19e25c23dcc23#file-gistfile2-txt
      [5] https://github.com/the-tcpdump-group/libpcap/blob/master/pcap-netfilter-linux.c
      [6] https://patchwork.ozlabs.org/cover/1143443/
      [7] https://patchwork.ozlabs.org/cover/1135226/
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e5ee483
      drop_monitor: Expose tail drop counter · e9feb580
      Ido Schimmel authored
      Previous patch made the length of the per-CPU skb drop list
      configurable. Expose a counter that shows how many packets could not be
      enqueued to this list.
      
      This allows users to determine the desired queue length.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e9feb580
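The bounded drop list and its tail drop counter can be pictured with a small user-space queue. This is only a sketch under stated assumptions: the kernel queues cloned skbs per CPU, while here a `uint32_t` stands in for each packet, there is a single queue, and `struct demo_drop_queue` and the `demo_*` functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SLOTS 4096 /* storage bound of this sketch only */

struct demo_drop_queue {
	uint32_t items[DEMO_MAX_SLOTS]; /* stand-in for queued skb clones */
	size_t head;                    /* index of the oldest item */
	size_t count;                   /* number of queued items */
	size_t limit;                   /* user-configurable queue length */
	uint64_t tail_drops;            /* packets rejected because full */
};

/* Set the configurable limit, capped by the sketch's fixed storage. */
static void demo_queue_init(struct demo_drop_queue *q, size_t limit)
{
	assert(q != NULL);
	q->head = 0;
	q->count = 0;
	q->tail_drops = 0;
	q->limit = limit < DEMO_MAX_SLOTS ? limit : DEMO_MAX_SLOTS;
}

/* Returns 1 if queued, 0 if the queue was full (counted as a tail drop). */
static int demo_enqueue(struct demo_drop_queue *q, uint32_t item)
{
	assert(q != NULL);
	if (q->count >= q->limit) {
		q->tail_drops++;
		return 0;
	}
	q->items[(q->head + q->count) % DEMO_MAX_SLOTS] = item;
	q->count++;
	return 1;
}

/* Returns 1 and stores the oldest item in *out, or 0 if the queue is empty. */
static int demo_dequeue(struct demo_drop_queue *q, uint32_t *out)
{
	assert(q != NULL && out != NULL);
	if (q->count == 0)
		return 0;
	*out = q->items[q->head];
	q->head = (q->head + 1) % DEMO_MAX_SLOTS;
	q->count--;
	return 1;
}
```

A `tail_drops` value that keeps growing suggests the consumer cannot keep up at the current limit, which is the signal the commit says users can use to choose a queue length.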
      drop_monitor: Make drop queue length configurable · 30328d46
      Ido Schimmel authored
      In packet alert mode, each CPU holds a list of dropped skbs that need to
      be processed in process context and sent to user space. To avoid
      exhausting the system's memory the maximum length of this queue is
      currently set to 1000.
      
      Allow users to tune the length of this queue according to their needs.
      The configured length is reported to user space when drop monitor
      configuration is queried.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30328d46
      drop_monitor: Add a command to query current configuration · 444be061
      Ido Schimmel authored
      Users should be able to query the current configuration of drop monitor
      before they start using it. Add a command to query the existing
      configuration which currently consists of alert mode and packet
      truncation length.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      444be061
      drop_monitor: Allow truncation of dropped packets · 57986617
      Ido Schimmel authored
      When sending dropped packets to user space it is not always necessary to
      copy the entire packet as usually only the headers are of interest.
      
      Allow users to specify the truncation length and add the original length
      of the packet as additional metadata to the netlink message.
      
      By default no truncation is performed.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      57986617
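The truncation rule from the commit above can be sketched as follows. The assumptions are that a truncation length of 0 means "no truncation" (the stated default) and that the report holds a fixed 256-byte payload buffer, which is a limit of this sketch only. `struct demo_report` and `demo_fill_report()` are hypothetical names, not the drop monitor netlink layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct demo_report {
	uint8_t payload[256]; /* possibly truncated copy of the packet */
	size_t payload_len;   /* number of bytes actually copied */
	size_t orig_len;      /* original packet length, kept as metadata */
};

/*
 * Copy at most trunc_len bytes of the packet into the report.
 * trunc_len == 0 is treated as "do not truncate" (assumption).
 */
static void demo_fill_report(struct demo_report *r, const uint8_t *pkt,
			     size_t len, size_t trunc_len)
{
	size_t copy = len;

	assert(r != NULL && pkt != NULL);
	if (trunc_len != 0 && trunc_len < copy)
		copy = trunc_len;
	if (copy > sizeof(r->payload)) /* storage bound of this sketch */
		copy = sizeof(r->payload);

	memcpy(r->payload, pkt, copy);
	r->payload_len = copy;
	r->orig_len = len;
}
```

For a 100-byte packet and a truncation length of 14, the report carries 14 payload bytes and an original length of 100, so a consumer can tell that the payload was cut short.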
      drop_monitor: Add packet alert mode · ca30707d
      Ido Schimmel authored
      So far drop monitor supported only one alert mode in which a summary of
      locations in which packets were recently dropped was sent to user space.
      
      This alert mode is sufficient in order to understand that packets were
      dropped, but lacks information to perform a more detailed analysis.
      
      Add a new alert mode in which the dropped packet itself is passed to
      user space along with metadata: The drop location (as program counter
      and resolved symbol), ingress netdevice and drop timestamp. More
      metadata can be added in the future.
      
      To avoid performing expensive operations in the context in which
      kfree_skb() is invoked (can be hard IRQ), the dropped skb is cloned and
      queued on per-CPU skb drop list. Then, in process context the netlink
      message is allocated, prepared and finally sent to user space.
      
      The per-CPU skb drop list is limited to 1000 skbs to prevent exhausting
      the system's memory. Subsequent patches will make this limit
      configurable and also add a counter that indicates how many skbs were
      tail dropped.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca30707d