1. 29 Mar, 2018 14 commits
  2. 28 Mar, 2018 12 commits
    • Merge branch 'bpf-raw-tracepoints' · f6ef5658
      Daniel Borkmann authored
      Alexei Starovoitov says:
      
      ====================
      v7->v8:
      - moved 'u32 num_args' from 'struct tracepoint' into 'struct bpf_raw_event_map'
        that increases memory overhead, but can be optimized/compressed later.
        Now it's zero changes in tracepoint.[ch]
      
      v6->v7:
      - adopted Steven's bpf_raw_tp_map section approach to find tracepoint
        and corresponding bpf probe function instead of kallsyms approach.
        dropped kernel_tracepoint_find_by_name() patch
      
      v5->v6:
      - avoid changing semantics of for_each_kernel_tracepoint() function, instead
        introduce kernel_tracepoint_find_by_name() helper
      
      v4->v5:
      - adopted Daniel's fancy REPEAT macro in bpf_trace.c in patch 6
      
      v3->v4:
      - adopted Linus's CAST_TO_U64 macro to cast any integer, pointer, or small
        struct to u64. That nicely reduced the size of patch 1
      
      v2->v3:
      - with Linus's suggestion introduced generic COUNT_ARGS and CONCATENATE macros
        (or rather moved them from apparmor)
        that cleaned up patch 6
      - added patch 4 to refactor trace_iwlwifi_dev_ucode_error() from 17 args to 4
        Now any tracepoint with >12 args will have build error
      
      v1->v2:
      - simplified api by combining bpf_raw_tp_open(name) + bpf_attach(prog_fd) into
        bpf_raw_tp_open(name, prog_fd) as suggested by Daniel.
        That simplifies bpf_detach as well which is now simple close() of fd.
      - fixed memory leak in error path which was spotted by Daniel.
      - fixed bpf_get_stackid(), bpf_perf_event_output() called from raw tracepoints
      - added more tests
      - fixed allyesconfig build caught by buildbot
      
      v1:
      This patch set is a different way to address the pressing need to access
      task_struct pointers in sched tracepoints from bpf programs.
      
      The first approach simply added these pointers to sched tracepoints:
      https://lkml.org/lkml/2017/12/14/753
      which Peter nacked.
      A few options were discussed and eventually the discussion converged on
      doing bpf specific tracepoint_probe_register() probe functions.
      Details here:
      https://lkml.org/lkml/2017/12/20/929
      
      Patch 1 is kernel wide cleanup of pass-struct-by-value into
      pass-struct-by-reference into tracepoints.
      
      Patches 2 and 3 are minor cleanups to address allyesconfig build
      
      Patch 4 refactors trace_iwlwifi_dev_ucode_error from 17 to 4 args
      
      Patch 5 introduces COUNT_ARGS macro
      
      Patch 6 introduces BPF_RAW_TRACEPOINT api.
      Auto-cleanup and multiple concurrent users are must-have
      features of a tracing api. For bpf raw tracepoints it looks like:
        // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
        prog_fd = bpf_prog_load(...);
      
        // receive anon_inode fd for given bpf_raw_tracepoint
        // and attach bpf program to it
        raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool will automatically
      detach bpf program, unload it and unregister tracepoint probe.
      More details in patch 6.
      
      Patch 7 - trivial support in libbpf
      Patches 8, 9 - user space tests
      
      samples/bpf/test_overhead performance on 1 cpu:
      
      tracepoint    base  kprobe+bpf tracepoint+bpf raw_tracepoint+bpf
      task_rename   1.1M   769K        947K            1.0M
      urandom_read  789K   697K        750K            755K
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • selftests/bpf: test for bpf_get_stackid() from raw tracepoints · 3bbe0869
      Alexei Starovoitov authored
      similar to the traditional tracepoint test, add bpf_get_stackid() test
      from raw tracepoints
      and reduce verbosity of existing stackmap test
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: raw tracepoint test · 4662a4e5
      Alexei Starovoitov authored
      add empty raw_tracepoint bpf program to test overhead similar
      to kprobe and traditional tracepoint tests
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: add bpf_raw_tracepoint_open helper · a0fe3e57
      Alexei Starovoitov authored
      add bpf_raw_tracepoint_open(const char *name, int prog_fd) api to libbpf
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Alexei Starovoitov authored
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      From the bpf program's point of view, access to the arguments looks like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      kprobe+bpf infrastructure allows programs to access function arguments.
      This feature allows programs to access raw tracepoint arguments.
      
      Similar to proposed 'dynamic ftrace events' there are no abi guarantees
      as to what the tracepoint arguments are and what their meaning is.
      The program needs to type cast args properly and use bpf_probe_read()
      helper to access struct fields when argument is a pointer.
      
      For every tracepoint __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet is casting 32-bit 'act' field into 'u64'
      to pass into bpf_trace_run3(), while 'dev' and 'xdp' args are passed as-is.
      All ~500 __bpf_trace_*() functions are only 5-10 bytes long
      and in total this approach adds 7k bytes to .text.
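
The glue described above can be modelled in plain user-space C. This is only a sketch under stated assumptions: every `demo_*` name is hypothetical, the program is an ordinary function pointer rather than a loaded bpf program, and the context holds exactly three slots. It illustrates that pointer arguments pass through unchanged while the 32-bit 'act' is zero-extended to 64 bits before the program sees it.

```c
#include <stdint.h>

/* Hypothetical context: three raw u64 slots, one per tracepoint argument. */
struct demo_raw_tp_args {
	uint64_t args[3];
};

/* Stand-in for an attached program: it only sees the raw u64 slots. */
typedef uint64_t (*demo_prog_t)(struct demo_raw_tp_args *ctx);

/* Model of a three-argument run helper: pack the values and call the program. */
uint64_t demo_trace_run3(demo_prog_t prog, uint64_t a0, uint64_t a1, uint64_t a2)
{
	struct demo_raw_tp_args ctx = { .args = { a0, a1, a2 } };

	return prog(&ctx);
}

/*
 * Model of the per-tracepoint glue: 'dev' and 'xdp' are forwarded as-is,
 * the 32-bit 'act' is widened to u64 (zero-extension, no sign extension).
 */
uint64_t demo_trace_xdp_exception(demo_prog_t prog, const void *dev,
				  const void *xdp, uint32_t act)
{
	return demo_trace_run3(prog, (uint64_t)(uintptr_t)dev,
			       (uint64_t)(uintptr_t)xdp, (uint64_t)act);
}

/* Hypothetical programs: return the third and the first argument. */
uint64_t demo_prog_read_act(struct demo_raw_tp_args *ctx)
{
	return ctx->args[2];
}

uint64_t demo_prog_read_dev(struct demo_raw_tp_args *ctx)
{
	return ctx->args[0];
}
```

Calling `demo_trace_xdp_exception(demo_prog_read_act, &dev, &xdp, 2)` hands the value 2 to the program in `args[2]`.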
      
      This approach gives the lowest possible overhead
      while calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf are used at speeds of 1M+ events per second
      this is a valuable optimization.
      
      The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns anon_inode FD of 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool that uses this feature
      will automatically detach bpf program, unload it and
      unregister tracepoint probe.
      
      On the kernel side the __bpf_raw_tp_map section of pointers to
      tracepoint definition and to __bpf_trace_*() probe function is used
      to find a tracepoint with "xdp_exception" name and
      corresponding __bpf_trace_xdp_exception() probe function
      which are passed to tracepoint_probe_register() to connect probe
      with tracepoint.
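
The name lookup described above might be pictured roughly as follows. This is a user-space sketch, not the kernel code: the `demo_*` names are hypothetical, a static array stands in for the linker-built section, and each entry stores a name string where the kernel entry would point at the tracepoint definition. The `num_args` values are illustrative.

```c
#include <stddef.h>
#include <string.h>

typedef void (*demo_probe_fn)(void);

/* One entry per tracepoint: name, probe function, argument count. */
struct demo_raw_event_map {
	const char *tp_name;
	demo_probe_fn probe;
	unsigned int num_args;
};

/* Hypothetical probe functions, empty for the purpose of the sketch. */
void demo_probe_xdp_exception(void) {}
void demo_probe_task_rename(void) {}

/* Stand-in for the section contents. */
static const struct demo_raw_event_map demo_raw_tp_map[] = {
	{ "xdp_exception", demo_probe_xdp_exception, 3 },
	{ "task_rename", demo_probe_task_rename, 2 },
};

/* Linear search by tracepoint name; returns NULL when nothing matches. */
const struct demo_raw_event_map *demo_find_raw_tp(const char *name)
{
	size_t n = sizeof(demo_raw_tp_map) / sizeof(demo_raw_tp_map[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(demo_raw_tp_map[i].tp_name, name) == 0)
			return &demo_raw_tp_map[i];
	}
	return NULL;
}
```

A caller would take the returned entry's probe function and register it against the matching tracepoint; an unknown name yields NULL so the open request can fail cleanly.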
      
      Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) are permitted.
      Each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      __bpf_raw_tp_map section logic was contributed by Steven Rostedt
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • macro: introduce COUNT_ARGS() macro · cf14f27f
      Alexei Starovoitov authored
      move COUNT_ARGS() macro from apparmor to generic header and extend it
      to count till twelve.
      
      COUNT() was an alternative name for this logic, but it's used for
      different purpose in many other places.
      
      Similarly for CONCATENATE() macro.
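
For readers unfamiliar with the argument-counting trick, here is a simplified sketch in standard C. It is an illustration and not the kernel's definition: it counts 1 to 12 arguments and does not handle an empty list, and the `demo_run*` helpers and `DEMO_RUN` are hypothetical names used to show how the count can select a per-arity function.

```c
#include <stdint.h>

/* Pick the 13th argument; the caller pads the list so that it is the count. */
#define COUNT_ARGS_PICK(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, n, ...) n

/* Counts 1..12 arguments (simplified: an empty list is not supported). */
#define COUNT_ARGS(...) \
	COUNT_ARGS_PICK(__VA_ARGS__, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)

/* Two-level paste so that macro arguments are expanded before '##'. */
#define CONCAT_RAW(a, b) a##b
#define CONCATENATE(a, b) CONCAT_RAW(a, b)

/* Hypothetical per-arity handlers. */
uint64_t demo_run1(uint64_t a) { return a; }
uint64_t demo_run2(uint64_t a, uint64_t b) { return a + b; }
uint64_t demo_run3(uint64_t a, uint64_t b, uint64_t c) { return a + b + c; }

/* Dispatch on the argument count: DEMO_RUN(x, y) expands to demo_run2(x, y). */
#define DEMO_RUN(...) \
	CONCATENATE(demo_run, COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
```

With this sketch a thirteenth argument shifts the padding so that the picked token is the caller's own first argument, not a count, so the pasted name matches no handler. That is one way a limit of twelve can surface at build time.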
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/wireless/iwlwifi: fix iwlwifi_dev_ucode_error tracepoint · 4fe43c2c
      Alexei Starovoitov authored
      fix iwlwifi_dev_ucode_error tracepoint to pass pointer to a table
      instead of all 17 arguments by value.
      dvm/main.c and mvm/utils.c have 'struct iwl_error_event_table'
      defined with very similar yet subtly different fields and offsets.
      tracepoint is still common and using definition of 'struct iwl_error_event_table'
      from dvm/commands.h while copying fields.
      Long term this tracepoint probably should be split into two.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mac802154: disambiguate mac80215 vs mac802154 trace events · 14624a93
      Alexei Starovoitov authored
      two trace events defined with the same name and both unused.
      They conflict in allyesconfig build. Rename one of them.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mediatek: disambiguate mt76 vs mt7601u trace events · d992ee6c
      Alexei Starovoitov authored
      two trace events defined with the same name and both unused.
      They conflict in allyesconfig build. Rename one of them.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • treewide: remove large struct-pass-by-value from tracepoint arguments · c1055475
      Alexei Starovoitov authored
      - fix trace_hfi1_ctxt_info() to pass large struct by reference instead of by value
      - convert 'type array[]' tracepoint arguments into 'type *array',
        since compiler will warn that sizeof('type array[]') == sizeof('type *array')
        and the latter should be used instead
      
      The CAST_TO_U64 macro in the later patch will enforce that tracepoint
      arguments can only be integers, pointers, or less than 8 byte structures.
      Larger structures should be passed by reference.
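
One way such a cast can be built with GCC/Clang extensions is sketched below. This is a user-space illustration with hypothetical `DEMO_*` names, not the macro from the patch: it accepts values whose size is 1, 2, 4, or 8 bytes, copies them into an unsigned integer of the same size, and widens that to 64 bits. Any other size selects `void` as the destination type, so the build fails.

```c
#include <stdint.h>
#include <string.h>

/* Select an unsigned integer type of exactly 'size' bytes; void otherwise. */
#define DEMO_UINTTYPE(size) \
	__typeof__(__builtin_choose_expr((size) == 1, (uint8_t)1, \
		   __builtin_choose_expr((size) == 2, (uint16_t)2, \
		   __builtin_choose_expr((size) == 4, (uint32_t)3, \
		   __builtin_choose_expr((size) == 8, (uint64_t)4, \
					 (void)5)))))

/*
 * Copy the bytes of x into a same-sized unsigned integer, then widen it.
 * Works for integers, pointers and small structs; a 3-byte or 16-byte
 * struct makes 'demo_dst' a void variable, which does not compile.
 */
#define DEMO_CAST_TO_U64(x) ({ \
	__typeof__(x) demo_src = (x); \
	DEMO_UINTTYPE(sizeof(x)) demo_dst; \
	memcpy(&demo_dst, &demo_src, sizeof(demo_dst)); \
	(uint64_t)demo_dst; })
```

Because the value goes through an unsigned type of its own width, a negative 32-bit integer is zero-extended, not sign-extended.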
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Add sock_ops R/W access to ipv4 tos · 6f5c39fa
      Nikita V. Shirokov authored
      Sample usage for tos ...
      
        bpf_getsockopt(skops, SOL_IP, IP_TOS, &v, sizeof(v))
      
      ... where skops is a pointer to the ctx (struct bpf_sock_ops).
      Signed-off-by: Nikita V. Shirokov <tehnerd@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • samples/bpf: fix spelling mistake: "revieve" -> "receive" · 20cfb7a0
      Colin Ian King authored
      Trivial fix to spelling mistake in error message text
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  3. 27 Mar, 2018 1 commit
  4. 26 Mar, 2018 3 commits
  5. 23 Mar, 2018 10 commits
    • Merge branch 'bpf-print-insns-api' · 639a53da
      Daniel Borkmann authored
      Jiri Olsa says:
      
      ====================
      This patchset removes struct bpf_verifier_env argument
      from print_bpf_insn function (patch 1) and changes user
      space bpftool user to use it that way (patch 2).
      ====================
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpftool: Adjust to new print_bpf_insn interface · 337682ca
      Jiri Olsa authored
      Change bpftool to skip the removed struct bpf_verifier_env
      argument in print_bpf_insn. It was passed as NULL anyway.
      
      No functional change intended.
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Remove struct bpf_verifier_env argument from print_bpf_insn · abe08840
      Jiri Olsa authored
      We use print_bpf_insn in user space (bpftool and soon perf),
      so it'd be nice to keep it generic and strip it off the kernel
      struct bpf_verifier_env argument.
      
      This argument can be safely removed, because its users can
      use the struct bpf_insn_cbs::private_data to pass it.
      
      By changing the argument type we can no longer have a clean
      'verbose' alias to 'bpf_verifier_log_write' in verifier.c.
      Instead we're adding the 'verbose' cb_print callback and
      removing the alias.
      
      This way we have the new cb_print callback in place, and all
      the 'verbose(env, ...)' calls in verifier.c will cleanly
      cast to 'verbose(void *, ...)' so no other change is
      needed.
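
The callback-plus-private_data pattern described above could look roughly like this in isolation. It is a sketch with hypothetical `demo_*` names, not the kernel interface: the printer takes an opaque pointer, and each user decides what that pointer refers to. Here it is a small log buffer.

```c
#include <stdarg.h>
#include <stdio.h>

/* Print callback: first argument is opaque caller data, not a kernel type. */
typedef void (*demo_print_t)(void *private_data, const char *fmt, ...);

struct demo_insn_cbs {
	demo_print_t cb_print;
	void *private_data;
};

/* Hypothetical user state: a fixed-size, always NUL-terminated text log. */
struct demo_log {
	char buf[128];
	size_t len;
};

/* One possible cb_print: append formatted text to the log, truncating. */
void demo_log_print(void *private_data, const char *fmt, ...)
{
	struct demo_log *log = private_data;
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(log->buf + log->len, sizeof(log->buf) - log->len, fmt, ap);
	va_end(ap);
	if (n > 0) {
		size_t room = sizeof(log->buf) - log->len - 1;

		log->len += (size_t)n < room ? (size_t)n : room;
	}
}

/* Generic printer: knows only the callbacks, nothing about the log type. */
void demo_print_insn(const struct demo_insn_cbs *cbs, int dst, int imm)
{
	cbs->cb_print(cbs->private_data, "r%d = %d\n", dst, imm);
}
```

A verifier-like user would put its own environment pointer in `private_data`, while a command-line tool could pass NULL and print straight to stdout.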
      Signed-off-by: Jiri Olsa <jolsa@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • intel: add SPDX identifiers to all the Intel drivers · ae06c70b
      Jeff Kirsher authored
      Add the SPDX identifiers to all the Intel wired LAN driver files, as
      outlined in Documentation/process/license-rules.rst.
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bridge: Allow max MTU when multiple VLANs present · 419d14af
      Chas Williams authored
      If the bridge is allowing multiple VLANs, some VLANs may have
      different MTUs.  Instead of choosing the minimum MTU for the
      bridge interface, choose the maximum MTU of the bridge members.
      With this the user only needs to set a larger MTU on the member
      ports that are participating in the large MTU VLANS.
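
The selection rule described above can be sketched as a small standalone function. The names are hypothetical and this is not the bridge code: `vlan_enabled` stands for "the bridge is allowing multiple VLANs", and the 1500-byte fallback for a bridge without ports is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ETH_DATA_LEN 1500	/* assumed fallback when there are no ports */

/*
 * Pick the bridge MTU from its member ports: the maximum when multiple
 * VLANs are allowed, the minimum otherwise.
 */
unsigned int demo_bridge_mtu(const unsigned int *port_mtu, size_t nports,
			     bool vlan_enabled)
{
	unsigned int mtu;

	if (nports == 0)
		return DEMO_ETH_DATA_LEN;

	mtu = port_mtu[0];
	for (size_t i = 1; i < nports; i++) {
		bool better = vlan_enabled ? port_mtu[i] > mtu
					   : port_mtu[i] < mtu;

		if (better)
			mtu = port_mtu[i];
	}
	return mtu;
}
```

With member MTUs of 1500, 9000 and 1500, the sketch yields 9000 when VLANs are enabled and 1500 otherwise.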
      Signed-off-by: Chas Williams <3chas3@gmail.com>
      Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS · bda7fab5
      Jay Vosburgh authored
      The operstate update logic will leave an interface in the
      default UNKNOWN operstate if the interface carrier state never changes
      from the default carrier up state set at creation.  This includes the
      case of an explicit call to netif_carrier_on, as the carrier on to on
      transition has no effect on operstate.
      
      	This affects virtio-net for the case that the virtio peer does
      not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
      updates).  Without this feature, the virtio specification states that
      "the link should be assumed active," so, logically, the operstate should
      be UP instead of UNKNOWN.  This has impact on user space applications
      that use the operstate to make availability decisions for the interface.
      
      	Resolve this by changing the virtio probe logic slightly to call
      netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
      cases, and then the existing call to netif_carrier_on for the "without"
      case will cause an operstate transition.
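
The reasoning above can be checked against a toy state machine. This is a deliberately simplified model with hypothetical `demo_*` names: operstate changes only when the carrier flag actually flips, and a new device starts with carrier up and operstate UNKNOWN.

```c
#include <stdbool.h>

enum demo_operstate { DEMO_OPER_UNKNOWN, DEMO_OPER_DOWN, DEMO_OPER_UP };

struct demo_netdev {
	bool carrier_ok;
	enum demo_operstate operstate;
};

/* Defaults at creation: carrier up, operstate not yet determined. */
void demo_netdev_init(struct demo_netdev *dev)
{
	dev->carrier_ok = true;
	dev->operstate = DEMO_OPER_UNKNOWN;
}

/* on -> on is a no-op, so operstate is left untouched in that case. */
void demo_carrier_on(struct demo_netdev *dev)
{
	if (dev->carrier_ok)
		return;
	dev->carrier_ok = true;
	dev->operstate = DEMO_OPER_UP;
}

/* off -> off is likewise a no-op. */
void demo_carrier_off(struct demo_netdev *dev)
{
	if (!dev->carrier_ok)
		return;
	dev->carrier_ok = false;
	dev->operstate = DEMO_OPER_DOWN;
}
```

In this model, calling `demo_carrier_on()` right after init leaves the state at UNKNOWN, while `demo_carrier_off()` followed by `demo_carrier_on()` ends at UP, which mirrors the fix described in the message.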
      
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • devlink: Remove top_hierarchy arg for DEVLINK disabled path · e9de0018
      David Ahern authored
      Earlier change missed the path where CONFIG_NET_DEVLINK is disabled.
      Thanks to Jiri for spotting.
      
      Fixes: 14530746 ("devlink: Remove top_hierarchy arg to devlink_resource_register")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net · 03fe2deb
      David S. Miller authored
      Fun set of conflict resolutions here...
      
      For the mac80211 stuff, these were fortunately just parallel
      adds.  Trivially resolved.
      
      In drivers/net/phy/phy.c we had a bug fix in 'net' that moved the
      function phy_disable_interrupts() earlier in the file, whilst in
      'net-next' the phy_error() call from this function was removed.
      
      In net/ipv4/xfrm4_policy.c, David Ahern's changes to remove the
      'rt_table_id' member of rtable collided with a bug fix in 'net' that
      added a new struct member "rt_mtu_locked" which needs to be copied
      over here.
      
      The mlxsw driver conflict consisted of net-next separating
      the span code and definitions into separate files, whilst
      a 'net' bug fix made some changes to that moved code.
      
      The mlx5 infiniband conflict resolution was quite non-trivial,
      the RDMA tree's merge commit was used as a guide here, and
      here are their notes:
      
      ====================
      
          Due to bug fixes found by the syzkaller bot and taken into the for-rc
          branch after development for the 4.17 merge window had already started
          being taken into the for-next branch, there were fairly non-trivial
          merge issues that would need to be resolved between the for-rc branch
          and the for-next branch.  This merge resolves those conflicts and
          provides a unified base upon which ongoing development for 4.17 can
          be based.
      
          Conflicts:
                  drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f
                  (IB/mlx5: Fix cleanup order on unload) added to for-rc and
                  commit b5ca15ad (IB/mlx5: Add proper representors support)
                  added as part of the devel cycle both needed to modify the
                  init/de-init functions used by mlx5.  To support the new
                  representors, the new functions added by the cleanup patch
                  needed to be made non-static, and the init/de-init list
                  added by the representors patch needed to be modified to
                  match the init/de-init list changes made by the cleanup
                  patch.
          Updates:
                  drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
                  prototypes added by representors patch to reflect new function
                  names as changed by cleanup patch
                  drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
                  stage list to match new order from cleanup patch
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'akpm' (patches from Andrew) · f36b7534
      Linus Torvalds authored
      Merge misc fixes from Andrew Morton:
       "13 fixes"
      
      * emailed patches from Andrew Morton <akpm@linux-foundation.org>:
        mm, thp: do not cause memcg oom for thp
        mm/vmscan: wake up flushers for legacy cgroups too
        Revert "mm: page_alloc: skip over regions of invalid pfns where possible"
        mm/shmem: do not wait for lock_page() in shmem_unused_huge_shrink()
        mm/thp: do not wait for lock_page() in deferred_split_scan()
        mm/khugepaged.c: convert VM_BUG_ON() to collapse fail
        x86/mm: implement free pmd/pte page interfaces
        mm/vmalloc: add interfaces to free unmapped page table
        h8300: remove extraneous __BIG_ENDIAN definition
        hugetlbfs: check for pgoff value overflow
        lockdep: fix fs_reclaim warning
        MAINTAINERS: update Mark Fasheh's e-mail
        mm/mempolicy.c: avoid use uninitialized preferred_node
    • Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm · 8401c72c
      Linus Torvalds authored
      Pull libnvdimm fixes from Dan Williams:
       "Two regression fixes, two bug fixes for older issues, two fixes for
        new functionality added this cycle that have userspace ABI concerns,
        and a small cleanup. These have appeared in a linux-next release and
        have a build success report from the 0day robot.
      
         * The 4.16 rework of altmap handling led to some configurations
           leaking page table allocations due to freeing from the altmap
           reservation rather than the page allocator.
      
           The impact without the fix is leaked memory and a WARN() message
           when tearing down libnvdimm namespaces. The rework also missed a
           place where error handling code needed to be removed that can lead
           to a crash if devm_memremap_pages() fails.
      
         * acpi_map_pxm_to_node() had a latent bug whereby it could
           misidentify the closest online node to a given proximity domain.
      
         * Block integrity handling was reworked several kernels back to allow
           calling add_disk() after setting up the integrity profile.
      
           The nd_btt and nd_blk drivers are just now catching up to fix
           automatic partition detection at driver load time.
      
         * The new persistence_domain attribute, a platform indicator of
           whether cpu caches are powerfail protected for example, is meant to
           be a single value enum and not a set of flags.
      
           This oversight was caught while reviewing new userspace code in
           libndctl to communicate the attribute.
      
           Fix this new enabling up so that we are not stuck with an unwanted
           userspace ABI"
      
      * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
        libnvdimm, nfit: fix persistence domain reporting
        libnvdimm, region: hide persistence_domain when unknown
        acpi, numa: fix pxm to online numa node associations
        x86, memremap: fix altmap accounting at free
        libnvdimm: remove redundant assignment to pointer 'dev'
        libnvdimm, {btt, blk}: do integrity setup before add_disk()
        kernel/memremap: Remove stale devres_free() call