1. 25 May, 2018 40 commits
    • net_sched: switch to rcu_work · aaa908ff
      Cong Wang authored
      Commit 05f0fe6b ("RCU, workqueue: Implement rcu_work") introduces
      new APIs for dispatching work in an RCU callback. Now we can just
      switch to the new APIs for tc filters. This could get rid of a lot
      of code.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aaa908ff
    • Merge branch 'Mirroring-tests-involving-VLAN' · 1bb58d2d
      David S. Miller authored
      Petr Machata says:
      
      ====================
      Mirroring tests involving VLAN
      
      This patchset tests mirror-to-gretap with various underlay
      configurations involving VLAN netdevice in particular. Some of the tests
      involve bridges as well, but tests aimed specifically at testing bridges
      (i.e. FDB, STP) are not part of this patchset.
      
      In patches #1-#6, the codebase is adapted to support the new tests.
      
      In patch #7, a test for mirroring to VLAN is introduced.
      
      Patches #8-#10 add three tests where VLAN is part of underlay path after
      gretap encapsulation.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1bb58d2d
    • selftests: forwarding: Test mirror-to-gre w/ UL 802.1d+VLAN · 181d95f8
      Petr Machata authored
      Test for "tc action mirred egress mirror" that mirrors to GRE when the
      underlay route points at an 802.1d bridge and the packet egresses
      through a VLAN device.
      
      Besides testing basic connectivity, this also tests that the traffic is
      properly tagged.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      181d95f8
    • selftests: forwarding: Test mirror-to-gre w/ UL VLAN · a08fb9f1
      Petr Machata authored
      Test for "tc action mirred egress mirror" that mirrors to a gretap
      netdevice whose underlay route points at a vlan device.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a08fb9f1
    • selftests: forwarding: Test mirror-to-gre w/ UL VLAN+802.1q · 0056042f
      Petr Machata authored
      Test for "tc action mirred egress mirror" that mirrors to GRE when the
      underlay route points at a vlan device on top of a bridge device with
      vlan filtering (802.1q).
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0056042f
    • selftests: forwarding: Test mirror-to-vlan · 35388a6a
      Petr Machata authored
      Test for "tc action mirred egress mirror" that mirrors to a vlan device.
      - test_vlan() tests that the packets get mirrored
      - test_tagged_vlan() tests that the mirrored packets have the correct
        inner VLAN tag.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      35388a6a
    • selftests: forwarding: lib: Extract trap_{, un}install() · 87c0c046
      Petr Machata authored
      A mirror-to-vlan test that's coming next needs to install the trap
      unconditionally. Therefore extract from slow_path_trap_{,un}install()
      the more generic functions trap_install() and trap_uninstall(), and
      convert the former two to conditional wrappers around these.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87c0c046
    • selftests: forwarding: mirror_gre_lib: Support VLAN · 1893150f
      Petr Machata authored
      Add full_test_span_gre_dir_vlan_ips() and full_test_span_gre_dir_vlan()
      to support mirror-to-gre tests that involve VLAN.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1893150f
    • selftests: forwarding: lib: Support VLAN devices · 0e7a504c
      Petr Machata authored
      Add vlan_create() and vlan_destroy() to manage VLAN netdevices.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e7a504c
    • selftests: forwarding: Add $h3's clsact to mirror_topo_lib.sh · 91bac7f9
      Petr Machata authored
      Having a clsact qdisc on $h3 is useful in several tests, and will be
      useful in more tests to come. Move the registration from all the tests
      that need it into the topology file itself.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91bac7f9
    • selftests: forwarding: mirror_gre_lib: Extract generic functions · d5ea2bfc
      Petr Machata authored
      For non-GRE mirroring tests, functions along the lines of
      do_test_span_gre_dir_ips() and test_span_gre_dir_ips() are necessary,
      but such that they don't assume tunnels are involved. Extract the code
      from mirror_gre_lib.sh to mirror_lib.sh and convert it to just use a
      given device without assuming it's named "h3-$tundev". Convert the two
      above-mentioned functions to wrappers that pass along the correct device
      name.
      
      Add test_span_dir() and fail_test_span_dir() to round out the API for
      use by the following patches.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5ea2bfc
    • selftests: forwarding: Split mirror_gre_topo_lib.sh · 74ed089d
      Petr Machata authored
      Move generic parts of mirror_gre_topo_lib.sh into a new file
      mirror_topo_lib.sh. Reuse the functions in GRE topo, adding the tunnel
      devices as necessary.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74ed089d
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 90fed9c9
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2018-05-24
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).
      
      2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.
      
      3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.
      
      4) Jiong Wang adds support for indirect and arithmetic shifts to NFP.
      
      5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.
      
      6) Mathieu Xhonneux adds an End.BPF action to seg6local with BPF helpers allowing
         to edit/grow/shrink a SRH and apply on a packet generic SRv6 actions.
      
      7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.
      
      8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.
      
      9) Other misc fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90fed9c9
    • Merge branch 'ibmvnic-Failover-hardening' · 49a473f5
      David S. Miller authored
      Thomas Falcon says:
      
      ====================
      ibmvnic: Failover hardening
      
      Introduce additional transport event hardening to handle
      events during device reset. In the driver's current state,
      if a transport event is received during device reset, it can
      cause the device to become unresponsive as invalid operations
      are processed as the backing device context changes. After
      a transport event, the device expects a request to begin the
      initialization process. If the driver is still processing
      a previously queued device reset in this state, it is likely
      to fail as firmware will reject any commands other than the
      one to initialize the client driver's Command-Response Queue.
      
      Instead of failing and becoming dormant, the driver will make
      one more attempt to recover and continue operation. This is
      achieved by setting a state flag, which if true will direct
      the driver to clean up all allocated resources and perform
      a hard reset in an attempt to bring the driver back to an
      operational state.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49a473f5
    • ibmvnic: Introduce hard reset recovery · 2770a798
      Thomas Falcon authored
      Introduce a recovery hard reset to handle reset failure as a result of
      change of device context following a transport event, such as a
      backing device failover or partition migration. These operations reset
      the device context to its initial state. If this occurs during a reset,
      any initialization commands are likely to fail with an invalid state
      error as backing device firmware requests reinitialization.
      
      When this happens, make one more attempt by performing a hard reset,
      which frees any resources currently allocated and performs device
      initialization. If a transport event occurs during a device reset, a
      flag is set which will trigger a new hard reset following the
      completion of the current reset event.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2770a798
    • ibmvnic: Set resetting state at earliest possible point · 06e43d7f
      Thomas Falcon authored
      Set device resetting state at the earliest possible point: as soon as a
      reset is successfully scheduled. The reset state is toggled off when
      all resets have been processed to completion.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06e43d7f
    • ibmvnic: Create separate initialization routine for resets · 8a348450
      Thomas Falcon authored
      Instead of having one initialization routine for all cases, create
      a separate, simpler function for standard initialization, such as during
      device probe. Use the original initialization function to handle
      device reset scenarios. The goal of this patch is to avoid having
      a single, cluttered init function to handle all possible
      scenarios.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a348450
    • ibmvnic: Handle error case when setting link state · ab5ec33b
      Thomas Falcon authored
      If setting the link state is not successful, print a warning
      with the resulting return code and return it to be handled
      by the caller.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab5ec33b
    • ibmvnic: Return error code if init interrupted by transport event · 17c87058
      Thomas Falcon authored
      If device init is interrupted by a failover, set the init return
      code so that it can be checked and handled appropriately by the
      init routine.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17c87058
    • ibmvnic: Check CRQ command return codes · 9c4eaabd
      Thomas Falcon authored
      Check whether a CRQ command is successful before awaiting a response
      from the management partition. If the command was not successful, the
      driver may hang waiting for a response that will never come.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c4eaabd
    • ibmvnic: Introduce active CRQ state · 5153698e
      Thomas Falcon authored
      Introduce an "active" state for an IBM vNIC Command-Response Queue. A CRQ
      is considered active once it has initialized or linked with its partner by
      sending an initialization request and getting a successful response back
      from the management partition. Until this has happened, do not allow CRQ
      commands to be sent other than the initialization request.
      
      This change will avoid a protocol error in case of a device transport
      event occurring during initialization. When the driver receives a
      transport event notification indicating that the backing hardware
      has changed and needs reinitialization, any further commands other
      than the initialization handshake with the VIOS management partition
      will result in an invalid state error. Instead of sending a command
      that will be returned with an error, print a warning and return an
      error that will be handled by the caller.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5153698e
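The gating rule described in the commit above can be illustrated with a small toy model. This is only a sketch in Python, not the driver's C code: the class, method names, and the `-1` error value are hypothetical and chosen for illustration.

```python
from enum import Enum


class CrqCommand(Enum):
    INIT = "init"    # hypothetical name for the initialization request
    OTHER = "other"  # stands in for any non-initialization command


class Crq:
    """Toy model of a Command-Response Queue with an 'active' flag."""

    def __init__(self):
        self.active = False
        self.sent = []

    def send(self, cmd):
        # Until the init handshake has completed, only INIT may be sent.
        if not self.active and cmd is not CrqCommand.INIT:
            print("warning: CRQ inactive, refusing command")
            return -1  # illustrative error code, not the driver's actual value
        self.sent.append(cmd)
        return 0

    def on_init_response(self, ok):
        # A successful response from the partner marks the CRQ active.
        self.active = bool(ok)

    def on_transport_event(self):
        # The backing device changed, so the handshake must be repeated.
        self.active = False


crq = Crq()
rc_before = crq.send(CrqCommand.OTHER)  # rejected: handshake not done yet
crq.send(CrqCommand.INIT)
crq.on_init_response(ok=True)
rc_after = crq.send(CrqCommand.OTHER)   # accepted: CRQ is active
crq.on_transport_event()
rc_event = crq.send(CrqCommand.OTHER)   # rejected again after the event
```

The point of the design is that the rejection happens locally, before anything is sent, so the caller gets an error it can handle instead of a protocol-level invalid state error.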
    • ibmvnic: Mark NAPI flag as disabled when released · c3f22415
      Thomas Falcon authored
      Set the adapter NAPI state to disabled when the NAPI structures are
      removed. This allows them to be enabled again if they are reallocated
      during a hard reset.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3f22415
    • Merge branch 'gretap-mirroring-selftests' · 180f848b
      David S. Miller authored
      Petr Machata says:
      
      ====================
      selftests: forwarding: Additions to mirror-to-gretap tests
      
      This patchset is for a handful of edge cases in mirror-to-gretap
      scenarios: removal of mirrored-to netdevice (#1), removal of underlay
      route for tunnel remote endpoint (#2) and cessation of mirroring upon
      removal of flower mirroring rule (#3).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      180f848b
    • selftests: forwarding: Test removal of mirroring · a96d81a2
      Petr Machata authored
      Test that when a flower-based mirror action is removed, mirroring stops.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a96d81a2
    • selftests: forwarding: Test removal of underlay route · 77a8df38
      Petr Machata authored
      When the underlay route is removed, the mirrored traffic should not be
      forwarded.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77a8df38
    • selftests: forwarding: Test mirroring to deleted device · 6b45432d
      Petr Machata authored
      Test that the mirroring code catches up with deletion of a mirrored-to
      device.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b45432d
    • cxgb4: Check for kvzalloc allocation failure · d624613e
      YueHaibing authored
      t4_prep_fw doesn't check the card_fw pointer before storing the read
      data, which could lead to a NULL pointer dereference if kvzalloc failed.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d624613e
    • Merge branch 'xdp_xmit-bulking' · 10f67868
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      This patchset changes the ndo_xdp_xmit API to take a bulk of xdp frames.
      
      When the kernel is compiled with CONFIG_RETPOLINE, every indirect
      function pointer (branch) call hurts performance. For XDP this has a
      huge negative performance impact.
      
      This patchset reduces the needed (indirect) calls to ndo_xdp_xmit, but
      also prepares for further optimizations. The DMA API's use of indirect
      function pointer calls is the primary source of the regression. It is
      left for a followup patchset to use bulking calls towards the DMA API
      (via the scatter-gather calls).
      
      The other advantage of this API change is that drivers can more easily
      amortize the cost of any sync/locking scheme over the bulk of
      packets. The assumption of the current API is that the driver
      implementing the NDO will also allocate a dedicated XDP TX queue for
      every CPU in the system, which is not always possible or practical to
      configure. E.g. ixgbe cannot load an XDP program on a machine with
      more than 96 CPUs, due to limited hardware TX queues. E.g. virtio_net
      is hard to configure as it requires manually increasing the
      queues. E.g. the tun driver chooses to use a per XDP frame producer
      lock modulo smp_processor_id over avail queues.
      
      I've considered adding 'flags' to ndo_xdp_xmit, but it's not part of
      this patchset. This will be a followup patchset, once we know if this
      will be needed (e.g. for non-map xdp_redirect flush-flag, and if
      AF_XDP chooses to use ndo_xdp_xmit for TX).
      
      ---
      V5: Fixed up issues spotted by Daniel and John
      
      V4: Split out the patches from 4 to 8 patches. I cannot split the
      driver changes from the NDO change, but I've tried to isolate the NDO
      change together with the driver change as much as possible.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      10f67868
    • samples/bpf: xdp_monitor use err code from tracepoint xdp:xdp_devmap_xmit · a570e48f
      Jesper Dangaard Brouer authored
      Update xdp_monitor to use the recently added err code introduced
      in tracepoint xdp:xdp_devmap_xmit, to show if the drop count is
      caused by some driver general delivery problem. Other kinds of drops
      will likely just be more normal TX space issues.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a570e48f
    • xdp/trace: extend tracepoint in devmap with an err · e74de52e
      Jesper Dangaard Brouer authored
      Extending tracepoint xdp:xdp_devmap_xmit in devmap with an err code
      allows people to more easily identify why the ndo_xdp_xmit
      call to a given driver is failing.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e74de52e
    • xdp: change ndo_xdp_xmit API to support bulking · 735fc405
      Jesper Dangaard Brouer authored
      This patch changes the API for ndo_xdp_xmit to support bulking
      xdp_frames.
      
      When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge
      slowdown. Most of the slowdown is caused by DMA API indirect function
      calls, but the net_device->ndo_xdp_xmit() call also contributes.
      
      Benchmarking the patch with CONFIG_RETPOLINE, using xdp_redirect_map in
      a single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed improved
      performance:
       for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
       for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
      
      With frames avail as a bulk inside the driver ndo_xdp_xmit call,
      further optimizations are possible, like bulk DMA-mapping for TX.
      
      Testing without CONFIG_RETPOLINE shows the same performance for
      physical NIC drivers.
      
      The virtual NIC driver tun sees a huge performance boost, as it can
      avoid doing per frame producer locking, but instead amortize the
      locking cost over the bulk.
      
      V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
      V4: Isolated ndo, driver changes and callers.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      735fc405
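The benchmark deltas quoted in the commit above can be re-derived, and expressed as relative gains, with a few lines of Python. The pps figures are taken from the commit message; the helper name `improvement` is only illustrative.

```python
# Packets-per-second figures quoted in the commit message (before, after).
results = {
    "ixgbe": (6_042_682, 6_853_768),
    "i40e": (6_187_169, 6_724_519),
}


def improvement(before, after):
    """Return (absolute gain in pps, relative gain in percent)."""
    gain = after - before
    return gain, 100.0 * gain / before


for driver, (before, after) in results.items():
    gain, pct = improvement(before, after)
    print(f"{driver}: +{gain:,} pps ({pct:.1f}%)")
# ixgbe: +811,086 pps (13.4%)
# i40e: +537,350 pps (8.7%)
```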
    • xdp: introduce xdp_return_frame_rx_napi · 389ab7f0
      Jesper Dangaard Brouer authored
      When sending an xdp_frame through the xdp_do_redirect call, error
      cases can happen where the xdp_frame needs to be dropped, and
      returning an -errno code isn't sufficient/possible any longer
      (e.g. for the cpumap case). This is already fully supported, by simply
      calling xdp_return_frame.
      
      This patch is an optimization, which provides xdp_return_frame_rx_napi,
      a faster variant for these error cases. It takes advantage of
      the protection provided by XDP RX running under NAPI protection.
      
      This change is mostly relevant for drivers using the page_pool
      allocator as it can take advantage of this. (Tested with mlx5).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      389ab7f0
    • samples/bpf: xdp_monitor use tracepoint xdp:xdp_devmap_xmit · 9940fbf6
      Jesper Dangaard Brouer authored
      The xdp_monitor sample/tool is updated to use the new tracepoint
      xdp:xdp_devmap_xmit that the previous patch just introduced.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9940fbf6
    • xdp: add tracepoint for devmap like cpumap have · 38edddb8
      Jesper Dangaard Brouer authored
      Notice how this allows us to get XDP statistics without affecting the
      XDP performance, as the tracepoint is no longer activated on a per
      packet basis.
      
      V5: Spotted by John Fastabend.
       Fix 'sent' also counted 'drops' in this patch, a later patch corrected
       this, but it was a mistake in this intermediate step.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      38edddb8
    • bpf: devmap prepare xdp frames for bulking · 5d053f9d
      Jesper Dangaard Brouer authored
      Like cpumap, create a queue for xdp frames that will be bulked. For now,
      this patch simply invokes ndo_xdp_xmit for each frame. This happens
      either when the map flush operation is invoked, or when the limit
      DEV_MAP_BULK_SIZE is reached.
      
      V5: Avoid memleak on error path in dev_map_update_elem()
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5d053f9d
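The queueing behaviour described in the commit above (frames are collected, then transmitted one by one either on an explicit flush or when the bulk limit is hit) can be sketched as a toy model. This is Python, not the kernel's C implementation: the class and callback names are hypothetical, and the value 16 for `DEV_MAP_BULK_SIZE` is an assumption made for the example.

```python
DEV_MAP_BULK_SIZE = 16  # assumed value, for illustration only


class BulkQueue:
    """Toy per-device queue that batches frames before transmitting."""

    def __init__(self, xmit_one, bulk_size=DEV_MAP_BULK_SIZE):
        self.xmit_one = xmit_one  # stand-in for a per-frame ndo_xdp_xmit
        self.bulk_size = bulk_size
        self.frames = []
        self.flushes = 0

    def enqueue(self, frame):
        # Flush first if the queue is already full, then store the frame.
        if len(self.frames) == self.bulk_size:
            self.flush()
        self.frames.append(frame)

    def flush(self):
        # At this intermediate step each queued frame is still sent
        # individually; only the point of invocation is deferred.
        if not self.frames:
            return
        for frame in self.frames:
            self.xmit_one(frame)
        self.frames.clear()
        self.flushes += 1


sent = []
q = BulkQueue(sent.append)
for i in range(40):
    q.enqueue(i)
q.flush()  # models the explicit map flush operation
```

With 40 frames and a limit of 16, the model flushes twice because the limit was reached and once more on the explicit flush, and frame order is preserved.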
    • bpf: devmap introduce dev_map_enqueue · 67f29e07
      Jesper Dangaard Brouer authored
      Functionality is the same, but the ndo_xdp_xmit call is now
      simply invoked from inside the devmap.c code.
      
      V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
      
      V5: Cleanups requested by Daniel
       - Newlines before func definition
       - Use BUILD_BUG_ON checks
       - Remove unnecessary return value store in dev_map_enqueue
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      67f29e07
    • Merge branch 'bpf-task-fd-query' · f80acbd2
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Currently, suppose a userspace application has loaded a bpf program
      and attached it to a tracepoint/kprobe/uprobe, and a bpf
      introspection tool, e.g., bpftool, wants to show which bpf program
      is attached to which tracepoint/kprobe/uprobe. Such attachment
      information will be really useful to understand the overall bpf
      deployment in the system.
      
      There is a name field (16 bytes) for each program, which could
      be used to encode the attachment point. There are some drawbacks
      to this approach. First, a bpftool user (e.g., an admin) may not
      really understand the association between the name and the
      attachment point. Second, if one program is attached to multiple
      places, encoding a proper name which can imply all these
      attachments becomes difficult.
      
      This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
      Given a pid and fd, this command will return bpf related information
      to user space. Right now it only supports tracepoint/kprobe/uprobe
      perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
         . prog_id
         . tracepoint name, or
         . k[ret]probe funcname + offset or kernel addr, or
         . u[ret]probe filename + offset
      to the userspace.
      The user can use "bpftool prog" to find more information about
      bpf program itself with prog_id.
      
      Patch #1 adds function perf_get_event() in kernel/events/core.c.
      Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
      Patch #3 syncs tools bpf.h header and also adds bpf_task_fd_query()
      in the libbpf library for samples/selftests/bpftool to use.
      Patch #4 adds ksym_get_addr() utility function.
      Patch #5 adds a test in samples/bpf for querying k[ret]probes and
      u[ret]probes.
      Patch #6 adds a test in tools/testing/selftests/bpf for querying
      raw_tracepoint and tracepoint.
      Patch #7 adds a new subcommand "perf" to bpftool.
      
      Changelogs:
        v4 -> v5:
           . return strlen(buf) instead of strlen(buf) + 1
             in the attr.buf_len. As long as user provides
             non-empty buffer, it will be filled with empty
             string, truncated string, or full string
             based on the buffer size and the length of
             to-be-copied string.
        v3 -> v4:
           . made attr buf_len input/output. The length of
             actual buffer is written to buf_len so user space knows
             what is actually needed. If user provides a buffer
             with length >= 1 but less than required, do partial
             copy and return -ENOSPC.
           . code simplification with put_user.
           . changed query result attach_info to fd_type.
           . add tests at selftests/bpf to test zero len, null buf and
             insufficient buf.
        v2 -> v3:
           . made perf_get_event() return perf_event pointer const.
             this was to ensure that event fields are not meddled.
           . detect whether the new BPF_TASK_FD_QUERY is supported or
             not in "bpftool perf" and warn users if it is not.
        v1 -> v2:
           . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
             to BPF_TASK_FD_QUERY.
           . fixed various "bpftool perf" issues and added documentation
             and auto-completion.
      ====================
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f80acbd2
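The buf/buf_len behaviour described in the v3 -> v4 and v4 -> v5 changelog entries above can be modelled in a few lines. This is a user-space sketch in Python, not kernel code: the function name `copy_name` and its return tuple are hypothetical, and the handling of a zero-length buffer (report the length only, return success) is an assumption rather than something the changelog states.

```python
import errno


def copy_name(name, user_buf_len):
    """Model the buf/buf_len behaviour described in the changelog.

    Returns (ret, copied, buf_len_out):
      ret         0 on success, -ENOSPC if the buffer was too small
      copied      the bytes placed in the user buffer (NUL-terminated),
                  or None when no buffer space was provided
      buf_len_out length of the full string, excluding the trailing NUL
    """
    full = name.encode()
    buf_len_out = len(full)  # strlen(buf), not strlen(buf) + 1
    if user_buf_len == 0:
        # Assumption: nothing is copied, only the length is reported.
        return 0, None, buf_len_out
    if user_buf_len < len(full) + 1:
        # Partial copy: keep room for the terminating NUL byte.
        return -errno.ENOSPC, full[:user_buf_len - 1] + b"\0", buf_len_out
    return 0, full + b"\0", buf_len_out


ok = copy_name("sys_enter_nanosleep", 64)
short = copy_name("sys_enter_nanosleep", 4)
```

Because buf_len is written back even on -ENOSPC, a caller can retry with a buffer of `buf_len_out + 1` bytes.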
    • tools/bpftool: add perf subcommand · b04df400
      Yonghong Song authored
      The new command "bpftool perf [show | list]" will traverse
      all processes under /proc, and if any fd is associated
      with a perf event, it will print out related perf event
      information. Documentation is also added.
      
      Below is an example to show the results using bcc commands.
      Running the following 4 bcc commands:
        kprobe:     trace.py '__x64_sys_nanosleep'
        kretprobe:  trace.py 'r::__x64_sys_nanosleep'
        tracepoint: trace.py 't:syscalls:sys_enter_nanosleep'
        uprobe:     trace.py 'p:/home/yhs/a.out:main'
      
      The bpftool command line and result:
      
        $ bpftool perf
        pid 21711  fd 5: prog_id 5  kprobe  func __x64_sys_write  offset 0
        pid 21765  fd 5: prog_id 7  kretprobe  func __x64_sys_nanosleep  offset 0
        pid 21767  fd 5: prog_id 8  tracepoint  sys_enter_nanosleep
        pid 21800  fd 5: prog_id 9  uprobe  filename /home/yhs/a.out  offset 1159
      
        $ bpftool -j perf
        [{"pid":21711,"fd":5,"prog_id":5,"fd_type":"kprobe","func":"__x64_sys_write","offset":0}, \
         {"pid":21765,"fd":5,"prog_id":7,"fd_type":"kretprobe","func":"__x64_sys_nanosleep","offset":0}, \
         {"pid":21767,"fd":5,"prog_id":8,"fd_type":"tracepoint","tracepoint":"sys_enter_nanosleep"}, \
         {"pid":21800,"fd":5,"prog_id":9,"fd_type":"uprobe","filename":"/home/yhs/a.out","offset":1159}]
      
        $ bpftool prog
        5: kprobe  name probe___x64_sys  tag e495a0c82f2c7a8d  gpl
      	  loaded_at 2018-05-15T04:46:37-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 4
        7: kprobe  name probe___x64_sys  tag f2fdee479a503abf  gpl
      	  loaded_at 2018-05-15T04:48:32-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 7
        8: tracepoint  name tracepoint__sys  tag 5390badef2395fcf  gpl
      	  loaded_at 2018-05-15T04:48:48-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 8
        9: kprobe  name probe_main_1  tag 0a87bdc2e2953b6d  gpl
      	  loaded_at 2018-05-15T04:49:52-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 9
      
        $ ps ax | grep "python ./trace.py"
        21711 pts/0    T      0:03 python ./trace.py __x64_sys_write
        21765 pts/0    S+     0:00 python ./trace.py r::__x64_sys_nanosleep
        21767 pts/2    S+     0:00 python ./trace.py t:syscalls:sys_enter_nanosleep
        21800 pts/3    S+     0:00 python ./trace.py p:/home/yhs/a.out:main
        22374 pts/1    S+     0:00 grep --color=auto python ./trace.py
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b04df400
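The JSON form shown in the commit above is convenient for scripting. As a minimal sketch, the following Python parses the sample array from the commit message (with the line-continuation backslashes removed) and renders each entry roughly like the plain-text output. The helper name `describe` is hypothetical.

```python
import json

# JSON array copied from the "bpftool -j perf" example in the commit
# message, with the line-continuation backslashes removed.
raw = """
[{"pid":21711,"fd":5,"prog_id":5,"fd_type":"kprobe","func":"__x64_sys_write","offset":0},
 {"pid":21765,"fd":5,"prog_id":7,"fd_type":"kretprobe","func":"__x64_sys_nanosleep","offset":0},
 {"pid":21767,"fd":5,"prog_id":8,"fd_type":"tracepoint","tracepoint":"sys_enter_nanosleep"},
 {"pid":21800,"fd":5,"prog_id":9,"fd_type":"uprobe","filename":"/home/yhs/a.out","offset":1159}]
"""

entries = json.loads(raw)


def describe(entry):
    """Render one entry roughly like the plain-text bpftool output."""
    head = (f'pid {entry["pid"]}  fd {entry["fd"]}: '
            f'prog_id {entry["prog_id"]}  {entry["fd_type"]}')
    if "tracepoint" in entry:
        return f'{head}  {entry["tracepoint"]}'
    if "func" in entry:
        return f'{head}  func {entry["func"]}  offset {entry["offset"]}'
    return f'{head}  filename {entry["filename"]}  offset {entry["offset"]}'


for e in entries:
    print(describe(e))
```

The key set differs per `fd_type` (tracepoint entries carry no `offset`), so a consumer should branch on the keys present rather than assume a fixed schema.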
    • tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs · f699cf7a
      Yonghong Song authored
      The new tests are added to query perf_event information
      for raw_tracepoint and tracepoint attachment. For tracepoint,
      both syscalls and non-syscalls tracepoints are queried as
      they are treated slightly differently inside the kernel.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f699cf7a
    • samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY · ecb96f7f
      Yonghong Song authored
      This is mostly to test kprobe/uprobe, which need kernel headers.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ecb96f7f