1. 25 May, 2018 32 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · 90fed9c9
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      pull-request: bpf-next 2018-05-24
      
      The following pull-request contains BPF updates for your *net-next* tree.
      
      The main changes are:
      
      1) Björn Töpel cleans up AF_XDP (removes rebind, explicit cache alignment from uapi, etc).
      
      2) David Ahern adds mtu checks to bpf_ipv{4,6}_fib_lookup() helpers.
      
      3) Jesper Dangaard Brouer adds bulking support to ndo_xdp_xmit.
      
      4) Jiong Wang adds support for indirect and arithmetic shifts to NFP.
      
      5) Martin KaFai Lau cleans up BTF uapi and makes the btf_header extensible.
      
      6) Mathieu Xhonneux adds an End.BPF action to seg6local, with BPF helpers that allow
         an SRH to be edited/grown/shrunk and generic SRv6 actions to be applied to a packet.
      
      7) Sandipan Das adds support for bpf2bpf function calls in ppc64 JIT.
      
      8) Yonghong Song adds BPF_TASK_FD_QUERY command for introspection of tracing events.
      
      9) Other miscellaneous fixes from Gustavo A. R. Silva, Sirio Balmelli, John Fastabend, and Magnus Karlsson.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90fed9c9
    • Merge branch 'ibmvnic-Failover-hardening' · 49a473f5
      David S. Miller authored
      Thomas Falcon says:
      
      ====================
      ibmvnic: Failover hardening
      
      Introduce additional transport event hardening to handle
      events during device reset. In the driver's current state,
      if a transport event is received during device reset, it can
      cause the device to become unresponsive as invalid operations
      are processed as the backing device context changes. After
      a transport event, the device expects a request to begin the
      initialization process. If the driver is still processing
      a previously queued device reset in this state, it is likely
      to fail as firmware will reject any commands other than the
      one to initialize the client driver's Command-Response Queue.
      
      Instead of failing and becoming dormant, the driver will make
      one more attempt to recover and continue operation. This is
      achieved by setting a state flag, which if true will direct
      the driver to clean up all allocated resources and perform
      a hard reset in an attempt to bring the driver back to an
      operational state.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49a473f5
    • ibmvnic: Introduce hard reset recovery · 2770a798
      Thomas Falcon authored
      Introduce a recovery hard reset to handle reset failure as a result of a
      change of device context following a transport event, such as a
      backing device failover or partition migration. These operations reset
      the device context to its initial state. If this occurs during a reset,
      any initialization commands are likely to fail with an invalid state
      error as backing device firmware requests reinitialization.
      
      When this happens, make one more attempt by performing a hard reset,
      which frees any resources currently allocated and performs device
      initialization. If a transport event occurs during a device reset, a
      flag is set which will trigger a new hard reset following the
      completion of the current reset event.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2770a798
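      A simplified, hypothetical sketch of the flag-based recovery flow described
      above (the field and helper names are illustrative, not the actual ibmvnic code):

        /* Hypothetical reset dispatch: if a transport event set the recovery
         * flag, free everything and re-initialize from scratch. */
        static int reset_or_recover(struct ibmvnic_adapter *adapter)
        {
                if (adapter->force_reset_recovery) {      /* set on transport event */
                        adapter->force_reset_recovery = false;
                        release_resources(adapter);       /* free all allocations */
                        return do_hard_reset(adapter);    /* full re-initialization */
                }
                return do_reset(adapter);                 /* normal reset path */
        }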
    • ibmvnic: Set resetting state at earliest possible point · 06e43d7f
      Thomas Falcon authored
      Set device resetting state at the earliest possible point: as soon as a
      reset is successfully scheduled. The reset state is toggled off when
      all resets have been processed to completion.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06e43d7f
    • ibmvnic: Create separate initialization routine for resets · 8a348450
      Thomas Falcon authored
      Instead of having one initialization routine for all cases, create
      a separate, simpler function for standard initialization, such as during
      device probe. Use the original initialization function to handle
      device reset scenarios. The goal of this patch is to avoid having
      a single, cluttered init function to handle all possible
      scenarios.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8a348450
    • ibmvnic: Handle error case when setting link state · ab5ec33b
      Thomas Falcon authored
      If setting the link state is not successful, print a warning
      with the resulting return code and return it to be handled
      by the caller.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab5ec33b
    • ibmvnic: Return error code if init interrupted by transport event · 17c87058
      Thomas Falcon authored
      If device init is interrupted by a failover, set the init return
      code so that it can be checked and handled appropriately by the
      init routine.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      17c87058
    • ibmvnic: Check CRQ command return codes · 9c4eaabd
      Thomas Falcon authored
      Check whether CRQ command is successful before awaiting a response
      from the management partition. If the command was not successful, the
      driver may hang waiting for a response that will never come.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9c4eaabd
    • ibmvnic: Introduce active CRQ state · 5153698e
      Thomas Falcon authored
      Introduce an "active" state for a IBM vNIC Command-Response Queue. A CRQ
      is considered active once it has initialized or linked with its partner by
      sending an initialization request and getting a successful response back
      from the management partition.  Until this has happened, do not allow CRQ
      commands to be sent other than the initialization request.
      
      This change will avoid a protocol error in case of a device transport
      event occurring during initialization. When the driver receives a
      transport event notification indicating that the backing hardware
      has changed and needs reinitialization, any further commands other
      than the initialization handshake with the VIOS management partition
      will result in an invalid state error. Instead of sending a command
      that will be returned with an error, print a warning and return an
      error that will be handled by the caller.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5153698e
    • ibmvnic: Mark NAPI flag as disabled when released · c3f22415
      Thomas Falcon authored
      Mark the adapter NAPI structures as disabled when they are removed. This will allow
      them to be enabled again if they are reallocated in case of a hard reset.
      Signed-off-by: Thomas Falcon <tlfalcon@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c3f22415
    • Merge branch 'gretap-mirroring-selftests' · 180f848b
      David S. Miller authored
      Petr Machata says:
      
      ====================
      selftests: forwarding: Additions to mirror-to-gretap tests
      
      This patchset is for a handful of edge cases in mirror-to-gretap
      scenarios: removal of mirrored-to netdevice (#1), removal of underlay
      route for tunnel remote endpoint (#2) and cessation of mirroring upon
      removal of flower mirroring rule (#3).
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      180f848b
    • selftests: forwarding: Test removal of mirroring · a96d81a2
      Petr Machata authored
      Test that when flower-based mirror action is removed, mirroring stops.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a96d81a2
    • selftests: forwarding: Test removal of underlay route · 77a8df38
      Petr Machata authored
      When the underlay route is removed, the mirrored traffic should not be
      forwarded.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      77a8df38
    • selftests: forwarding: Test mirroring to deleted device · 6b45432d
      Petr Machata authored
      Tests that the mirroring code catches up with deletion of a mirrored-to
      device.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b45432d
    • cxgb4: Check for kvzalloc allocation failure · d624613e
      YueHaibing authored
      t4_prep_fw doesn't check the card_fw pointer before storing the read data,
      which could lead to a NULL pointer dereference if kvzalloc fails.
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d624613e
    • Merge branch 'xdp_xmit-bulking' · 10f67868
      Alexei Starovoitov authored
      Jesper Dangaard Brouer says:
      
      ====================
      This patchset changes the ndo_xdp_xmit API to take a bulk of xdp frames.
      
      When the kernel is compiled with CONFIG_RETPOLINE, every indirect function
      pointer (branch) call hurts performance. For XDP this has a huge
      negative performance impact.
      
      This patchset reduces the needed (indirect) calls to ndo_xdp_xmit, but
      also prepares for further optimizations.  The DMA API's use of indirect
      function pointer calls is the primary source of the regression.  It is
      left for a follow-up patchset to use bulking calls towards the DMA API
      (via the scatter-gather calls).
      
      The other advantage of this API change is that drivers can more easily
      amortize the cost of any sync/locking scheme over the bulk of
      packets.  The assumption of the current API is that the driver
      implementing the NDO will also allocate a dedicated XDP TX queue for
      every CPU in the system, which is not always possible or practical to
      configure. E.g. ixgbe cannot load an XDP program on a machine with
      more than 96 CPUs, due to limited hardware TX queues.  E.g. virtio_net
      is hard to configure as it requires manually increasing the
      queues. E.g. the tun driver chooses to use a per-XDP-frame producer lock,
      taken modulo smp_processor_id over the available queues.
      
      I've considered adding 'flags' to ndo_xdp_xmit, but it's not part of
      this patchset.  This will be a follow-up patchset, once we know if this
      will be needed (e.g. for a non-map xdp_redirect flush-flag, and if
      AF_XDP chooses to use ndo_xdp_xmit for TX).
      
      ---
      V5: Fixed up issues spotted by Daniel and John
      
      V4: Split the series out from 4 to 8 patches.  I cannot split the
      driver changes from the NDO change, but I've tried to isolate the NDO
      change together with the driver change as much as possible.
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      10f67868
    • samples/bpf: xdp_monitor use err code from tracepoint xdp:xdp_devmap_xmit · a570e48f
      Jesper Dangaard Brouer authored
      Update xdp_monitor to use the recently added err code introduced
      in tracepoint xdp:xdp_devmap_xmit, to show if the drop count is
      caused by some general driver delivery problem.  Other kinds of drops
      will likely just be more normal TX space issues.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      a570e48f
    • xdp/trace: extend tracepoint in devmap with an err · e74de52e
      Jesper Dangaard Brouer authored
      Extending tracepoint xdp:xdp_devmap_xmit in devmap with an err code
      allows people to more easily identify why the ndo_xdp_xmit
      call to a given driver is failing.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      e74de52e
    • xdp: change ndo_xdp_xmit API to support bulking · 735fc405
      Jesper Dangaard Brouer authored
      This patch changes the API for ndo_xdp_xmit to support bulking
      xdp_frames.
      
      When the kernel is compiled with CONFIG_RETPOLINE, XDP sees a huge slowdown.
      Most of the slowdown is caused by the DMA API's indirect function calls, but
      also by the net_device->ndo_xdp_xmit() call.
      
      Benchmarking the patch with CONFIG_RETPOLINE, using xdp_redirect_map with a
      single flow/core test (CPU E5-1650 v4 @ 3.60GHz), showed that
      performance improved:
       for driver ixgbe: 6,042,682 pps -> 6,853,768 pps = +811,086 pps
       for driver i40e : 6,187,169 pps -> 6,724,519 pps = +537,350 pps
      
      With frames available as a bulk inside the driver ndo_xdp_xmit call,
      further optimizations are possible, like bulk DMA-mapping for TX.
      
      Testing without CONFIG_RETPOLINE shows the same performance for
      physical NIC drivers.
      
      The virtual NIC driver tun sees a huge performance boost, as it can
      avoid doing per frame producer locking, but instead amortize the
      locking cost over the bulk.
      
      V2: Fix compile errors reported by kbuild test robot <lkp@intel.com>
      V4: Isolated ndo, driver changes and callers.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      735fc405
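      A rough sketch of the shape of the API change (paraphrased, not copied
      from the tree; see the actual netdev header for the authoritative prototype):

        struct net_device;
        struct xdp_frame;

        /* before: one indirect call per xdp_frame */
        typedef int (*ndo_xdp_xmit_old_t)(struct net_device *dev,
                                          struct xdp_frame *frame);

        /* after: one indirect call per bulk of frames; the return value is
         * the number of frames the driver accepted for transmission */
        typedef int (*ndo_xdp_xmit_bulk_t)(struct net_device *dev, int n,
                                           struct xdp_frame **frames);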
    • xdp: introduce xdp_return_frame_rx_napi · 389ab7f0
      Jesper Dangaard Brouer authored
      When sending an xdp_frame through an xdp_do_redirect call, error
      cases can happen where the xdp_frame needs to be dropped, and
      returning an -errno code isn't sufficient/possible any longer
      (e.g. for the cpumap case). This is already fully supported, by simply
      calling xdp_return_frame.
      
      This patch is an optimization, which provides xdp_return_frame_rx_napi,
      which is a faster variant for these error cases.  It takes advantage of
      the protection provided by XDP RX running under NAPI protection.
      
      This change is mostly relevant for drivers using the page_pool
      allocator as it can take advantage of this. (Tested with mlx5).
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      389ab7f0
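      A hypothetical driver error path illustrating where the new variant fits
      (the helper names are made up; ndo_xdp_xmit runs under NAPI/softirq
      protection, which is what makes the cheaper return variant safe here):

        static int hypothetical_ndo_xdp_xmit(struct net_device *dev, int n,
                                             struct xdp_frame **frames)
        {
                int i, sent = 0;

                for (i = 0; i < n; i++) {
                        if (!tx_ring_has_room(dev)) {           /* made-up check */
                                xdp_return_frame_rx_napi(frames[i]);
                                continue;                       /* counts as a drop */
                        }
                        /* ... DMA-map frames[i] and post a TX descriptor ... */
                        sent++;
                }
                return sent;
        }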
    • samples/bpf: xdp_monitor use tracepoint xdp:xdp_devmap_xmit · 9940fbf6
      Jesper Dangaard Brouer authored
      The xdp_monitor sample/tool is updated to use the new tracepoint
      xdp:xdp_devmap_xmit that the previous patch just introduced.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      9940fbf6
    • xdp: add tracepoint for devmap like cpumap have · 38edddb8
      Jesper Dangaard Brouer authored
      Notice how this allows us to get XDP statistics without affecting the XDP
      performance, as the tracepoint is no longer activated on a per-packet basis.
      
      V5: Spotted by John Fastabend.
       Fix: 'sent' also counted 'drops' in this patch; a later patch corrected
       this, but it was a mistake in this intermediate step.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      38edddb8
    • bpf: devmap prepare xdp frames for bulking · 5d053f9d
      Jesper Dangaard Brouer authored
      Like cpumap, create a queue for xdp frames that will be bulked.  For now,
      this patch simply invokes ndo_xdp_xmit for each frame.  This happens
      either when the map flush operation is invoked, or when the limit
      DEV_MAP_BULK_SIZE is reached.
      
      V5: Avoid memleak on error path in dev_map_update_elem()
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      5d053f9d
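      An illustrative-only sketch of the bulking idea (names and layout are
      simplified, not the actual devmap internals):

        #define DEV_MAP_BULK_SIZE 16

        struct xdp_bulk_queue {
                struct xdp_frame *q[DEV_MAP_BULK_SIZE];
                unsigned int count;
        };

        /* flush: one ndo_xdp_xmit call for the whole staged bulk */
        static void bq_flush(struct net_device *dev, struct xdp_bulk_queue *bq)
        {
                if (bq->count)
                        dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q);
                bq->count = 0;
        }

        /* enqueue: stage frames until the bulk is full or the map is flushed */
        static void bq_enqueue(struct net_device *dev, struct xdp_bulk_queue *bq,
                               struct xdp_frame *frame)
        {
                if (bq->count == DEV_MAP_BULK_SIZE)
                        bq_flush(dev, bq);
                bq->q[bq->count++] = frame;
        }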
    • bpf: devmap introduce dev_map_enqueue · 67f29e07
      Jesper Dangaard Brouer authored
      Functionality is the same, but the ndo_xdp_xmit call is now
      simply invoked from inside the devmap.c code.
      
      V2: Fix compile issue reported by kbuild test robot <lkp@intel.com>
      
      V5: Cleanups requested by Daniel
       - Newlines before func definition
       - Use BUILD_BUG_ON checks
       - Remove unnecessary return value store in dev_map_enqueue
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      67f29e07
    • Merge branch 'bpf-task-fd-query' · f80acbd2
      Alexei Starovoitov authored
      Yonghong Song says:
      
      ====================
      Currently, suppose a userspace application has loaded a bpf program
      and attached it to a tracepoint/kprobe/uprobe, and a bpf
      introspection tool, e.g., bpftool, wants to show which bpf program
      is attached to which tracepoint/kprobe/uprobe. Such attachment
      information will be really useful for understanding the overall bpf
      deployment in the system.
      
      There is a name field (16 bytes) for each program, which could
      be used to encode the attachment point. There are some drawbacks
      to this approach. First, a bpftool user (e.g., an admin) may not
      really understand the association between the name and the
      attachment point. Second, if one program is attached to multiple
      places, encoding a proper name which can imply all these
      attachments becomes difficult.
      
      This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
      Given a pid and fd, this command will return bpf related information
      to user space. Right now it only supports tracepoint/kprobe/uprobe
      perf event fd's. For such a fd, BPF_TASK_FD_QUERY will return
         . prog_id
         . tracepoint name, or
         . k[ret]probe funcname + offset or kernel addr, or
         . u[ret]probe filename + offset
      to the userspace.
      The user can use "bpftool prog" to find more information about
      bpf program itself with prog_id.
      
      Patch #1 adds function perf_get_event() in kernel/events/core.c.
      Patch #2 implements the bpf subcommand BPF_TASK_FD_QUERY.
      Patch #3 syncs the tools bpf.h header and also adds bpf_task_fd_query()
      in the libbpf library for samples/selftests/bpftool to use.
      Patch #4 adds the ksym_get_addr() utility function.
      Patch #5 adds a test in samples/bpf for querying k[ret]probes and
      u[ret]probes.
      Patch #6 adds a test in tools/testing/selftests/bpf for querying
      raw_tracepoint and tracepoint.
      Patch #7 adds a new subcommand "perf" to bpftool.
      
      Changelogs:
        v4 -> v5:
           . return strlen(buf) instead of strlen(buf) + 1
              in the attr.buf_len. As long as the user provides a
              non-empty buffer, it will be filled with an empty
              string, a truncated string, or the full string,
              based on the buffer size and the length of the
              to-be-copied string.
        v3 -> v4:
           . made attr buf_len input/output. The length of the
             actual buffer is written to buf_len so user space knows
             what is actually needed. If the user provides a buffer
             with length >= 1 but less than required, do a partial
             copy and return -ENOSPC.
           . code simplification with put_user.
           . changed query result attach_info to fd_type.
           . add tests at selftests/bpf to test zero len, null buf and
             insufficient buf.
        v2 -> v3:
           . made perf_get_event() return a const perf_event pointer.
             this was to ensure that event fields are not meddled with.
           . detect whether the new BPF_TASK_FD_QUERY is supported or
             not in "bpftool perf" and warn users if it is not.
        v1 -> v2:
           . changed bpf subcommand name from BPF_PERF_EVENT_QUERY
             to BPF_TASK_FD_QUERY.
           . fixed various "bpftool perf" issues and added documentation
             and auto-completion.
      ====================
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f80acbd2
    • tools/bpftool: add perf subcommand · b04df400
      Yonghong Song authored
      The new command "bpftool perf [show | list]" will traverse
      all processes under /proc, and if any fd is associated
      with a perf event, it will print out related perf event
      information. Documentation is also added.
      
      Below is an example to show the results using bcc commands.
      Running the following 4 bcc commands:
        kprobe:     trace.py '__x64_sys_nanosleep'
        kretprobe:  trace.py 'r::__x64_sys_nanosleep'
        tracepoint: trace.py 't:syscalls:sys_enter_nanosleep'
        uprobe:     trace.py 'p:/home/yhs/a.out:main'
      
      The bpftool command line and result:
      
        $ bpftool perf
        pid 21711  fd 5: prog_id 5  kprobe  func __x64_sys_write  offset 0
        pid 21765  fd 5: prog_id 7  kretprobe  func __x64_sys_nanosleep  offset 0
        pid 21767  fd 5: prog_id 8  tracepoint  sys_enter_nanosleep
        pid 21800  fd 5: prog_id 9  uprobe  filename /home/yhs/a.out  offset 1159
      
        $ bpftool -j perf
        [{"pid":21711,"fd":5,"prog_id":5,"fd_type":"kprobe","func":"__x64_sys_write","offset":0}, \
         {"pid":21765,"fd":5,"prog_id":7,"fd_type":"kretprobe","func":"__x64_sys_nanosleep","offset":0}, \
         {"pid":21767,"fd":5,"prog_id":8,"fd_type":"tracepoint","tracepoint":"sys_enter_nanosleep"}, \
         {"pid":21800,"fd":5,"prog_id":9,"fd_type":"uprobe","filename":"/home/yhs/a.out","offset":1159}]
      
        $ bpftool prog
        5: kprobe  name probe___x64_sys  tag e495a0c82f2c7a8d  gpl
      	  loaded_at 2018-05-15T04:46:37-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 4
        7: kprobe  name probe___x64_sys  tag f2fdee479a503abf  gpl
      	  loaded_at 2018-05-15T04:48:32-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 7
        8: tracepoint  name tracepoint__sys  tag 5390badef2395fcf  gpl
      	  loaded_at 2018-05-15T04:48:48-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 8
        9: kprobe  name probe_main_1  tag 0a87bdc2e2953b6d  gpl
      	  loaded_at 2018-05-15T04:49:52-0700  uid 0
      	  xlated 200B  not jited  memlock 4096B  map_ids 9
      
        $ ps ax | grep "python ./trace.py"
        21711 pts/0    T      0:03 python ./trace.py __x64_sys_write
        21765 pts/0    S+     0:00 python ./trace.py r::__x64_sys_nanosleep
        21767 pts/2    S+     0:00 python ./trace.py t:syscalls:sys_enter_nanosleep
        21800 pts/3    S+     0:00 python ./trace.py p:/home/yhs/a.out:main
        22374 pts/1    S+     0:00 grep --color=auto python ./trace.py
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      b04df400
    • tools/bpf: add two BPF_TASK_FD_QUERY tests in test_progs · f699cf7a
      Yonghong Song authored
      The new tests are added to query perf_event information
      for raw_tracepoint and tracepoint attachment. For tracepoint,
      both syscall and non-syscall tracepoints are queried as
      they are treated slightly differently inside the kernel.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f699cf7a
    • samples/bpf: add a samples/bpf test for BPF_TASK_FD_QUERY · ecb96f7f
      Yonghong Song authored
      This is mostly to test kprobe/uprobe which needs kernel headers.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      ecb96f7f
    • tools/bpf: add ksym_get_addr() in trace_helpers · 73bc4d9f
      Yonghong Song authored
      Given a kernel function name, ksym_get_addr() will return the kernel
      address for this function, or 0 if it cannot find this function name
      in /proc/kallsyms. This function will be used later when a kernel
      address is used to initiate a kprobe perf event.
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      73bc4d9f
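      An illustrative user-space lookup with the same intent (the real
      trace_helpers implementation works on its preloaded symbol table rather
      than re-reading the file each time; this is only a sketch):

        #include <stdio.h>
        #include <string.h>

        static long lookup_kallsyms_addr(const char *func)   /* illustrative */
        {
                FILE *f = fopen("/proc/kallsyms", "r");
                char line[512], sym[256], type;
                unsigned long addr = 0, cur;

                if (!f)
                        return 0;
                while (fgets(line, sizeof(line), f)) {
                        if (sscanf(line, "%lx %c %255s", &cur, &type, sym) != 3)
                                continue;
                        if (!strcmp(sym, func)) {
                                addr = cur;
                                break;
                        }
                }
                fclose(f);
                return addr;   /* 0 if the symbol was not found */
        }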
    • tools/bpf: sync kernel header bpf.h and add bpf_task_fd_query in libbpf · 30687ad9
      Yonghong Song authored
      Sync kernel header bpf.h to tools/include/uapi/linux/bpf.h and
      implement bpf_task_fd_query() in libbpf. The test programs
      in samples/bpf and tools/testing/selftests/bpf, and later bpftool
      will use this libbpf function to query the kernel.
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      30687ad9
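      A hedged usage sketch of the new wrapper (argument order paraphrased from
      this series; check tools/lib/bpf/bpf.h for the exact prototype):

        #include <stdio.h>
        #include <linux/types.h>
        #include "bpf/bpf.h"          /* tools/lib/bpf/bpf.h in-tree */

        static void query_one(int pid, int fd)
        {
                char buf[256];
                __u32 buf_len = sizeof(buf), prog_id, fd_type;
                __u64 probe_offset, probe_addr;

                if (bpf_task_fd_query(pid, fd, 0, buf, &buf_len, &prog_id,
                                      &fd_type, &probe_offset, &probe_addr) == 0)
                        printf("pid %d fd %d: prog_id %u fd_type %u %s\n",
                               pid, fd, prog_id, fd_type, buf);
        }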
    • bpf: introduce bpf subcommand BPF_TASK_FD_QUERY · 41bdc4b4
      Yonghong Song authored
      Currently, suppose a userspace application has loaded a bpf program
      and attached it to a tracepoint/kprobe/uprobe, and a bpf
      introspection tool, e.g., bpftool, wants to show which bpf program
      is attached to which tracepoint/kprobe/uprobe. Such attachment
      information will be really useful for understanding the overall bpf
      deployment in the system.
      
      There is a name field (16 bytes) for each program, which could
      be used to encode the attachment point. There are some drawbacks
      to this approach. First, a bpftool user (e.g., an admin) may not
      really understand the association between the name and the
      attachment point. Second, if one program is attached to multiple
      places, encoding a proper name which can imply all these
      attachments becomes difficult.
      
      This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
      Given a pid and fd, if the <pid, fd> is associated with a
      tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
         . prog_id
         . tracepoint name, or
         . k[ret]probe funcname + offset or kernel addr, or
         . u[ret]probe filename + offset
      to the userspace.
      The user can use "bpftool prog" to find more information about
      bpf program itself with prog_id.
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      41bdc4b4
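      Roughly the shape of the request/reply area added to union bpf_attr for
      this command (field list paraphrased from the description above; the uapi
      header is authoritative):

        struct {                    /* used by BPF_TASK_FD_QUERY */
                __u32 pid;          /* input: target pid */
                __u32 fd;           /* input: perf event fd in that task */
                __u32 flags;        /* input: currently unused, must be 0 */
                __u32 buf_len;      /* in/out: buffer size / bytes needed */
                __u64 buf;          /* in/out: tp name, funcname or filename */
                __u32 prog_id;      /* output: id of the attached bpf prog */
                __u32 fd_type;      /* output: tracepoint/kprobe/uprobe/... */
                __u64 probe_offset; /* output: k[ret]probe/u[ret]probe offset */
                __u64 probe_addr;   /* output: kprobe kernel address */
        } task_fd_query;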
    • perf/core: add perf_get_event() to return perf_event given a struct file · f8d959a5
      Yonghong Song authored
      A new extern function, perf_get_event(), is added to return a perf event
      given a struct file. This function will be used in later patches.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      f8d959a5
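      Roughly the new interface (see include/linux/perf_event.h for the exact
      declaration; the pointer is returned const so callers cannot modify the
      event fields):

        const struct perf_event *perf_get_event(struct file *file);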
  2. 24 May, 2018 8 commits
    • net: phy: replace bool members in struct phy_device with bit-fields · 87e5808d
      Heiner Kallweit authored
      In struct phy_device we have a number of flags being defined as type
      bool. Similar to e.g. struct pci_dev we can save some space by using
      bit-fields.
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87e5808d
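      A generic illustration of the space saving (the field names below are only
      examples, not the actual struct phy_device members):

        struct flags_as_bool {
                bool link_up;
                bool autoneg_done;
                bool suspended;
                bool attached;           /* one byte each, plus padding */
        };

        struct flags_as_bitfields {
                unsigned int link_up:1;
                unsigned int autoneg_done:1;
                unsigned int suspended:1;
                unsigned int attached:1; /* all four packed into one word */
        };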
    • Merge tag 'batadv-next-for-davem-20180524' of git://git.open-mesh.org/linux-merge · 5c352421
      David S. Miller authored
      Simon Wunderlich says:
      
      ====================
      This feature/cleanup patchset includes the following patches:
      
       - bump version strings, by Simon Wunderlich
      
       - Disable batman-adv debugfs by default, by Sven Eckelmann
      
       - Improve handling mesh nodes with multicast optimizations disabled,
         by Linus Luessing
      
       - Avoid bool in structs, by Sven Eckelmann
      
       - Allocate less memory when debugfs is disabled, by Sven Eckelmann
      
       - Fix batadv_interface_tx return data type, by Luc Van Oostenryck
      
       - improve link speed handling for virtual interfaces, by Marek Lindner
      
       - Enable BATMAN V algorithm by default, by Marek Lindner
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5c352421
    • bpfilter: don't pass O_CREAT when opening console for debug · 13405468
      Jakub Kicinski authored
      Passing O_CREAT (00000100) to open means we should also pass the file
      mode as the third parameter.  Creating /dev/console as a regular
      file may not be helpful anyway, so simply drop the flag when
      opening debug_fd.
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13405468
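      The general rule being applied, shown in a small illustrative snippet
      (the exact flags kept by the fix aside):

        #include <fcntl.h>

        int open_console_for_debug(void)
        {
                /* open(2) with O_CREAT requires a third mode argument, and
                 * /dev/console should not be created as a regular file anyway,
                 * so the flag is simply dropped: */
                return open("/dev/console", O_WRONLY);
        }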
    • bpfilter: fix build dependency · 61a552eb
      Alexei Starovoitov authored
      BPFILTER could have been enabled without INET, causing this build error:
      ERROR: "bpfilter_process_sockopt" [net/bpfilter/bpfilter.ko] undefined!
      
      Fixes: d2ba09c1 ("net: add skeleton of bpfilter kernel module")
      Reported-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61a552eb
    • Merge branch 'bpf-ipv6-seg6-bpf-action' · 31ad3923
      Daniel Borkmann authored
      Mathieu Xhonneux says:
      
      ====================
      As of Linux 4.14, it is possible to define advanced local processing for
      IPv6 packets with a Segment Routing Header through the seg6local LWT
      infrastructure. This LWT implements the network programming principles
      defined in the IETF "SRv6 Network Programming" draft.
      
      The implemented operations are generic, and it would be very interesting to
      be able to implement user-specific seg6local actions, without having to
      modify the kernel directly. To do so, this patchset adds an End.BPF action
      to seg6local, powered by some specific Segment Routing-related helpers,
      which provide SR functionalities that can be applied to the packet. This
      BPF hook would then allow implementing specific actions at native kernel
      speed such as OAM features, advanced SR SDN policies, SRv6 actions like
      Segment Routing Header (SRH) encapsulation depending on the content of
      the packet, etc.
      
      This patchset is divided into 6 patches, whose main features are:
      
      - A new seg6local action End.BPF with the corresponding new BPF program
        type BPF_PROG_TYPE_LWT_SEG6LOCAL. Such an attached BPF program can be
        passed to the LWT seg6local through netlink, the same way as the LWT
        BPF hook operates.
      - 3 new BPF helpers for the seg6local BPF hook, allowing an SRH to be
        edited/grown/shrunk and some of the generic SRv6 actions to be applied
        to a packet.
      - 1 new BPF helper for the LWT BPF IN hook, allowing an SRH to be added
        through encapsulation (via IPv6 encapsulation, or inlining if the packet
        already contains an IPv6 header).
      
      As this patchset adds a new LWT BPF hook, I took into account the result
      of the discussions when the LWT BPF infrastructure got merged. Hence, the
      seg6local BPF hook doesn't allow write access to skb->data directly; only
      the SRH can be modified through specific helpers, which ensures that the
      integrity of the packet is maintained. More details are available in the
      related patch messages.
      
      The performance of this BPF hook has been assessed with the BPF JIT
      enabled on an Intel Xeon X3440 processor with 4 cores and 8 threads
      clocked at 2.53 GHz. No throughput losses are noted with the seg6local
      BPF hook when the BPF program does nothing (440 kpps). Adding an 8-byte
      TLV (1 call each to bpf_lwt_seg6_adjust_srh and bpf_lwt_seg6_store_bytes)
      drops the throughput to 410 kpps, and inlining an SRH via bpf_lwt_seg6_action
      drops the throughput to 420 kpps. All throughputs are stable.
      
      Changelog:
      
      v2: move the SRH integrity state from skb->cb to a per-cpu buffer
      v3: - document helpers in man-page style
          - fix kbuild bugs
          - un-break BPF LWT out hook
          - bpf_push_seg6_encap is now static
          - preempt_enable is now called when the packet is dropped in
            input_action_end_bpf
      v4: fix kbuild bugs when CONFIG_IPV6=m
      v5: fix kbuild sparse warnings when CONFIG_IPV6=m
      v6: fix skb pointers-related bugs in helpers
      v7: - fix memory leak in error path of End.BPF setup
          - add freeing of BPF data in seg6_local_destroy_state
          - new enums SEG6_LOCAL_BPF_* instead of re-using ones of lwt bpf for
            netlink nested bpf attributes
          - SEG6_LOCAL_BPF_PROG attr now contains prog->aux->id when dumping
            state
      ====================
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      31ad3923
    • selftests/bpf: test for seg6local End.BPF action · c99a84ea
      Mathieu Xhonneux authored
      Add a new test for the seg6local End.BPF action. The following helpers
      are also tested:
      
      - bpf_lwt_push_encap within the LWT BPF IN hook
      - bpf_lwt_seg6_action
      - bpf_lwt_seg6_adjust_srh
      - bpf_lwt_seg6_store_bytes
      
      A chain of End.BPF actions is built. The SRH is injected through a LWT
      BPF IN hook before entering this chain. Each End.BPF action validates
      the previous one, otherwise the packet is dropped. The test succeeds
      if the last node in the chain receives the packet and the UDP datagram
      it contains can be retrieved from userspace.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      c99a84ea
    • ipv6: sr: Add seg6local action End.BPF · 004d4b27
      Mathieu Xhonneux authored
      This patch adds the End.BPF action to the LWT seg6local infrastructure.
      This action works like any other seg6local End action, meaning that an IPv6
      header with SRH is needed, whose DA has to be equal to the SID of the
      action. It will also advance the SRH to the next segment; the BPF program
      does not have to take care of this.
      
      Since the BPF program must not be a source of instability in the kernel, it
      is important to ensure that the integrity of the packet is maintained
      before yielding it back to the IPv6 layer. The hook hence keeps track of
      whether the SRH has been altered through the helpers, and re-validates its
      content if needed with seg6_validate_srh. The state kept for validation is
      stored in a per-CPU buffer. The BPF program is not allowed to directly
      write into the packet, and only some fields of the SRH can be altered
      through the helper bpf_lwt_seg6_store_bytes.
      
      Performance profiling has shown that the SRH re-validation does not induce
      a significant overhead. If the altered SRH is deemed as invalid, the packet
      is dropped.
      
      This validation is also done before executing any action through
      bpf_lwt_seg6_action, and will not be performed again if the SRH is not
      modified after calling the action.
      
      The BPF program may return 3 types of return codes:
          - BPF_OK: the End.BPF action will look up the next destination through
                   seg6_lookup_nexthop.
          - BPF_REDIRECT: if an action has been executed through the
                bpf_lwt_seg6_action helper, the BPF program should return this
                value, as the skb's destination is already set and the default
                lookup should not be performed.
          - BPF_DROP: the packet will be dropped.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      004d4b27
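      A minimal sketch of such a program (the section name and include follow
      the selftest conventions of this series and may differ slightly):

        #include <linux/bpf.h>
        #include "bpf_helpers.h"     /* SEC() macro from the selftests */

        SEC("lwt_seg6local")
        int do_end_bpf(struct __sk_buff *skb)
        {
                /* The SRH may only be inspected/edited through the dedicated
                 * helpers (bpf_lwt_seg6_store_bytes, bpf_lwt_seg6_adjust_srh,
                 * bpf_lwt_seg6_action); direct writes to skb->data are rejected.
                 */
                return BPF_OK;   /* let End.BPF look up the next segment */
        }

        char _license[] SEC("license") = "GPL";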
    • bpf: Split lwt inout verifier structures · cd3092c7
      Mathieu Xhonneux authored
      The new bpf_lwt_push_encap helper should only be accessible within the
      LWT BPF IN hook, and not the OUT one, as this may lead to a skb under
      panic.
      
      At the moment, both LWT BPF IN and OUT share the same list of helpers,
      whose calls are authorized by the verifier. This patch separates the
      verifier ops for the IN and OUT hooks, and allows the IN hook to call the
      bpf_lwt_push_encap helper.
      
      This patch is also the occasion to put all lwt_*_func_proto functions
      together for clarity. At the moment, sock_ops_func_proto is in the middle
      of lwt_inout_func_proto and lwt_xmit_func_proto.
      Signed-off-by: Mathieu Xhonneux <m.xhonneux@gmail.com>
      Acked-by: David Lebrun <dlebrun@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      cd3092c7