1. 27 Jun, 2019 19 commits
    • bpf: implement getsockopt and setsockopt hooks · 0d01da6a
      Stanislav Fomichev authored
      Implement new BPF_PROG_TYPE_CGROUP_SOCKOPT program type and
      BPF_CGROUP_{G,S}ETSOCKOPT cgroup hooks.
      
      BPF_CGROUP_SETSOCKOPT can modify user setsockopt arguments before
      passing them down to the kernel, or bypass the kernel completely.
      BPF_CGROUP_GETSOCKOPT can inspect/modify the getsockopt arguments that
      the kernel returns.
      Both hooks reuse the existing PTR_TO_PACKET{,_END} infrastructure.
      
      The buffer memory is pre-allocated (because I don't think there is
      a precedent for working with __user memory from bpf). This might be
      slow to do for each {s,g}etsockopt call, which is why I've added
      __cgroup_bpf_prog_array_is_empty, which exits early if there is nothing
      attached to a cgroup. Note, however, that there is a race between
      __cgroup_bpf_prog_array_is_empty and BPF_PROG_RUN_ARRAY where the cgroup
      program layout might have changed; this should not be a problem
      because in general there is a race between multiple calls to
      {s,g}etsockopt and the user adding/removing bpf progs from a cgroup.
      
      The return code of the BPF program is handled as follows:
      * 0: EPERM
      * 1: success, continue with next BPF program in the cgroup chain
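
      A minimal userspace sketch of the return-code handling listed above,
      assuming (per the v7 notes) that every attached program runs and the
      call is rejected if any of them returns 0. The names sockopt_prog_fn,
      run_sockopt_progs, prog_allow and prog_deny are hypothetical and are
      not the kernel's API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an attached BPF program: returns 0 or 1. */
typedef int (*sockopt_prog_fn)(void *ctx);

/*
 * Run every program in the chain (none is skipped). If any program
 * returned 0, the whole call fails with -EPERM; otherwise it succeeds.
 */
int run_sockopt_progs(const sockopt_prog_fn *progs, size_t n, void *ctx)
{
	int allow = 1;
	size_t i;

	assert(progs != NULL || n == 0);

	for (i = 0; i < n; i++)
		allow &= (progs[i](ctx) == 1);

	return allow ? 0 : -EPERM;
}

/* Two trivial example programs. */
int prog_allow(void *ctx) { (void)ctx; return 1; }
int prog_deny(void *ctx)  { (void)ctx; return 0; }
```

      An empty chain succeeds, which mirrors the early exit taken when
      nothing is attached to the cgroup.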
      
      v9:
      * allow overwriting setsockopt arguments (Alexei Starovoitov):
        * use set_fs (same as kernel_setsockopt)
        * buffer is always kzalloc'd (no small on-stack buffer)
      
      v8:
      * use s32 for optlen (Andrii Nakryiko)
      
      v7:
      * return only 0 or 1 (Alexei Starovoitov)
      * always run all progs (Alexei Starovoitov)
      * use optval=0 as kernel bypass in setsockopt (Alexei Starovoitov)
        (decided to use optval=-1 instead, optval=0 might be a valid input)
      * call getsockopt hook after kernel handlers (Alexei Starovoitov)
      
      v6:
      * rework cgroup chaining; stop as soon as bpf program returns
        0 or 2; see patch with the documentation for the details
      * drop Andrii's and Martin's Acked-by (not sure they are comfortable
        with the new state of things)
      
      v5:
      * skip copy_to_user() and put_user() when ret == 0 (Martin Lau)
      
      v4:
      * don't export bpf_sk_fullsock helper (Martin Lau)
      * size != sizeof(__u64) for uapi pointers (Martin Lau)
      * offsetof instead of bpf_ctx_range when checking ctx access (Martin Lau)
      
      v3:
      * typos in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY comments (Andrii Nakryiko)
      * reverse christmas tree in BPF_PROG_CGROUP_SOCKOPT_RUN_ARRAY (Andrii
        Nakryiko)
      * use __bpf_md_ptr instead of __u32 for optval{,_end} (Martin Lau)
      * use BPF_FIELD_SIZEOF() for consistency (Martin Lau)
      * new CG_SOCKOPT_ACCESS macro to wrap repeated parts
      
      v2:
      * moved bpf_sockopt_kern fields around to remove a hole (Martin Lau)
      * aligned bpf_sockopt_kern->buf to 8 bytes (Martin Lau)
      * bpf_prog_array_is_empty instead of bpf_prog_array_length (Martin Lau)
      * added [0,2] return code check to verifier (Martin Lau)
      * dropped unused buf[64] from the stack (Martin Lau)
      * use PTR_TO_SOCKET for bpf_sockopt->sk (Martin Lau)
      * dropped bpf_target_off from ctx rewrites (Martin Lau)
      * use return code for kernel bypass (Martin Lau & Andrii Nakryiko)
      
      Cc: Andrii Nakryiko <andriin@fb.com>
      Cc: Martin Lau <kafai@fb.com>
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • Merge branch 'bpf-af-xdp-mlx5e' · 3b1c667e
      Daniel Borkmann authored
      Tariq Toukan says:
      
      ====================
      This series contains improvements to the AF_XDP kernel infrastructure
      and AF_XDP support in mlx5e. The infrastructure improvements are
      required for mlx5e, but some of them also benefit all drivers, and
      some can be useful for other drivers that want to implement AF_XDP.
      
      The performance testing was performed on a machine with the following
      configuration:
      
      - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
      - Mellanox ConnectX-5 Ex with 100 Gbit/s link
      
      The results with retpoline disabled, single stream:
      
      txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
      rxdrop: 12.2 Mpps
      l2fwd: 9.4 Mpps
      
      The results with retpoline enabled, single stream:
      
      txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
      rxdrop: 9.9 Mpps
      l2fwd: 6.8 Mpps
      
      v2 changes:
      
      Added patches for mlx5e and addressed the comments for v1. Rebased for
      bpf-next.
      
      v3 changes:
      
      Rebased for the newer bpf-next, resolved conflicts in libbpf. Addressed
      Björn's comments for coding style. Fixed a bug in error handling flow in
      mlx5e_open_xsk.
      
      v4 changes:
      
      UAPI is not changed, XSK RX queues are exposed to the kernel. The lower
      half of the available amount of RX queues are regular queues, and the
      upper half are XSK RX queues. The patch "xsk: Extend channels to support
      combined XSK/non-XSK traffic" was dropped. The final patch was reworked
      accordingly.
      
      Added "net/mlx5e: Attach/detach XDP program safely", as the changes
      introduced in the XSK patch are based on that one.
      
      Added "libbpf: Support drivers with non-combined channels", which aligns
      the condition in libbpf with the condition in the kernel.
      
      Rebased over the newer bpf-next.
      
      v5 changes:
      
      In v4, ethtool reports the number of channels as 'combined' and the
      number of XSK RX queues as 'rx' for mlx5e. It was changed, so that 'rx'
      is 0, and 'combined' reports the double amount of channels if there is
      an active UMEM - to make libbpf happy.
      
      The patch for libbpf was dropped. Although it's still useful and fixes
      things, it raised some disagreement, so I'm dropping it. It's no longer
      needed for mlx5e after the change above.
      
      v6 changes:
      
      As Maxim is out of office, I rebased the series on his behalf,
      resolved some conflicts, and respun it.
      ====================
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Add XSK zero-copy support · db05815b
      Maxim Mikityanskiy authored
      This commit adds support for AF_XDP zero-copy RX and TX.
      
      We create a dedicated XSK RQ inside the channel, which means that two
      RQs are running simultaneously: one for non-XSK traffic and the other
      for XSK traffic. The regular and XSK RQs use a single ID namespace split
      into two halves: the lower half is regular RQs, and the upper half is
      XSK RQs. When any zero-copy AF_XDP socket is active, changing the number
      of channels is not allowed, because it would break the mapping between
      XSK RQ IDs and channels.
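
      A minimal sketch of the split ID namespace described above, assuming
      num_channels regular RQs followed by num_channels XSK RQs, with both
      halves mapping onto the same channels. The helper names are
      hypothetical and are not taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* IDs [0, n) are regular RQs; IDs [n, 2n) are XSK RQs (n = channels). */
bool rq_id_is_xsk(unsigned int qid, unsigned int num_channels)
{
	assert(num_channels > 0 && qid < 2 * num_channels);
	return qid >= num_channels;
}

/* Assumed mapping: both halves map back onto the same set of channels. */
unsigned int rq_id_to_channel(unsigned int qid, unsigned int num_channels)
{
	assert(num_channels > 0 && qid < 2 * num_channels);
	return qid % num_channels;
}
```

      Under this assumption, with 4 channels, ID 1 is the regular RQ of
      channel 1 and ID 5 is the XSK RQ of the same channel. Changing the
      channel count would shift the boundary between the halves, so an ID
      held by an open socket would point at a different queue.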
      
      XSK requires different page allocation and release routines. Such
      functions as mlx5e_{alloc,free}_rx_mpwqe and mlx5e_{get,put}_rx_frag are
      generic enough to be used for both regular and XSK RQs, and they use the
      mlx5e_page_{alloc,release} wrappers around the real allocation
      functions. Function pointers are not used to avoid losing the
      performance with retpolines. Wherever it's certain that the regular
      (non-XSK) page release function should be used, it's called directly.
      
      Only the stats that could be meaningful for XSK are exposed to the
      userspace. Those that don't take part in the XSK flow are not
      considered.
      
      Note that we don't wait for WQEs on the XSK RQ (unlike the regular RQ),
      because the newer xdpsock sample doesn't provide any Fill Ring entries
      at the setup stage.
      
      We create a dedicated XSK SQ in the channel. This separation has its
      advantages:
      
      1. When the UMEM is closed, the XSK SQ can also be closed and stop
      receiving completions. If an existing SQ was used for XSK, it would
      continue receiving completions for the packets of the closed socket. If
      a new UMEM was opened at that point, it would start getting completions
      that don't belong to it.
      
      2. Calculating statistics separately.
      
      When the userspace kicks the TX, the driver triggers a hardware
      interrupt by posting a NOP to a dedicated XSK ICO (internal control
      operations) SQ, in order to trigger NAPI on the right CPU core. This XSK
      ICO SQ is protected by a spinlock, as the userspace application may kick
      the TX from any core.
      
      Store the pointers to the UMEMs in the net device private context,
      independently from the kernel. This way the driver can distinguish
      between the zero-copy and non-zero-copy UMEMs. The kernel function
      xdp_get_umem_from_qid does not care about this difference, but the
      driver is only interested in zero-copy UMEMs, particularly, on the
      cleanup it determines whether to close the XSK RQ and SQ or not by
      looking at the presence of the UMEM. Use state_lock to protect the
      access to this area of UMEM pointers.
      
      LRO isn't compatible with XDP, but there may be active UMEMs while
      XDP is off. If this is the case, don't allow LRO to ensure XDP can
      be reenabled at any time.
      
      The validation of XSK parameters typically happens when XSK queues
      open. However, when the interface is down or the XDP program isn't
      set, it's still possible to have active AF_XDP sockets and even to
      open new ones, but the XSK queues will be closed. To cover these cases,
      perform the validation also in these flows:
      
      1. A new UMEM is registered, but the XSK queues aren't going to be
      created due to missing XDP program or interface being down.
      
      2. MTU changes while there are UMEMs registered.
      
      Having this early check prevents mlx5e_open_channels from failing
      at a later stage, where recovery is impossible and the application
      has no chance to handle the error, because it got the successful
      return value for an MTU change or XSK open operation.
      
      The performance testing was performed on a machine with the following
      configuration:
      
      - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
      - Mellanox ConnectX-5 Ex with 100 Gbit/s link
      
      The results with retpoline disabled, single stream:
      
      txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
      rxdrop: 12.2 Mpps
      l2fwd: 9.4 Mpps
      
      The results with retpoline enabled, single stream:
      
      txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
      rxdrop: 9.9 Mpps
      l2fwd: 6.8 Mpps
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Move queue param structs to en/params.h · 32a23653
      Maxim Mikityanskiy authored
      structs mlx5e_{rq,sq,cq,channel}_param are going to be used in the
      upcoming XSK RX and TX patches. Move them to a header file to make
      them accessible from other C files.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Encapsulate open/close queues into a function · 0a06382f
      Maxim Mikityanskiy authored
      Create new functions mlx5e_{open,close}_queues to encapsulate opening
      and closing RQs and SQs, and call the new functions from
      mlx5e_{open,close}_channel. It simplifies the existing functions a bit
      and prepares them for the upcoming AF_XDP changes.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Consider XSK in XDP MTU limit calculation · a011b49f
      Maxim Mikityanskiy authored
      Use the existing mlx5e_get_linear_rq_headroom function to calculate the
      headroom for mlx5e_xdp_max_mtu. This function takes the XSK headroom
      into consideration, which will be used in the following patches.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: XDP_TX from UMEM support · 84a0a231
      Maxim Mikityanskiy authored
      When an XDP program returns XDP_TX, and the RQ is XSK-enabled, it
      requires careful handling, because convert_to_xdp_frame creates a new
      page and copies the data there, while our driver expects the xdp_frame
      to point to the same memory as the xdp_buff. Handle this case
      separately: map the page, and in the end unmap it and call
      xdp_return_frame.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Share the XDP SQ for XDP_TX between RQs · b9673cf5
      Maxim Mikityanskiy authored
      Put the XDP SQ that is used for XDP_TX into the channel. It used to be a
      part of the RQ, but with the introduction of AF_XDP there will be one more
      RQ that could share the same XDP SQ. This patch is a preparation for
      that change.
      
      Separate XDP_TX statistics per RQ were implemented in one of the previous
      patches.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Refactor struct mlx5e_xdp_info · d963fa15
      Maxim Mikityanskiy authored
      Currently, struct mlx5e_xdp_info has some issues that have to be cleaned
      up before the upcoming AF_XDP support makes things too complicated and
      messy. This structure is used both when sending the packet and on
      completion. Moreover, the cleanup procedure on completion depends on the
      origin of the packet (XDP_REDIRECT, XDP_TX). Adding AF_XDP support will
      add new flows that use this structure in yet other ways. To avoid
      overcomplicating the code, this commit refactors the usage of this
      structure in the following ways:
      
      1. struct mlx5e_xdp_info is split into two different structures. One is
      struct mlx5e_xdp_xmit_data, a transient structure that doesn't need to
      be stored and is only used while sending the packet. The other is still
      struct mlx5e_xdp_info that is stored in a FIFO and contains the fields
      needed on completion.
      
      2. The fields of struct mlx5e_xdp_info that are used in different flows
      are put into a union. A special enum indicates the cleanup mode and
      helps choose the right union member. This approach is clear and
      explicit. Although it could be possible to "guess" the mode by looking
      at the values of the fields and at the XDP SQ type, it wouldn't be that
      clear and extendable and would require looking through the whole chain
      to understand what's going on.
      
      For reference, these are the fields of struct mlx5e_xdp_info that
      are used in different flows (including AF_XDP ones):
      
      Packet origin          | Fields used on completion | Cleanup steps
      -----------------------+---------------------------+------------------
      XDP_REDIRECT,          | xdpf, dma_addr            | DMA unmap and
      XDP_TX from XSK RQ     |                           | xdp_return_frame.
      -----------------------+---------------------------+------------------
      XDP_TX from regular RQ | di                        | Recycle page.
      -----------------------+---------------------------+------------------
      AF_XDP TX              | (none)                    | Increment the
                             |                           | producer index in
                             |                           | Completion Ring.
      
      On send, the same set of mlx5e_xdp_xmit_data fields is used in all
      flows: DMA and virtual addresses and length.
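
      The split described above can be sketched roughly as follows. This is
      only an illustration of the shape (a transient send-side struct, plus
      a stored struct with a mode enum selecting a union member). All type,
      field and function names here are hypothetical and simplified, and
      they do not reproduce the driver's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; only the shape follows the text. */
enum xdp_cleanup_mode {
	XDP_CLEANUP_FRAME,	/* XDP_REDIRECT, or XDP_TX from an XSK RQ */
	XDP_CLEANUP_PAGE,	/* XDP_TX from a regular RQ */
	XDP_CLEANUP_XSK,	/* AF_XDP TX */
};

/* Transient: only needed while the packet is being posted. */
struct xdp_xmit_data {
	uint64_t dma_addr;
	void *data;
	uint32_t len;
};

/* Stored in the FIFO: only what the completion path needs. */
struct xdp_info {
	enum xdp_cleanup_mode mode;
	union {
		struct { void *xdpf; uint64_t dma_addr; } frame;
		struct { void *di; } page;
		/* XSK mode needs no per-packet fields. */
	} u;
};

/* Returns a short description of the cleanup step for each mode. */
const char *xdp_cleanup_step(const struct xdp_info *xi)
{
	assert(xi != NULL);

	switch (xi->mode) {
	case XDP_CLEANUP_FRAME:
		return "DMA unmap and xdp_return_frame";
	case XDP_CLEANUP_PAGE:
		return "recycle page";
	case XDP_CLEANUP_XSK:
		return "increment Completion Ring producer index";
	}
	return "unknown";
}
```
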
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Allow ICO SQ to be used by multiple RQs · ed084fb6
      Maxim Mikityanskiy authored
      Prepare for the creation of the XSK RQ, which will require posting UMRs, too.
      The same ICO SQ will be used for both RQs and also to trigger interrupts
      by posting NOPs. UMR WQEs can't be reused any more. Optimization
      introduced in commit ab966d7e ("net/mlx5e: RX, Recycle buffer of
      UMR WQEs") is reverted.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Calculate linear RX frag size considering XSK · a069e977
      Maxim Mikityanskiy authored
      Additional conditions introduced:
      
      - XSK implies XDP.
      - Headroom includes the XSK headroom if it exists.
      - No space is reserved for struct skb_shared_info in XSK mode.
      - Fragment size smaller than the XSK chunk size is not allowed.
      
      A new auxiliary function mlx5e_get_linear_rq_headroom with the support
      for XSK is introduced. Use this function in the implementation of
      mlx5e_get_rq_headroom. Change headroom to u32 to match the headroom
      field in struct xdp_umem.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Replace deprecated PCI_DMA_TODEVICE · 6ed9350f
      Maxim Mikityanskiy authored
      The PCI API for DMA is deprecated, and PCI_DMA_TODEVICE is just defined
      to DMA_TO_DEVICE for backward compatibility. Just use DMA_TO_DEVICE.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: Return the whole xdp_desc from xsk_umem_consume_tx · 4bce4e5c
      Maxim Mikityanskiy authored
      Some drivers want to access the data transmitted in order to implement
      acceleration features of the NICs. It is also useful in AF_XDP TX flow.
      
      Change the xsk_umem_consume_tx API to return the whole xdp_desc, which
      contains the data pointer, length and DMA address, instead of only the
      latter two. Adapt the implementation of i40e and ixgbe to this change.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: Björn Töpel <bjorn.topel@intel.com>
      Cc: Magnus Karlsson <magnus.karlsson@intel.com>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: Change the default frame size to 4096 and allow controlling it · 123e8da1
      Maxim Mikityanskiy authored
      The typical XDP memory scheme is one packet per page. Change the AF_XDP
      frame size in libbpf to 4096, which is the page size on x86, to allow
      libbpf to be used with the drivers with the packet-per-page scheme.
      
      Add a command line option -f to xdpsock to specify a custom
      frame size.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • libbpf: Support getsockopt XDP_OPTIONS · 2761ed4b
      Maxim Mikityanskiy authored
      Query XDP_OPTIONS in libbpf to determine if the zero-copy mode is active
      or not.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: Add getsockopt XDP_OPTIONS · 2640d3c8
      Maxim Mikityanskiy authored
      Make it possible for the application to determine whether the AF_XDP
      socket is running in zero-copy mode. To achieve this, add a new
      getsockopt option XDP_OPTIONS that returns flags. The only flag
      supported for now is the zero-copy mode indicator.
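
      A minimal userspace sketch of how an application might query the new
      option. The SKETCH_* constants and struct sketch_xdp_options are local
      stand-ins that are assumed to mirror the values in the UAPI header
      <linux/if_xdp.h>; real code should include that header instead. The
      helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/socket.h>

/*
 * Values assumed to mirror the Linux UAPI header <linux/if_xdp.h>, so
 * that the sketch builds without it; prefer the real header in real code.
 */
#ifndef SOL_XDP
#define SOL_XDP 283
#endif
#define SKETCH_XDP_OPTIONS		8
#define SKETCH_XDP_OPTIONS_ZEROCOPY	(1u << 0)

struct sketch_xdp_options {
	uint32_t flags;
};

/* Pure helper: does the returned flag word indicate zero-copy mode? */
bool xsk_flags_zerocopy(uint32_t flags)
{
	return (flags & SKETCH_XDP_OPTIONS_ZEROCOPY) != 0;
}

/*
 * Query an AF_XDP socket. Returns 1 for zero-copy, 0 for copy mode,
 * and -1 if getsockopt() fails or returns an unexpected length.
 */
int xsk_query_zerocopy(int fd)
{
	struct sketch_xdp_options opts = { 0 };
	socklen_t optlen = sizeof(opts);

	if (getsockopt(fd, SOL_XDP, SKETCH_XDP_OPTIONS, &opts, &optlen) != 0)
		return -1;
	if (optlen != sizeof(opts))
		return -1;
	return xsk_flags_zerocopy(opts.flags) ? 1 : 0;
}
```
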
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • xsk: Add API to check for available entries in FQ · d57d7642
      Maxim Mikityanskiy authored
      Add a function that checks whether the Fill Ring has the specified
      number of descriptors available. It will be useful for mlx5e, which
      wants to check in advance whether it can allocate RX descriptors in bulk
      to get the best performance.
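
      The idea of such a check can be illustrated with a toy ring that uses
      free-running producer and consumer indices. This is not the kernel's
      ring implementation; struct toy_ring and both helpers are hypothetical
      names used only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy single-producer/single-consumer ring; not the kernel's queue. */
struct toy_ring {
	uint32_t producer;	/* advanced by userspace when it posts frames */
	uint32_t consumer;	/* advanced by the driver when it takes frames */
};

/* Free-running 32-bit indices: the difference is valid across wraparound. */
uint32_t toy_ring_available(const struct toy_ring *r)
{
	assert(r != NULL);
	return r->producer - r->consumer;
}

/* True if at least cnt descriptors can be taken without running dry. */
bool toy_ring_has_entries(const struct toy_ring *r, uint32_t cnt)
{
	return toy_ring_available(r) >= cnt;
}
```

      A driver that allocates in batches could call such a check once per
      batch and skip the allocation entirely when it fails.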
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • net/mlx5e: Attach/detach XDP program safely · e1895324
      Maxim Mikityanskiy authored
      When an XDP program is set, a full reopen of all channels happens in two
      cases:
      
      1. When there was no program set, and a new one is being set.
      
      2. When there was a program set, but it's being unset.
      
      The full reopen is necessary, because the channel parameters may change
      if XDP is enabled or disabled. However, it's performed in an unsafe way:
      if the new channels fail to open, the old ones are already closed, and
      the interface goes down. Use the safe way to switch channels instead.
      The same way is already used for other configuration changes.
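
      The difference between the unsafe and the safe order can be shown with
      a toy model: open the new channel set first, and close the old one
      only after that succeeded. struct chan_set and the three functions
      are hypothetical and do not correspond to driver functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a set of opened channels. */
struct chan_set {
	bool open;
	int params;
};

/* Hypothetical open routine: fails for negative parameters. */
bool chan_set_open(struct chan_set *cs, int params)
{
	assert(cs != NULL);
	if (params < 0)
		return false;
	cs->open = true;
	cs->params = params;
	return true;
}

void chan_set_close(struct chan_set *cs)
{
	assert(cs != NULL);
	cs->open = false;
}

/*
 * Safe switch: open the new set first and only then close the old one.
 * On failure *active is untouched, so the interface stays up.
 */
bool safe_switch(struct chan_set *active, int new_params)
{
	struct chan_set fresh = { false, 0 };

	if (!chan_set_open(&fresh, new_params))
		return false;

	chan_set_close(active);
	*active = fresh;
	return true;
}
```
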
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: fix cgroup bpf release synchronization · e5c891a3
      Roman Gushchin authored
      Since commit 4bfc0bb2 ("bpf: decouple the lifetime of cgroup_bpf
      from cgroup itself"), cgroup_bpf release occurs asynchronously
      (from a worker context), and before the release of the cgroup itself.
      
      This introduced a previously non-existing race between the release
      and update paths, e.g. if a leaf's cgroup_bpf is released and a new
      bpf program is attached to one of its ancestor cgroups at the same
      time. The race may result in double-free and other memory corruptions.
      
      To fix the problem, let's protect the body of cgroup_bpf_release()
      with cgroup_mutex, as it effectively was previously, when all this
      code was called from the cgroup release path with cgroup mutex held.
      
      Also let's skip cgroups which have no chance to invoke a bpf
      program on the update path. If the cgroup bpf refcnt reached 0,
      it means that the cgroup is offline (no attached processes), and
      there are no associated sockets left. It means there is no point
      in updating effective progs array! And it can lead to a leak,
      if it happens after the release. So, let's skip such cgroups.
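
      The skip on the update path can be sketched with a toy model. The
      struct, its fields and the function name are hypothetical; locking is
      left out of the sketch and only mentioned in the comment.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model; field and function names are hypothetical. */
struct toy_cgroup {
	long bpf_refcnt;	/* 0 means the cgroup_bpf is being released */
	int effective_gen;	/* bumped whenever effective progs are rebuilt */
};

/*
 * Update path: rebuild effective programs for each descendant, skipping
 * those whose refcount already reached zero. Returns the number updated.
 * In the kernel, both this path and the release path would run with
 * cgroup_mutex held; the toy model has no concurrency.
 */
size_t update_effective(struct toy_cgroup *cgs, size_t n)
{
	size_t i, updated = 0;

	assert(cgs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (cgs[i].bpf_refcnt == 0)
			continue;	/* nothing can invoke a program here */
		cgs[i].effective_gen++;
		updated++;
	}
	return updated;
}
```
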
      
      Big thanks to Tejun Heo for discovering and debugging this problem!
      
      Fixes: 4bfc0bb2 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
      Reported-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 26 Jun, 2019 5 commits
  3. 25 Jun, 2019 6 commits
  4. 23 Jun, 2019 10 commits
    • Merge branch 'ipv6-avoid-taking-refcnt-on-dst-during-route-lookup' · 7d30a7f6
      David S. Miller authored
      Wei Wang says:
      
      ====================
      ipv6: avoid taking refcnt on dst during route lookup
      
      IPv6 route lookup code always grabs refcnt on the dst for the caller.
      But for certain cases, grabbing refcnt is not necessary if the
      call path is rcu protected and the caller does not cache the dst.
      Another issue in the route lookup logic is:
      When there are multiple custom rules, we have to do the lookup into
      each table associated with each rule individually. And when we can't
      find the route in one table, we grab and release refcnt on
      net->ipv6.ip6_null_entry before going to the next table.
      This operation is completely redundant, and causes false sharing because
      net->ipv6.ip6_null_entry is a shared object.
      
      This patch set introduces a new flag RT6_LOOKUP_F_DST_NOREF for route
      lookup callers to set, to avoid any manipulation on the dst refcnt. And
      it converts the major input and output path to use it.
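
      A toy model of the effect of the flag on a multi-table lookup. The
      types, the flag value TOY_LOOKUP_F_DST_NOREF and toy_lookup() are
      hypothetical and are not the kernel's data structures. The null_ops
      counter exists only to make the refcount traffic on the shared null
      entry visible.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_LOOKUP_F_DST_NOREF 0x1	/* hypothetical value */

struct toy_dst {
	int refcnt;
};

struct toy_table {
	struct toy_dst *route;	/* NULL if this table has no match */
};

/*
 * Walk the tables in order. Without NOREF, every miss takes and drops a
 * reference on the shared null entry and a hit takes one for the caller.
 * With NOREF, no refcount is touched; the caller must stay under RCU.
 * *null_ops counts refcount writes to the shared null entry.
 */
struct toy_dst *toy_lookup(struct toy_table *tables, size_t n,
			   struct toy_dst *null_entry, int flags,
			   int *null_ops)
{
	int noref = flags & TOY_LOOKUP_F_DST_NOREF;
	size_t i;

	assert(null_entry != NULL && null_ops != NULL);

	for (i = 0; i < n; i++) {
		if (tables[i].route) {
			if (!noref)
				tables[i].route->refcnt++;
			return tables[i].route;
		}
		if (!noref) {
			null_entry->refcnt++;	/* grab ... */
			null_entry->refcnt--;	/* ... and release */
			*null_ops += 2;
		}
	}
	if (!noref) {
		null_entry->refcnt++;
		*null_ops += 1;
	}
	return null_entry;
}
```

      With the route in the third of three tables, the model performs four
      writes to the shared entry without the flag and none with it.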
      
      The performance gain is noticeable.
      I ran synflood tests between 2 hosts under the same switch. Both hosts
      have 20G mlx NIC, and 8 tx/rx queues.
      Sender sends pure SYN flood with random src IPs and ports using trafgen.
      Receiver has a simple TCP listener on the target port.
      Both hosts have multiple custom rules:
      - For incoming packets, only local table is traversed.
      - For outgoing packets, 3 tables are traversed to find the route.
      The packet processing rate on the receiver is as follows:
      - Before the fix: 3.78Mpps
      - After the fix:  5.50Mpps
      
      v2->v3:
      - Handled fib6_rule_lookup() when CONFIG_IPV6_MULTIPLE_TABLES is not
        configured in patch 03 (suggested by David Ahern)
      - Removed the renaming of l3mdev_link_scope_lookup() in patch 05
        (suggested by David Ahern)
      - Moved definition of ip6_route_output_flags() from an inline function
        in /net/ipv6/route.c to net/ipv6/route.c in order to address kbuild
        error in patch 05
      
      v1->v2:
      - Added a helper ip6_rt_put_flags() in patch 3 suggested by David Miller
      ====================
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF · 7d9e5f42
      Wei Wang authored
      For tx path, in most cases, we still have to take refcnt on the dst
      because the caller is caching the dst somewhere. But it is still
      beneficial to make use of the RT6_LOOKUP_F_DST_NOREF flag while doing the
      route lookup. This is because the flag prevents manipulating refcnt on
      net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each
      routing table. The null_entry is a shared object and constant updates on
      it cause false sharing.
      
      We converted the current major lookup function ip6_route_output_flags()
      to make use of RT6_LOOKUP_F_DST_NOREF.
      
      Together with the change in the rx path, we see a noticeable performance
      boost:
      I ran synflood tests between 2 hosts under the same switch. Both hosts
      have 20G mlx NIC, and 8 tx/rx queues.
      Sender sends pure SYN flood with random src IPs and ports using trafgen.
      Receiver has a simple TCP listener on the target port.
      Both hosts have multiple custom rules:
      - For incoming packets, only local table is traversed.
      - For outgoing packets, 3 tables are traversed to find the route.
      The packet processing rate on the receiver is as follows:
      - Before the fix: 3.78Mpps
      - After the fix:  5.50Mpps
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: convert rx data path to not take refcnt on dst · 67f415dd
      Wei Wang authored
      ip6_route_input() is the key function to do the route lookup in the
      rx data path. All the callers to this function are already holding rcu
      lock. So it is fairly easy to convert it to not take refcnt on the dst:
      We pass in flag RT6_LOOKUP_F_DST_NOREF and do skb_dst_set_noref().
      This saves a few atomic inc or dec operations and should boost
      performance overall.
      This also makes the logic more aligned with v4.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logic · d64a1f57
      Wei Wang authored
      This patch specifically converts the rule lookup logic to honor this
      flag and not release refcnt when traversing each rule and calling
      lookup() on each routing table.
      Similar to the previous patch, we also need some special handling of dst
      entries in the uncached list because there is always 1 refcnt taken for
      them even if the RT6_LOOKUP_F_DST_NOREF flag is set.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d64a1f57
    • Wei Wang's avatar
      ipv6: initialize rt6->rt6i_uncached in all pre-allocated dst entries · 74109218
      Wei Wang authored
      Initialize rt6->rt6i_uncached on the following pre-allocated dsts:
      net->ipv6.ip6_null_entry
      net->ipv6.ip6_prohibit_entry
      net->ipv6.ip6_blk_hole_entry
      
      This is a preparation patch for later commits to be able to distinguish
      dst entries in uncached list by doing:
      !list_empty(&rt6->rt6i_uncached)
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74109218
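      Why the initialization matters can be shown with a minimal userspace
      sketch. It models the kernel's circular list head with hypothetical
      local definitions (this is not the kernel's <linux/list.h>): an
      initialized head points at itself and reads as empty, while a merely
      zeroed head does not, so a pre-allocated entry that was never
      initialized would wrongly look like it sits on the uncached list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of a circular doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* An initialized head points at itself. */
static void init_list_head(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* A head is empty only when it points back at itself. */
static int list_empty(const struct list_head *head)
{
	return head->next == head;
}

/* Insert a new node right after head. */
static void list_add(struct list_head *node, struct list_head *head)
{
	assert(node != NULL && head != NULL);
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}
```

      With these definitions, a zero-filled head has next == NULL, so
      list_empty() returns false for it even though nothing was ever added.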
    • Wei Wang's avatar
      ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route() · 0e09edcc
      Wei Wang authored
      This new flag instructs the route lookup function to not take a
      refcnt on the dst entry. A caller that does a route lookup with this
      flag must properly use rcu protection.
      ip6_pol_route() is the major route lookup function for both tx and rx
      path.
      In this function:
      Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and
      directly return the route entry. The caller should be holding rcu lock
      when using this flag, and decide whether to take refcnt or not.
      
      One note on the dst cache in the uncached_list:
      As uncached_list does not consume refcnt, one refcnt is always returned
      back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set.
      Uncached dst is only possible in the output path. So in such call path,
      caller MUST check if the dst is in the uncached_list before assuming
      that there is no refcnt taken on the returned dst.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e09edcc
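      The refcnt rule described above can be sketched as a toy userspace
      model. All names here (dst_sketch, LOOKUP_F_DST_NOREF, lookup_sketch,
      put_flags_sketch) are hypothetical stand-ins, not the kernel's
      definitions, and the "uncached" boolean stands in for the
      !list_empty() check on rt6i_uncached. The point is only that lookup
      and release must agree on when a reference was actually taken.

```c
#include <assert.h>
#include <stdbool.h>

#define LOOKUP_F_DST_NOREF 0x1 /* hypothetical flag value */

/* Toy dst entry: a plain counter instead of an atomic refcnt. */
struct dst_sketch {
	int refcnt;
	bool uncached; /* models "entry is on the uncached list" */
};

/*
 * Lookup: take a reference unless the caller asked for NOREF.
 * Uncached entries always hand one reference back to the caller.
 */
static struct dst_sketch *lookup_sketch(struct dst_sketch *dst, int flags)
{
	if (!(flags & LOOKUP_F_DST_NOREF) || dst->uncached)
		dst->refcnt++;
	return dst;
}

/* Release: drop a reference only in the cases where lookup took one. */
static void put_flags_sketch(struct dst_sketch *dst, int flags)
{
	if (!(flags & LOOKUP_F_DST_NOREF) || dst->uncached) {
		assert(dst->refcnt > 0);
		dst->refcnt--;
	}
}
```

      A caller that ignored the uncached case and skipped the release would
      leak one reference per lookup in the output path.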
    • Russell King's avatar
      doc: phy: document some PHY_INTERFACE_MODE_xxx settings · 8c25c0cb
      Russell King authored
      There seems to be some confusion surrounding three PHY interface modes,
      specifically 1000BASE-X, 2500BASE-X and SGMII.  Add some documentation
      to phylib detailing precisely what these interface modes refer to.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c25c0cb
    • Qian Cai's avatar
      inet: fix compilation warnings in fqdir_pre_exit() · 08003d0b
      Qian Cai authored
      The linux-next commit "inet: fix various use-after-free in defrags
      units" [1] introduced compilation warnings,
      
      ./include/net/inet_frag.h:117:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration]
       static void inline fqdir_pre_exit(struct fqdir *fqdir)
       ^~~~~~
      In file included from ./include/net/netns/ipv4.h:10,
                       from ./include/net/net_namespace.h:20,
                       from ./include/linux/netdevice.h:38,
                       from ./include/linux/icmpv6.h:13,
                       from ./include/linux/ipv6.h:86,
                       from ./include/net/ipv6.h:12,
                       from ./include/rdma/ib_verbs.h:51,
                       from ./include/linux/mlx5/device.h:37,
                       from ./include/linux/mlx5/driver.h:51,
                       from drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:37:
      
      [1] https://lore.kernel.org/netdev/20190618180900.88939-3-edumazet@google.com/
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08003d0b
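      The warning above is about specifier order: GCC's
      -Wold-style-declaration wants storage-class and function specifiers
      such as 'inline' ahead of the return type. A minimal sketch of the
      corrected form follows. The struct and its two fields are a
      hypothetical stand-in for the real struct fqdir, and the function body
      is an assumption for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct fqdir; only two fields are modeled. */
struct fqdir_sketch {
	long high_thresh;
	bool dead;
};

/*
 * Warns with -Wold-style-declaration:
 *     static void inline fqdir_pre_exit_sketch(struct fqdir_sketch *fqdir)
 * Clean: 'inline' placed before the return type.
 */
static inline void fqdir_pre_exit_sketch(struct fqdir_sketch *fqdir)
{
	assert(fqdir != NULL);
	fqdir->high_thresh = 0; /* assumed body: stop accepting new fragments */
	fqdir->dead = true;
}
```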
    • Rasmus Villemoes's avatar
      net: dsa: mv88e6xxx: introduce helpers for handling chip->reg_lock · c9acece0
      Rasmus Villemoes authored
      This is a no-op that simply moves all locking and unlocking of
      ->reg_lock into trivial helpers. I did that to be able to easily add
      some ad hoc instrumentation to those helpers to get some information
      on contention and hold times of the mutex. Perhaps others want to do
      something similar at some point, so this frees them from doing the
      'sed -i' yoga and gives them a much smaller 'git diff' while fiddling.
      Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
      Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c9acece0
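      The idea of funnelling every lock and unlock through one pair of
      helpers can be sketched in userspace with a pthread mutex. The struct,
      the helper names, and the lock_count counter are hypothetical and only
      illustrate where ad hoc instrumentation would go; the driver itself
      uses a kernel mutex.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical chip state: one lock guarding one register value. */
struct chip_sketch {
	pthread_mutex_t reg_lock;
	unsigned long lock_count; /* ad hoc instrumentation, updated under the lock */
	int reg;
};

/* Single place that takes the lock, so instrumentation lives here only. */
static void chip_reg_lock(struct chip_sketch *chip)
{
	int err = pthread_mutex_lock(&chip->reg_lock);

	assert(err == 0);
	(void)err;
	chip->lock_count++;
}

/* Single place that drops the lock. */
static void chip_reg_unlock(struct chip_sketch *chip)
{
	int err = pthread_mutex_unlock(&chip->reg_lock);

	assert(err == 0);
	(void)err;
}

/* Callers use the helpers instead of touching reg_lock directly. */
static void chip_write(struct chip_sketch *chip, int val)
{
	chip_reg_lock(chip);
	chip->reg = val;
	chip_reg_unlock(chip);
}
```

      Adding hold-time measurement would then touch only the two helpers
      rather than every call site.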
    • Sameeh Jubran's avatar
      net: ena: Fix bug where ring allocation backoff stopped too late · 3e5bfb18
      Sameeh Jubran authored
      The current code of create_queues_with_size_backoff() allows the ring size
      to become as small as ENA_MIN_RING_SIZE/2. This is a bug since we don't
      want the queue ring to be smaller than ENA_MIN_RING_SIZE.
      
      In this commit we change the loop's termination condition to look at the
      queue size of the next iteration instead of that of the current one,
      so that the minimal queue size again becomes ENA_MIN_RING_SIZE.
      
      Fixes: eece4d2a ("net: ena: add ethtool function for changing io queue sizes")
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3e5bfb18
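      The termination fix can be illustrated with a simplified backoff
      loop. This is a sketch, not the driver's code: MIN_RING_SIZE, the
      budget-based try_alloc() and alloc_with_backoff() are hypothetical, and
      halving is assumed as the backoff step. The loop checks the size the
      next iteration would use, so nothing smaller than the minimum is ever
      attempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MIN_RING_SIZE 256 /* hypothetical stand-in for ENA_MIN_RING_SIZE */

/* Hypothetical allocator: succeeds only if the size fits the budget. */
static bool try_alloc(unsigned int size, unsigned int budget)
{
	return size <= budget;
}

/*
 * Try to allocate a ring, halving the size on failure.
 * Returns the size that was allocated, or 0 if every attempt failed.
 * *smallest_tried records the last size that was attempted.
 */
static unsigned int alloc_with_backoff(unsigned int size, unsigned int budget,
				       unsigned int *smallest_tried)
{
	assert(smallest_tried != NULL);
	for (;;) {
		*smallest_tried = size;
		if (try_alloc(size, budget))
			return size;
		/* Look at the next iteration's size, not the current one. */
		if (size / 2 < MIN_RING_SIZE)
			return 0;
		size /= 2;
	}
}
```

      Testing the current size after halving instead would let one extra
      attempt through at half the minimum, which is the behavior the commit
      message describes as the bug.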