1. 23 Sep, 2020 6 commits
    • s390/qeth: don't init refcount twice for mcast IPs · 668e2251
      Julian Wiedmann authored
      mcast IP objects are allocated within qeth_l3_add_mcast_rtnl(),
      with .ref_counter already set to 1 via qeth_l3_init_ipaddr().
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      668e2251
    • net: microchip: Make `lan743x_pm_suspend` function return right value · 46237bf3
      Zheng Yongjun authored
      drivers/net/ethernet/microchip/lan743x_main.c: In function lan743x_pm_suspend:
      
      `ret` is set but not used. The value returned by `pci_prepare_to_sleep`
      should be the return value of `lan743x_pm_suspend`, so return it.
      Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      46237bf3
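The pattern behind this fix, propagating a helper's status instead of discarding it, can be sketched in plain C. This is a minimal illustration, not the driver's code: `fake_prepare_to_sleep`, `suspend_before` and `suspend_after` are hypothetical names invented for the example.

```c
#include <errno.h>

/* Hypothetical stand-in for a helper such as pci_prepare_to_sleep():
 * returns 0 on success or a negative errno on failure. */
static int fake_prepare_to_sleep(int should_fail)
{
	return should_fail ? -EIO : 0;
}

/* Before: the status is stored in 'ret' but never used,
 * so the caller always sees success. */
static int suspend_before(int should_fail)
{
	int ret;

	ret = fake_prepare_to_sleep(should_fail);
	(void)ret; /* the error is silently lost here */
	return 0;
}

/* After: the helper's status becomes the function's return value. */
static int suspend_after(int should_fail)
{
	return fake_prepare_to_sleep(should_fail);
}
```

With the second form, a failing helper is visible to whoever called the suspend routine, which is the behavior the commit message asks for.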
    • Merge tag 'mlx5-updates-2020-09-21' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 573a8095
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2020-09-21
      
      Multi packet TX descriptor support for SKBs.
      
      This series introduces some refactoring of the regular TX data path in
      mlx5 and adds the Enhanced TX MPWQE feature support. MPWQE stands for
      multi-packet work queue element, and it can serve multiple packets,
      reducing the PCI bandwidth spent on control traffic. It should improve
      performance in scenarios where PCI is the bottleneck, and xmit_more is
      signaled by the kernel. The refactoring done in this series also
      improves the packet rate on its own.
      
      MPWQE is already implemented in the XDP tx path; this series adds
      MPWQE support for the regular kernel SKB tx path.
      
      MPWQE is supported from ConnectX-5 and onward, for legacy devices we need
      to keep backward compatibility for regular (Single packet) WQE descriptor.
      
      MPWQE is not compatible with certain offloads and features, such as TLS
      offload, TSO, nonlinear SKBs. If such incompatible features are in use,
      the driver gracefully falls back to non-MPWQE per SKB.
      
      Prior to the final patch "net/mlx5e: Enhanced TX MPWQE for SKBs" that adds
      the actual support, Maxim did some refactoring to the tx data path to
      split it into stages and smaller helper functions that can be utilized and
      reused for both legacy and new MPWQE feature.
      
      Performance testing:
      
      UDP performance is improved in a single stream pktgen test:
        Packet rate: 16.86 Mpps (±0.15 Mpps) -> 20.94 Mpps (±0.33 Mpps)
        Instructions per packet: 434 -> 329
        Cycles per packet: 158 -> 123
        Instructions per cycle: 2.75 -> 2.67
      
      TCP and XDP_TX single stream tests show no performance difference.
      
      MPWQE can reduce PCI bandwidth:
        PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
          Inbound PCI utilization with MPWQE off: 80.3%
          Inbound PCI utilization with MPWQE on: 59.0%
        PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores:
          Inbound PCI utilization with MPWQE off: 65.4%
          Inbound PCI utilization with MPWQE on: 49.3%
      
      MPWQE can also reduce CPU load, increasing the packet rate in case of
      CPU bottleneck:
        PCI Gen2, pktgen at full rate on 24 CPU cores:
          Packet rate with MPWQE off: 37.5 Mpps
          Packet rate with MPWQE on: 49.0 Mpps
        PCI Gen3, pktgen at full rate on 24 CPU cores:
          Packet rate with MPWQE off: 57.0 Mpps
          Packet rate with MPWQE on: 66.8 Mpps
      
      Burst size in all pktgen tests is 32.
      
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
      NIC: Mellanox ConnectX-6 Dx
      GCC 10.2.0
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      573a8095
    • Merge branch 'devlink-Use-nla_policy-to-validate-range' · 748d1c8a
      David S. Miller authored
      Parav Pandit says:
      
      ====================
      devlink: Use nla_policy to validate range
      
      These two small patches use nla_policy to validate that user-specified
      fields are within the valid range.
      
      Patch summary:
      Patch-1 checks the range of eswitch mode field
      Patch-2 checks for the port type field. It eliminates a check in
      code by using nla policy infrastructure.
      ====================
      Reviewed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      748d1c8a
    • devlink: Enhance policy to validate port type input value · c49a9440
      Parav Pandit authored
      Use the range checking facility of nla_policy to validate the port type
      attribute input value.
      Signed-off-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c49a9440
    • devlink: Enhance policy to validate eswitch mode value · ba356c90
      Parav Pandit authored
      Use the range checking facility of nla_policy to validate the eswitch
      mode input attribute value.
      Signed-off-by: Parav Pandit <parav@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba356c90
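The idea in the two devlink patches, declaring the valid range next to the attribute so that a generic check can reject bad input before any handler runs, can be sketched in userspace C. This is not the kernel's nla_policy code: the attribute IDs, the `range_policy` struct, the `attr_in_range` helper and the numeric bounds are all hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical attribute IDs, for illustration only. */
enum {
	ATTR_UNSPEC,
	ATTR_PORT_TYPE,
	ATTR_ESWITCH_MODE,
	ATTR_MAX = ATTR_ESWITCH_MODE
};

/* Declarative per-attribute range, checked once before any handler runs. */
struct range_policy {
	uint16_t min;
	uint16_t max;
};

/* The bounds below are made up; a real policy would use the enum limits. */
static const struct range_policy policy[ATTR_MAX + 1] = {
	[ATTR_PORT_TYPE]    = { .min = 0, .max = 3 },
	[ATTR_ESWITCH_MODE] = { .min = 0, .max = 1 },
};

/* Generic validation step: unknown attributes and out-of-range values fail. */
static bool attr_in_range(unsigned int attr, uint16_t value)
{
	if (attr == ATTR_UNSPEC || attr > ATTR_MAX)
		return false;
	return value >= policy[attr].min && value <= policy[attr].max;
}
```

Keeping the bounds in a table means the handler for each attribute no longer needs its own open-coded range check, which is the simplification the cover letter mentions for the port type field.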
  2. 22 Sep, 2020 34 commits
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 3ab0a7a0
      David S. Miller authored
      Two minor conflicts:
      
      1) net/ipv4/route.c, adding a new local variable while
         moving another local variable and removing its
         initial assignment.
      
      2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes.
         One pretty prints the port mode differently, whilst another
         changes the driver to try and obtain the port mode from
         the port node rather than the switch node.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3ab0a7a0
    • Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs · 805c6d3c
      Linus Torvalds authored
      Pull vfs fixes from Al Viro:
       "No common topic, just assorted fixes"
      
      * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
        fuse: fix the ->direct_IO() treatment of iov_iter
        fs: fix cast in fsparam_u32hex() macro
        vboxsf: Fix the check for the old binary mount-arguments struct
      805c6d3c
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · d3017135
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
      
       - fix failure to add bond interfaces to a bridge, the offload-handling
         code was too defensive there and recent refactoring unearthed that.
         Users complained (Ido)
      
       - fix unnecessarily reflecting ECN bits within TOS values / QoS marking
         in TCP ACK and reset packets (Wei)
      
       - fix a deadlock with bpf iterator. Hopefully we're in the clear on
         this front now... (Yonghong)
      
       - BPF fix for clobbering r2 in bpf_gen_ld_abs (Daniel)
      
       - fix AQL on mt76 devices with FW rate control and a couple of AQL
         issues in mac80211 code (Felix)
      
       - fix authentication issue with mwifiex (Maximilian)
      
       - WiFi connectivity fix: revert IGTK support in ti/wlcore (Mauro)
      
       - fix exception handling for multipath routes via same device (David
         Ahern)
      
       - revert back to a BH spin lock flavor for nsid_lock: there are paths
         which do require the BH context protection (Taehee)
      
       - fix interrupt / queue / NAPI handling in the lantiq driver (Hauke)
      
       - fix ife module load deadlock (Cong)
      
       - make an adjustment to netlink reply message type for code added in
         this release (the sole change touching uAPI here) (Michal)
      
       - a number of fixes for small NXP and Microchip switches (Vladimir)
      
      [ Pull request acked by David: "you can expect more of this in the
        future as I try to delegate more things to Jakub" ]
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (167 commits)
        net: mscc: ocelot: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
        net: dsa: seville: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
        net: dsa: felix: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries
        inet_diag: validate INET_DIAG_REQ_PROTOCOL attribute
        net: bridge: br_vlan_get_pvid_rcu() should dereference the VLAN group under RCU
        net: Update MAINTAINERS for MediaTek switch driver
        net/mlx5e: mlx5e_fec_in_caps() returns a boolean
        net/mlx5e: kTLS, Avoid kzalloc(GFP_KERNEL) under spinlock
        net/mlx5e: kTLS, Fix leak on resync error flow
        net/mlx5e: kTLS, Add missing dma_unmap in RX resync
        net/mlx5e: kTLS, Fix napi sync and possible use-after-free
        net/mlx5e: TLS, Do not expose FPGA TLS counter if not supported
        net/mlx5e: Fix using wrong stats_grps in mlx5e_update_ndo_stats()
        net/mlx5e: Fix multicast counter not up-to-date in "ip -s"
        net/mlx5e: Fix endianness when calculating pedit mask first bit
        net/mlx5e: Enable adding peer miss rules only if merged eswitch is supported
        net/mlx5e: CT: Fix freeing ct_label mapping
        net/mlx5e: Fix memory leak of tunnel info when rule under multipath not ready
        net/mlx5e: Use synchronize_rcu to sync with NAPI
        net/mlx5e: Use RCU to protect rq->xdp_prog
        ...
      d3017135
    • Merge tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block · 0baca070
      Linus Torvalds authored
      Pull io_uring fixes from Jens Axboe:
       "A few fixes - most of them regression fixes from this cycle, but also
        a few stable heading fixes, and a build fix for the included demo tool
        since some systems now actually have gettid() available"
      
      * tag 'io_uring-5.9-2020-09-22' of git://git.kernel.dk/linux-block:
        io_uring: fix openat/openat2 unified prep handling
        io_uring: mark statx/files_update/epoll_ctl as non-SQPOLL
        tools/io_uring: fix compile breakage
        io_uring: don't use retry based buffered reads for non-async bdev
        io_uring: don't re-setup vecs/iter in io_resumit_prep() is already there
        io_uring: don't run task work on an exiting task
        io_uring: drop 'ctx' ref on task work cancelation
        io_uring: grab any needed state during defer prep
      0baca070
    • Merge tag 'block-5.9-2020-09-22' of git://git.kernel.dk/linux-block · c37b7189
      Linus Torvalds authored
      Pull block fixes from Jens Axboe:
       "A few NVMe fixes, and a dasd write zero fix"
      
      * tag 'block-5.9-2020-09-22' of git://git.kernel.dk/linux-block:
        nvmet: get transport reference for passthru ctrl
        nvme-core: get/put ctrl and transport module in nvme_dev_open/release()
        nvme-tcp: fix kconfig dependency warning when !CRYPTO
        nvme-pci: disable the write zeros command for Intel 600P/P3100
        s390/dasd: Fix zero write for FBA devices
      c37b7189
    • Merge tag 'trace-v5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · eff48dde
      Linus Torvalds authored
      Pull tracing fixes from Steven Rostedt:
      
       - Check kprobe is enabled before unregistering from ftrace as it isn't
         registered when disabled.
      
       - Remove kprobes enabled via the command line that are on init text
         when it is freed.
      
       - Add missing RCU synchronization for ftrace trampoline symbols removed
         from kallsyms.
      
       - Free trampoline on error path if ftrace_startup() fails.
      
       - Give more space for the longer PID numbers in trace output.
      
       - Fix a possible double free in the histogram code.
      
       - A couple of fixes that were discovered by sparse.
      
      * tag 'trace-v5.9-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
        bootconfig: init: make xbc_namebuf static
        kprobes: tracing/kprobes: Fix to kill kprobes on initmem after boot
        tracing: fix double free
        ftrace: Let ftrace_enable_sysctl take a kernel pointer buffer
        tracing: Make the space reserved for the pid wider
        ftrace: Fix missing synchronize_rcu() removing trampoline from kallsyms
        ftrace: Free the trampoline when ftrace_startup() fails
        kprobes: Fix to check probe enabled before disarm_kprobe_ftrace()
      eff48dde
    • net/mlx5e: Enhanced TX MPWQE for SKBs · 5af75c74
      Maxim Mikityanskiy authored
      This commit adds support for the Enhanced TX MPWQE feature in the regular
      (SKB) data path. An MPWQE (multi-packet work queue element) can serve
      multiple packets, reducing the PCI bandwidth spent on control traffic.
      
      Two new stats (tx*_mpwqe_blks and tx*_mpwqe_pkts) are added. The feature
      is on by default and controlled by the skb_tx_mpwqe private flag.
      
      In a MPWQE, eseg is shared among all packets, so eseg-based offloads
      (IPSEC, GENEVE, checksum) run on a separate eseg that is compared to the
      eseg of the current MPWQE session to decide if the new packet can be
      added to the same session.
      
      MPWQE is not compatible with certain offloads and features, such as TLS
      offload, TSO, nonlinear SKBs. If such incompatible features are in use,
      the driver gracefully falls back to non-MPWQE.
      
      This change has no performance impact in TCP single stream test and
      XDP_TX single stream test.
      
      UDP pktgen, 64-byte packets, single stream, MPWQE off:
        Packet rate: 16.96 Mpps (±0.12 Mpps) -> 17.01 Mpps (±0.20 Mpps)
        Instructions per packet: 421 -> 429
        Cycles per packet: 156 -> 161
        Instructions per cycle: 2.70 -> 2.67
      
      UDP pktgen, 64-byte packets, single stream, MPWQE on:
        Packet rate: 16.96 Mpps (±0.12 Mpps) -> 20.94 Mpps (±0.33 Mpps)
        Instructions per packet: 421 -> 329
        Cycles per packet: 156 -> 123
        Instructions per cycle: 2.70 -> 2.67
      
      Enabling MPWQE can reduce PCI bandwidth:
        PCI Gen2, pktgen at fixed rate of 36864000 pps on 24 CPU cores:
          Inbound PCI utilization with MPWQE off: 80.3%
          Inbound PCI utilization with MPWQE on: 59.0%
        PCI Gen3, pktgen at fixed rate of 56064000 pps on 24 CPU cores:
          Inbound PCI utilization with MPWQE off: 65.4%
          Inbound PCI utilization with MPWQE on: 49.3%
      
      Enabling MPWQE can also reduce CPU load, increasing the packet rate in
      case of CPU bottleneck:
        PCI Gen2, pktgen at full rate on 24 CPU cores:
          Packet rate with MPWQE off: 37.5 Mpps
          Packet rate with MPWQE on: 49.0 Mpps
        PCI Gen3, pktgen at full rate on 24 CPU cores:
          Packet rate with MPWQE off: 57.0 Mpps
          Packet rate with MPWQE on: 66.8 Mpps
      
      Burst size in all pktgen tests is 32.
      
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
      NIC: Mellanox ConnectX-6 Dx
      GCC 10.2.0
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      5af75c74
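The session logic described in this commit (one eseg shared by all packets of an MPWQE, a comparison to decide whether a packet may join the open session, and a fallback for incompatible packets) can be sketched roughly as follows. This is a simplified model, not the mlx5e implementation: the struct layouts, `MAX_PKTS_PER_SESSION`, the `linear` flag as the only eligibility test, and the `tx_one` function are assumptions made for the example.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_PKTS_PER_SESSION 4 /* hypothetical capacity */

/* Hypothetical stand-in for the per-descriptor Ethernet segment. */
struct eseg {
	unsigned char bytes[16];
};

struct session {
	bool open;
	struct eseg eseg; /* shared by every packet in the session */
	int pkts;
};

struct pkt {
	struct eseg eseg; /* result of the eseg-based offloads for this packet */
	bool linear;      /* nonlinear packets are treated as not eligible */
};

enum tx_path { TX_SINGLE, TX_MULTI_NEW, TX_MULTI_APPEND };

static enum tx_path tx_one(struct session *s, const struct pkt *p)
{
	if (!p->linear) {
		/* Incompatible packet: close the session and fall back
		 * to a single-packet descriptor. */
		s->open = false;
		return TX_SINGLE;
	}

	/* Join the open session only if it has room and the eseg matches. */
	if (s->open && s->pkts < MAX_PKTS_PER_SESSION &&
	    memcmp(&s->eseg, &p->eseg, sizeof(s->eseg)) == 0) {
		s->pkts++;
		return TX_MULTI_APPEND;
	}

	/* Otherwise start a new session that adopts this packet's eseg. */
	s->open = true;
	s->eseg = p->eseg;
	s->pkts = 1;
	return TX_MULTI_NEW;
}
```

Running the offloads on a separate eseg first, and only then comparing, is what lets the decision be made before anything is written into the shared descriptor.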
    • net/mlx5e: Move TX code into functions to be used by MPWQE · 67044a88
      Maxim Mikityanskiy authored
      mlx5e_txwqe_complete performs some actions that can be moved into
      separate functions:
      
      1. Update the flags needed for hardware timestamping.
      
      2. Stop the TX queue if it's full.
      
      Move these actions into separate functions to be reused by the MPWQE
      code in the following commit and to keep the responsibilities of each
      function clear.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      67044a88
    • net/mlx5e: Rename xmit-related structs to generalize them · b39fe61e
      Maxim Mikityanskiy authored
      As preparation for the upcoming TX MPWQE support for SKBs, rename struct
      mlx5e_xdp_mpwqe to mlx5e_tx_mpwqe and move it above struct mlx5e_txqsq.
      This structure will be reused in the regular SQ and in the regular TX
      data path. Also rename mlx5e_xdp_xmit_data to mlx5e_xmit_data - it will
      be used in the upcoming TX MPWQE flow.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      b39fe61e
    • net/mlx5e: Generalize TX MPWQE checks for full session · 530d5ce2
      Maxim Mikityanskiy authored
      As preparation for the upcoming TX MPWQE for SKBs, create a function
      (mlx5e_tx_mpwqe_is_full) to check whether an MPWQE session is full. This
      function will be shared by MPWQE code for XDP and for SKBs. Defines are
      renamed and moved to make them not XDP-specific.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      530d5ce2
    • net/mlx5e: Support multiple SKBs in a TX WQE · 338c46c6
      Maxim Mikityanskiy authored
      TX MPWQE support for SKBs is coming in one of the following patches, and
      a single MPWQE can send multiple SKBs. This commit prepares the TX path
      code to handle such cases:
      
      1. An additional FIFO for SKBs is added, just like the FIFO for DMA
      chunks.
      
      2. struct mlx5e_tx_wqe_info will contain num_fifo_pkts. If a given WQE
      contains only one packet, num_fifo_pkts will be zero, and the SKB will
      be stored in mlx5e_tx_wqe_info, as usual. If num_fifo_pkts > 0, the SKB
      pointer will be NULL, and the SKBs will be stored in the FIFO.
      
      This change has no performance impact in TCP single stream test and
      XDP_TX single stream test.
      
      When compiled with a recent GCC, this change shows no visible
      performance impact on UDP pktgen (burst 32) single stream test either:
        Packet rate: 16.95 Mpps (±0.15 Mpps) -> 16.96 Mpps (±0.12 Mpps)
        Instructions per packet: 429 -> 421
        Cycles per packet: 160 -> 156
        Instructions per cycle: 2.69 -> 2.70
      
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
      NIC: Mellanox ConnectX-6 Dx
      GCC 10.2.0
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      338c46c6
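The num_fifo_pkts convention described in this commit can be modelled with a small ring buffer. The sketch below is only an illustration under stated assumptions: `struct skb`, `skb_fifo`, `wqe_info`, `FIFO_SIZE` and `complete_wqe` are hypothetical simplifications, and summing packet ids stands in for whatever the driver does with each completed SKB.

```c
#include <stddef.h>

#define FIFO_SIZE 8 /* hypothetical; a power of two so masking works */

struct skb {
	int id; /* hypothetical stand-in for struct sk_buff */
};

struct skb_fifo {
	struct skb *slot[FIFO_SIZE];
	unsigned int pc; /* producer counter */
	unsigned int cc; /* consumer counter */
};

struct wqe_info {
	struct skb *skb;            /* set only for single-packet WQEs */
	unsigned int num_fifo_pkts; /* 0 for single-packet WQEs */
};

static void fifo_push(struct skb_fifo *f, struct skb *skb)
{
	f->slot[f->pc++ & (FIFO_SIZE - 1)] = skb;
}

static struct skb *fifo_pop(struct skb_fifo *f)
{
	return f->slot[f->cc++ & (FIFO_SIZE - 1)];
}

/* Completion handling: returns the number of packets completed and adds
 * their ids to *id_sum. */
static unsigned int complete_wqe(struct skb_fifo *f,
				 const struct wqe_info *wi, int *id_sum)
{
	unsigned int i;

	if (wi->num_fifo_pkts == 0) {
		if (!wi->skb)
			return 0; /* WQE without a packet, e.g. padding */
		*id_sum += wi->skb->id;
		return 1;
	}

	/* Multi-packet WQE: the SKBs were queued in the FIFO in order. */
	for (i = 0; i < wi->num_fifo_pkts; i++)
		*id_sum += fifo_pop(f)->id;
	return wi->num_fifo_pkts;
}
```

Keeping the single-packet case in `wqe_info` itself means the common path does not touch the FIFO at all, which fits the commit's claim of no visible performance impact.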
    • net/mlx5e: Move the TLS resync check out of the function · 56e4da66
      Maxim Mikityanskiy authored
      Before this patch, mlx5e_ktls_tx_handle_resync_dump_comp checked for
      resync_dump_frag_page. It happened for all WQEs without an SKB,
      including padding WQEs, and required a function call. Normally, padding
      WQEs happen more often than TLS resyncs. Take this check out of the
      function and put it into an inline function to save a call on all
      padding WQEs.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      56e4da66
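The shape of this optimization, a cheap inline test in front of an out-of-line slow path, looks roughly like the sketch below. It is an illustration only: the function names, the reduced `wqe_info` struct and the `slow_calls` counter are hypothetical, and in a single file like this the compiler may inline everything anyway, whereas the commit is about a call that crosses into a separately compiled function.

```c
#include <stdbool.h>
#include <stddef.h>

struct wqe_info {
	void *resync_dump_frag_page; /* NULL for padding WQEs */
};

static int slow_calls; /* counts slow-path calls, for demonstration */

/* Slow path: only meaningful for TLS resync dump WQEs. */
static void handle_resync_dump_comp(struct wqe_info *wi)
{
	slow_calls++;
	wi->resync_dump_frag_page = NULL;
}

/* Inline wrapper: padding WQEs return here without paying for the call. */
static inline bool try_handle_resync_dump_comp(struct wqe_info *wi)
{
	if (!wi->resync_dump_frag_page)
		return false;

	handle_resync_dump_comp(wi);
	return true;
}
```

Because padding WQEs are the more frequent case according to the commit message, testing the pointer before the call removes work from the common path and leaves the rare path unchanged.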
    • net/mlx5e: Unify constants for WQE_EMPTY_DS_COUNT · 97e3afd6
      Maxim Mikityanskiy authored
      A constant for the number of DS in an empty WQE (i.e. a WQE without data
      segments) is needed in multiple places (normal TX data path, MPWQE in
      XDP), but currently we have a constant for XDP and an inline formula in
      normal TX. This patch introduces a common constant.
      
      Additionally, mlx5e_xdp_mpwqe_session_start is converted to use struct
      assignment, because the code nearby is touched.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      97e3afd6
    • net/mlx5e: Small improvements for XDP TX MPWQE logic · 388a2b56
      Maxim Mikityanskiy authored
      Use MLX5E_XDP_MPW_MAX_WQEBBS to reserve space for a MPWQE, because it's
      actually the maximal size a MPWQE can take.
      
      Reorganize the logic that checks when to close the MPWQE session:
      
      1. Put all checks into a single function.
      
      2. When inline is on, make only one comparison - if it's false, the less
      strict one will also be false. The compiler probably optimized it out
      anyway, but it's clearer to also reflect it in the code.
      
      The MLX5E_XDP_INLINE_WQE_* defines are also changed to make the
      calculations more correct from the logical point of view. Though
      MLX5E_XDP_INLINE_WQE_MAX_DS_CNT used to be 16 and didn't change its
      value, the calculation used to be DIV_ROUND_UP(max inline packet size,
      MLX5_SEND_WQE_DS), and the numerator should have included sizeof(struct
      mlx5_wqe_inline_seg).
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      388a2b56
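The remark about the numerator can be checked with a small worked example. The sizes below (a 16-byte data segment, a 4-byte inline segment header, payloads of 252 and 256 bytes) are assumptions picked for illustration and are not taken from the driver headers; `ds_cnt` is a hypothetical helper.

```c
#include <stdbool.h>

/* Same rounding-up division idiom as the kernel macro of this name. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

enum {
	DS_SIZE = 16,       /* assumed size of one data segment, in bytes */
	INLINE_SEG_HDR = 4, /* assumed size of the inline segment header */
};

/* Number of data segments needed for an inline payload, with or without
 * counting the inline segment header in the numerator. */
static unsigned int ds_cnt(unsigned int payload, bool include_hdr)
{
	unsigned int bytes = payload + (include_hdr ? INLINE_SEG_HDR : 0);

	return DIV_ROUND_UP(bytes, DS_SIZE);
}
```

Under these assumed sizes, a 252-byte payload needs 16 segments whether or not the header is counted, which would explain a constant keeping its value, while a 256-byte payload needs 16 without the header and 17 with it, which shows why the header belongs in the numerator.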
    • net/mlx5e: Refactor xmit functions · 8e4b53f6
      Maxim Mikityanskiy authored
      A huge function mlx5e_sq_xmit was split into several to achieve multiple
      goals:
      
      1. Reuse the code in IPoIB.
      
      2. Better integrate with TLS, IPSEC, GENEVE and checksum offloads. Now
      it's possible to reserve space in the WQ before running eseg-based
      offloads, so:
      
      2.1. It's not needed to copy cseg and eseg after mlx5e_fill_sq_frag_edge
      anymore.
      
      2.2. mlx5e_txqsq_get_next_pi will be used instead of the legacy
      mlx5e_fill_sq_frag_edge for better code maintainability and reuse.
      
      3. Prepare for the upcoming TX MPWQE for SKBs. It will intervene after
      mlx5e_sq_calc_wqe_attr to check if it's possible to use MPWQE, and the
      code flow will split into two paths: MPWQE and non-MPWQE.
      
      Two high-level functions are provided to send packets:
      
      * mlx5e_xmit is called by the networking stack, runs offloads and sends
      the packet. In one of the following patches, MPWQE support will be added
      to this flow.
      
      * mlx5e_sq_xmit_simple is called by the TLS offload, runs only the
      checksum offload and sends the packet.
      
      This change has no performance impact in TCP single stream test and
      XDP_TX single stream test.
      
      When compiled with a recent GCC, this change shows no visible
      performance impact on UDP pktgen (burst 32) single stream test either:
        Packet rate: 16.86 Mpps (±0.15 Mpps) -> 16.95 Mpps (±0.15 Mpps)
        Instructions per packet: 434 -> 429
        Cycles per packet: 158 -> 160
        Instructions per cycle: 2.75 -> 2.69
      
      CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (x86_64)
      NIC: Mellanox ConnectX-6 Dx
      GCC 10.2.0
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8e4b53f6
    • net/mlx5e: Move mlx5e_tx_wqe_inline_mode to en_tx.c · d02dfcd5
      Maxim Mikityanskiy authored
      Move mlx5e_tx_wqe_inline_mode from en/txrx.h to en_tx.c as it's only
      used there.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      d02dfcd5
    • net/mlx5e: Use struct assignment to initialize mlx5e_tx_wqe_info · 8ba6f183
      Maxim Mikityanskiy authored
      Struct assignment guarantees that all fields of the structure are
      initialized (those that are not mentioned are zeroed). It makes the code
      more robust and reduces the chance of unpredictable behavior when one
      forgets to reset some field and it holds an old value from a previous
      use of the structure.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8ba6f183
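The difference between field-by-field updates and struct assignment can be seen in a few lines of standard C. The `wqe_info` struct here is a hypothetical, reduced stand-in for mlx5e_tx_wqe_info, and both fill functions are invented for the example.

```c
/* Hypothetical, reduced stand-in for a reused per-WQE bookkeeping struct. */
struct wqe_info {
	unsigned int num_bytes;
	unsigned int num_dma;
	unsigned int num_wqebbs;
	unsigned int num_fifo_pkts;
};

/* Field-by-field: any field not listed keeps its value from the last use. */
static void fill_fieldwise(struct wqe_info *wi, unsigned int bytes)
{
	wi->num_bytes = bytes;
	wi->num_wqebbs = 1;
}

/* Struct assignment from a compound literal: fields without an explicit
 * initializer are zeroed, as the C standard requires. */
static void fill_by_assignment(struct wqe_info *wi, unsigned int bytes)
{
	*wi = (struct wqe_info) {
		.num_bytes = bytes,
		.num_wqebbs = 1,
	};
}
```

If a new field is later added to the struct, the second form zeroes it automatically at every site that uses the assignment, whereas the first form silently carries over whatever the slot held before.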
    • net/mlx5e: Refactor inline header size calculation in the TX path · 6d55af43
      Maxim Mikityanskiy authored
      As preparation for the next patch, don't increase ihs to calculate
      ds_cnt and then decrease it, but rather calculate the intermediate value
      temporarily. This code has the same number of arithmetic operations, but
      now allows the ds_cnt calculation to be split out, which will be done in
      the next patch.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      6d55af43
    • Merge branch 'Fix-broken-tc-flower-rules-for-mscc_ocelot-switches' · b334ec66
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      Fix broken tc-flower rules for mscc_ocelot switches
      
      All 3 switch drivers from the Ocelot family have the same bug in the
      VCAP IS2 key offsets, which is that some keys are in the incorrect
      order.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b334ec66
    • net: mscc: ocelot: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries · 8194d8fa
      Vladimir Oltean authored
      The IS2 IP4_TCP_UDP key offsets do not correspond to the VSC7514
      datasheet. Whether they work or not is unknown to me. On VSC9959 and
      VSC9953, with the same mistake and same discrepancy from the
      documentation, tc-flower src_port and dst_port rules did not work, so I
      am assuming the same is true here.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8194d8fa
    • net: dsa: seville: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries · 7a023075
      Vladimir Oltean authored
      Since these were copied from the Felix VCAP IS2 code, and only the
      offsets were adjusted, the order of the bit fields is still wrong.
      Fix it.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7a023075
    • net: dsa: felix: fix some key offsets for IP4_TCP_UDP VCAP IS2 entries · 8b9e03cd
      Xiaoliang Yang authored
      Some of the IS2 IP4_TCP_UDP keys are not correct, like L4_DPORT,
      L4_SPORT and other L4 keys. This prevents offloaded tc-flower rules from
      matching on src_port and dst_port for TCP and UDP packets.
      Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8b9e03cd
    • inet_diag: validate INET_DIAG_REQ_PROTOCOL attribute · d5e4d0a5
      Eric Dumazet authored
      User space could send an invalid INET_DIAG_REQ_PROTOCOL attribute
      as caught by syzbot.
      
      BUG: KMSAN: uninit-value in inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline]
      BUG: KMSAN: uninit-value in __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147
      CPU: 0 PID: 8505 Comm: syz-executor174 Not tainted 5.9.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x21c/0x280 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:122
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:219
       inet_diag_lock_handler net/ipv4/inet_diag.c:55 [inline]
       __inet_diag_dump+0x58c/0x720 net/ipv4/inet_diag.c:1147
       inet_diag_dump_compat+0x2a5/0x380 net/ipv4/inet_diag.c:1254
       netlink_dump+0xb73/0x1cb0 net/netlink/af_netlink.c:2246
       __netlink_dump_start+0xcf2/0xea0 net/netlink/af_netlink.c:2354
       netlink_dump_start include/linux/netlink.h:246 [inline]
       inet_diag_rcv_msg_compat+0x5da/0x6c0 net/ipv4/inet_diag.c:1288
       sock_diag_rcv_msg+0x24f/0x620 net/core/sock_diag.c:256
       netlink_rcv_skb+0x6d7/0x7e0 net/netlink/af_netlink.c:2470
       sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:275
       netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
       netlink_unicast+0x11c8/0x1490 net/netlink/af_netlink.c:1330
       netlink_sendmsg+0x173a/0x1840 net/netlink/af_netlink.c:1919
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg net/socket.c:671 [inline]
       ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353
       ___sys_sendmsg net/socket.c:2407 [inline]
       __sys_sendmsg+0x6d1/0x820 net/socket.c:2440
       __do_sys_sendmsg net/socket.c:2449 [inline]
       __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
       do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x441389
      Code: e8 fc ab 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fff3b02ce98 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441389
      RDX: 0000000000000000 RSI: 0000000020001500 RDI: 0000000000000003
      RBP: 00000000006cb018 R08: 00000000004002c8 R09: 00000000004002c8
      R10: 0000000000000004 R11: 0000000000000246 R12: 0000000000402130
      R13: 00000000004021c0 R14: 0000000000000000 R15: 0000000000000000
      
      Uninit was created at:
       kmsan_save_stack_with_flags mm/kmsan/kmsan.c:143 [inline]
       kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:126
       kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:80
       slab_alloc_node mm/slub.c:2907 [inline]
       __kmalloc_node_track_caller+0x9aa/0x12f0 mm/slub.c:4511
       __kmalloc_reserve net/core/skbuff.c:142 [inline]
       __alloc_skb+0x35f/0xb30 net/core/skbuff.c:210
       alloc_skb include/linux/skbuff.h:1094 [inline]
       netlink_alloc_large_skb net/netlink/af_netlink.c:1176 [inline]
       netlink_sendmsg+0xdb9/0x1840 net/netlink/af_netlink.c:1894
       sock_sendmsg_nosec net/socket.c:651 [inline]
       sock_sendmsg net/socket.c:671 [inline]
       ____sys_sendmsg+0xc82/0x1240 net/socket.c:2353
       ___sys_sendmsg net/socket.c:2407 [inline]
       __sys_sendmsg+0x6d1/0x820 net/socket.c:2440
       __do_sys_sendmsg net/socket.c:2449 [inline]
       __se_sys_sendmsg+0x97/0xb0 net/socket.c:2447
       __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2447
       do_syscall_64+0x9f/0x140 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 3f935c75 ("inet_diag: support for wider protocol numbers")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Christoph Paasch <cpaasch@apple.com>
      Cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d5e4d0a5
    • Vladimir Oltean's avatar
      net: bridge: br_vlan_get_pvid_rcu() should dereference the VLAN group under RCU · 99f62a74
      Vladimir Oltean authored
      When calling the RCU brother of br_vlan_get_pvid(), lockdep warns:
      
      =============================
      WARNING: suspicious RCU usage
      5.9.0-rc3-01631-g13c17acb8e38-dirty #814 Not tainted
      -----------------------------
      net/bridge/br_private.h:1054 suspicious rcu_dereference_protected() usage!
      
      Call trace:
       lockdep_rcu_suspicious+0xd4/0xf8
       __br_vlan_get_pvid+0xc0/0x100
       br_vlan_get_pvid_rcu+0x78/0x108
      
      The warning is because br_vlan_get_pvid_rcu() calls nbp_vlan_group()
      which calls rtnl_dereference() instead of rcu_dereference(). In turn,
      rtnl_dereference() calls rcu_dereference_protected() which assumes
      operation under an RCU write-side critical section, which obviously is
      not the case here. So, when the incorrect primitive is used to access
      the RCU-protected VLAN group pointer, READ_ONCE() is not used, which may
      cause various unexpected problems.
      
      I'm sad to say that br_vlan_get_pvid() and br_vlan_get_pvid_rcu() cannot
      share the same implementation. So fix the bug by splitting the 2
      functions, and making br_vlan_get_pvid_rcu() retrieve the VLAN groups
      under proper locking annotations.
      
      Fixes: 7582f5b7 ("bridge: add br_vlan_get_pvid_rcu()")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      99f62a74
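
The distinction described in the br_vlan_get_pvid_rcu() commit above can be sketched in userspace C. This is only an illustrative model, not the bridge code: every `demo_` name is hypothetical, the two flag variables stand in for lockdep's knowledge of held locks, and a C11 acquire load is used as a (stronger) stand-in for the dependency-ordered load that rcu_dereference() performs in the kernel.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for lock state; the kernel tracks this via lockdep. */
static bool demo_rtnl_held;
static int demo_rcu_read_depth;
static int demo_suspicious_count;	/* models "suspicious RCU usage" reports */

struct demo_vlan_group { unsigned short pvid; };
struct demo_port { _Atomic(struct demo_vlan_group *) vlgrp; };

/* Update-side accessor: a plain (relaxed) load, valid only under the update lock. */
static struct demo_vlan_group *demo_group_locked(struct demo_port *p)
{
	if (!demo_rtnl_held)
		demo_suspicious_count++;
	return atomic_load_explicit(&p->vlgrp, memory_order_relaxed);
}

/* Read-side accessor: an ordered load, valid inside a read-side critical section. */
static struct demo_vlan_group *demo_group_rcu(struct demo_port *p)
{
	if (demo_rcu_read_depth == 0)
		demo_suspicious_count++;
	return atomic_load_explicit(&p->vlgrp, memory_order_acquire);
}

/* The two getters cannot share one body because they need different accessors. */
static unsigned short demo_get_pvid(struct demo_port *p)
{
	struct demo_vlan_group *vg = demo_group_locked(p);

	return vg ? vg->pvid : 0;
}

static unsigned short demo_get_pvid_rcu(struct demo_port *p)
{
	struct demo_vlan_group *vg = demo_group_rcu(p);

	return vg ? vg->pvid : 0;
}
```

In this model, calling demo_get_pvid() while only the read-side depth is non-zero bumps the suspicious counter, which mirrors the lockdep warning quoted in the commit message.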
    • David S. Miller's avatar
      Merge tag 'mlx5-fixes-2020-09-18' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 47cec3f6
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5 fixes-2020-09-18
      
      This series introduces some fixes to mlx5 driver.
      
      Please pull and let me know if there is any problem.
      
      v1->v2:
       Remove missing patch from -stable list.
      
      For -stable v5.1
       ('net/mlx5: Fix FTE cleanup')
      
      For -stable v5.3
       ('net/mlx5e: TLS, Do not expose FPGA TLS counter if not supported')
       ('net/mlx5e: Enable adding peer miss rules only if merged eswitch is supported')
      
      For -stable v5.7
       ('net/mlx5e: Fix memory leak of tunnel info when rule under multipath not ready')
      
      For -stable v5.8
       ('net/mlx5e: Use RCU to protect rq->xdp_prog')
       ('net/mlx5e: Fix endianness when calculating pedit mask first bit')
       ('net/mlx5e: Use synchronize_rcu to sync with NAPI')
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      47cec3f6
    • Sean Wang's avatar
      net: Update MAINTAINERS for MediaTek switch driver · 2b617c11
      Sean Wang authored
      Update maintainers for MediaTek switch driver with Landen Chao who is
      familiar with MediaTek MT753x switch devices and will help maintenance
      from the vendor side.
      
      Cc: Steven Liu <steven.liu@mediatek.com>
      Signed-off-by: Sean Wang <sean.wang@mediatek.com>
      Signed-off-by: Landen Chao <Landen.Chao@mediatek.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b617c11
    • Saeed Mahameed's avatar
      net/mlx5e: mlx5e_fec_in_caps() returns a boolean · cb39ccc5
      Saeed Mahameed authored
      Returning errno is a bug, fix that.
      
      Also fixes smatch warnings:
      drivers/net/ethernet/mellanox/mlx5/core/en/port.c:453
      mlx5e_fec_in_caps() warn: signedness bug returning '(-95)'
      
      Fixes: 2132b71f ("net/mlx5e: Advertise globaly supported FEC modes")
      Reported-by: kernel test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Aya Levin <ayal@nvidia.com>
      cb39ccc5
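
Why returning an errno from a boolean function is a bug, as the mlx5e_fec_in_caps() commit above states, can be shown with a small standalone sketch. The function names, parameters and bitmask check below are hypothetical and do not reproduce the driver's code; the only point is that any non-zero value, including a negative error code, converts to true.

```c
#include <errno.h>
#include <stdbool.h>

/* Buggy shape: the negative errno is converted to bool and becomes true. */
static bool demo_in_caps_buggy(unsigned int caps, unsigned int mode)
{
	if (mode == 0)
		return -EOPNOTSUPP;	/* bug: reads as "supported" to the caller */
	return (caps & mode) != 0;
}

/* Fixed shape: an unsupported request is reported as false. */
static bool demo_in_caps_fixed(unsigned int caps, unsigned int mode)
{
	if (mode == 0)
		return false;
	return (caps & mode) != 0;
}
```

A caller that tests the result with `if (demo_in_caps_buggy(caps, 0))` would take the "supported" branch, which is the kind of mistake the smatch signedness warning points at.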
    • Saeed Mahameed's avatar
      net/mlx5e: kTLS, Avoid kzalloc(GFP_KERNEL) under spinlock · 94c4fed7
      Saeed Mahameed authored
      The spinlock is only needed when accessing the channel's icosq. Grab the lock
      after the buf allocation in resync_post_get_progress_params() to avoid
      calling kzalloc(GFP_KERNEL) in atomic context.
      
      Fixes: 0419d8c9 ("net/mlx5e: kTLS, Add kTLS RX resync support")
      Reported-by: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      94c4fed7
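
The ordering change in the kTLS spinlock commit above (allocate first, lock afterwards) can be illustrated with a userspace sketch. It assumes a toy spinlock built on C11 atomic_flag and uses calloc() as a stand-in for a sleeping allocation; the `demo_` names and the queue layout are hypothetical and unrelated to the driver's icosq.

```c
#include <stdatomic.h>
#include <stdlib.h>

struct demo_queue {
	atomic_flag lock;	/* toy spinlock */
	void *pending;		/* slot protected by the lock */
};

static void demo_lock(struct demo_queue *q)
{
	while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void demo_unlock(struct demo_queue *q)
{
	atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Returns 0 on success, -1 if allocation fails, -2 if the slot is busy. */
static int demo_post_request(struct demo_queue *q, size_t len)
{
	/* The allocation may block, so it happens before the lock is taken. */
	void *buf = calloc(1, len);

	if (!buf)
		return -1;

	/* The critical section only touches the shared slot. */
	demo_lock(q);
	if (q->pending) {
		demo_unlock(q);
		free(buf);	/* release outside the lock as well */
		return -2;
	}
	q->pending = buf;
	demo_unlock(q);
	return 0;
}
```

Keeping the critical section down to the shared-state update is the design choice here; the allocation and the free both stay outside the locked region.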
    • Saeed Mahameed's avatar
      net/mlx5e: kTLS, Fix leak on resync error flow · 581642f3
      Saeed Mahameed authored
      Resync progress params buffer and dma weren't released on error.
      Add the missing error unwinding for resync_post_get_progress_params().
      
      Fixes: 0419d8c9 ("net/mlx5e: kTLS, Add kTLS RX resync support")
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      581642f3
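
The "missing error unwinding" mentioned in the resync leak commit above usually takes the form of goto labels that release resources in reverse order of acquisition. The sketch below is a generic illustration of that pattern, not the driver's function: the `demo_` helpers are hypothetical, a second heap allocation stands in for the DMA mapping, and `fail_post` is an artificial switch to force the error path.

```c
#include <stdlib.h>

struct demo_resync {
	void *buf;
	void *mapping;	/* stands in for a DMA mapping */
};

static int demo_live_allocs;	/* lets a caller check that nothing leaked */

static void *demo_alloc(void)
{
	void *p = malloc(16);

	if (p)
		demo_live_allocs++;
	return p;
}

static void demo_free(void *p)
{
	if (p) {
		free(p);
		demo_live_allocs--;
	}
}

static int demo_resync_post(struct demo_resync *r, int fail_post)
{
	int err;

	r->buf = demo_alloc();
	if (!r->buf)
		return -1;

	r->mapping = demo_alloc();
	if (!r->mapping) {
		err = -1;
		goto err_free_buf;
	}

	if (fail_post) {	/* simulated failure after both resources exist */
		err = -2;
		goto err_unmap;
	}
	return 0;

err_unmap:
	demo_free(r->mapping);
	r->mapping = NULL;
err_free_buf:
	demo_free(r->buf);
	r->buf = NULL;
	return err;
}
```

After a forced failure, `demo_live_allocs` returns to zero, which is the property the fix restores for the real buffer and mapping.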
    • Saeed Mahameed's avatar
      net/mlx5e: kTLS, Add missing dma_unmap in RX resync · 66ce5fc0
      Saeed Mahameed authored
      Progress params dma address is never unmapped, unmap it when completion
      handling is over.
      
      Fixes: 0419d8c9 ("net/mlx5e: kTLS, Add kTLS RX resync support")
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      66ce5fc0
    • Tariq Toukan's avatar
      net/mlx5e: kTLS, Fix napi sync and possible use-after-free · 6e8de0b6
      Tariq Toukan authored
      Using synchronize_rcu() is sufficient to wait until running NAPI quits.
      
      See similar upstream fix with detailed explanation:
      ("net/mlx5e: Use synchronize_rcu to sync with NAPI")
      
      This change also fixes a possible use-after-free as the NAPI
      might be already released at this stage.
      
      Fixes: 0419d8c9 ("net/mlx5e: kTLS, Add kTLS RX resync support")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      6e8de0b6
    • Tariq Toukan's avatar
      net/mlx5e: TLS, Do not expose FPGA TLS counter if not supported · 8f0bcd19
      Tariq Toukan authored
      The set of TLS TX global SW counters in mlx5e_tls_sw_stats_desc
      is updated from all rings by using atomic ops.
      This set of stats is used only in the FPGA TLS use case, not in
      the Connect-X TLS one, where regular per-ring counters are used.
      
      Do not expose them in the Connect-X use case, as this would cause
      counter duplication. For example, tx_tls_drop_no_sync_data would
      appear twice in the ethtool stats.
      
      Fixes: d2ead1f3 ("net/mlx5e: Add kTLS TX HW offload support")
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      8f0bcd19
    • Alaa Hleihel's avatar
      net/mlx5e: Fix using wrong stats_grps in mlx5e_update_ndo_stats() · b521105b
      Alaa Hleihel authored
      The cited commit started to reuse function mlx5e_update_ndo_stats() for
      the representors as well.
      However, the function is hard-coded to work on mlx5e_nic_stats_grps only.
      Due to this issue, the representors statistics were not updated in the
      output of "ip -s".
      
      Fix it to work with the correct group by extracting it from the caller's
      profile.
      
      Also, while at it and since this function became generic, move it to
      en_stats.c and rename it accordingly.
      
      Fixes: 8a236b15 ("net/mlx5e: Convert rep stats to mlx5e_stats_grp-based infra")
      Signed-off-by: Alaa Hleihel <alaa@nvidia.com>
      Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      b521105b
    • Ron Diskin's avatar
      net/mlx5e: Fix multicast counter not up-to-date in "ip -s" · 47c97e6b
      Ron Diskin authored
      Currently the FW does not generate events for counters other than error
      counters. Unlike ".get_ethtool_stats", ".ndo_get_stats64" (which ip -s
      uses) might run in atomic context, while the FW interface is non atomic.
      Thus, 'ip' is not allowed to issue FW commands, so it will only display
      cached counters in the driver.
      
      Add a SW counter (mcast_packets) in the driver to count rx multicast
      packets. The counter also counts broadcast packets, as we consider it a
      special case of multicast.
      Use the counter value when calling "ip -s"/"ifconfig".
      
      Fixes: f62b8bb8 ("net/mlx5: Extend mlx5_core to support ConnectX-4 Ethernet functionality")
      Signed-off-by: Ron Diskin <rondi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      47c97e6b
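
The software counter described in the multicast commit above treats broadcast as a special case of multicast. One way this could look, as a sketch only: the struct and function names are hypothetical, and the classification assumes the standard Ethernet rule that the least-significant bit of the first destination octet marks a group address (the broadcast address ff:ff:ff:ff:ff:ff has that bit set too). How the driver actually detects multicast frames is not stated in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

struct demo_rx_stats {
	uint64_t packets;
	uint64_t mcast_packets;	/* multicast and broadcast frames */
};

/* Group-address test: bit 0 of the first destination octet. */
static bool demo_is_multicast(const uint8_t dst[6])
{
	return (dst[0] & 0x01) != 0;
}

/* Per-packet accounting done in software, so reading it needs no FW command. */
static void demo_count_rx(struct demo_rx_stats *s, const uint8_t dst[6])
{
	s->packets++;
	if (demo_is_multicast(dst))
		s->mcast_packets++;
}
```

Because the counter is updated in the receive path and cached in memory, a statistics reader running in atomic context can report it without calling into a non-atomic firmware interface.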