1. 27 Feb, 2020 40 commits
    • selftests: mlxsw: Add mlxsw lib · 4240dbd8
      Shalom Toledo authored
      Add mlxsw lib for common defines, helpers etc.
      Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: devlink_lib: Add devlink port helpers · 9fb74734
      Shalom Toledo authored
      Add two devlink port helpers:
       * devlink port get by netdev
       * devlink cpu port get
      Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: devlink_lib: Check devlink info command is supported · 552ec3d9
      Shalom Toledo authored
      Sanity check for devlink info command.
      Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Add shared buffer configuration test · 6697b51e
      Shalom Toledo authored
      Test physical ports' shared buffer configuration options using random
      values related to a specific configuration option. There are 3
      configuration options: pool, TC bind and portpool.
      
      Each sub-test tests a different configuration option and randomizes the
      related values as follows:
       * For pools, pool's size will be randomized.
       * For TC bind, pool number and threshold will be randomized.
       * For portpools, threshold will be randomized.
      Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Use busywait helper in rtnetlink test · 1cbe65e0
      Danielle Ratson authored
      Rtnetlink test uses offload indication checks.
      
      Use a busywait helper and wait until the offload indication is set or
      fail if it reaches timeout.
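
      The busywait pattern described here (poll a condition until it holds,
      or give up after a timeout) can be sketched in C as follows. The real
      helper is a shell function in the selftests library; the names
      busywait(), cond_fn, third_poll() and never() below are illustrative
      assumptions, and clock() is used only because a busy loop consumes CPU
      time at roughly wall-clock rate.

      ```c
      #include <stdbool.h>
      #include <time.h>

      /* Hypothetical predicate type: returns true once the condition holds. */
      typedef bool (*cond_fn)(void *ctx);

      /* Poll cond() until it returns true or timeout_s seconds of CPU time pass. */
      static bool busywait(double timeout_s, cond_fn cond, void *ctx)
      {
          clock_t start = clock();

          for (;;) {
              if (cond(ctx))
                  return true;   /* e.g. the offload indication is set */
              if ((double)(clock() - start) / CLOCKS_PER_SEC >= timeout_s)
                  return false;  /* timed out: the test should fail */
          }
      }

      /* Illustrative condition: becomes true on the third poll. */
      static bool third_poll(void *ctx)
      {
          int *calls = ctx;

          return ++*calls >= 3;
      }

      /* Illustrative condition that never becomes true. */
      static bool never(void *ctx)
      {
          (void)ctx;
          return false;
      }
      ```

      The condition is always evaluated at least once, so a state that is
      already correct succeeds without any waiting.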
      Signed-off-by: Danielle Ratson <danieller@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Use busywait helper in vxlan test · 05ef614c
      Danielle Ratson authored
      Vxlan test uses offload indication checks.
      
      Use a busywait helper and wait until the offload indication is set or
      fail if it reaches timeout.
      Signed-off-by: Danielle Ratson <danieller@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Use busywait helper in blackhole routes test · 0c22f993
      Danielle Ratson authored
      Blackhole routes test uses offload indication checks.
      
      Use busywait helper and wait until the routes offload indication is set or
      fail if it reaches timeout.
      Signed-off-by: Danielle Ratson <danieller@mellanox.com>
      Reviewed-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: devlink_trap_l3_drops: Avoid race condition · 5d66773f
      Ido Schimmel authored
      The test checks that packets are trapped when they should egress a
      router interface (RIF) that has become disabled. This is a temporary
      state in a RIF's deletion sequence.
      
      Currently, the test deletes the RIF by flushing all the IP addresses
      configured on the associated netdev (br0). However, this is racy, as
      this also flushes all the routes pointing to the netdev and if the
      routes are deleted from the device before the RIF is disabled, then no
      packets will try to egress the disabled RIF and the trap will not be
      triggered.
      
      Instead, trigger the deletion of the RIF by unlinking the mlxsw port
      from the bridge that is backing the RIF. Unlike before, this will not
      cause the kernel to delete the routes pointing to the bridge.
      
      Note that due to the current mlxsw locking scheme the RIF is always deleted
      first, but this is going to change.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: add a mirror test to mlxsw tc flower restrictions · ab2b8ab2
      Jiri Pirko authored
      Add a test verifying that multiple mirror actions are forbidden.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: add egress redirect test to mlxsw tc flower restrictions · c84e903f
      Jiri Pirko authored
      Add a test verifying that a redirect rule on an egress-bound block is forbidden.
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: mlxsw: Add a RED selftest · 3de611b5
      Petr Machata authored
      This tests that below the queue minimum length, there is no dropping /
      marking, and above max, everything is dropped / marked.
      
      The test is structured as a core file with topology and test code, and
      three wrappers: one for RED used as a root Qdisc, and two for
      testing (W)RED under PRIO and ETS.
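
      The two properties checked by this test (nothing dropped or marked
      below the minimum, everything above the maximum) correspond to the two
      ends of the classic RED curve. A minimal sketch, assuming the textbook
      linear ramp between the thresholds; the function name red_prob() and
      the max_p parameter are illustrative and not taken from the test or
      the driver:

      ```c
      /*
       * Drop/mark probability for a given queue length, classic RED shape.
       * Below min nothing is dropped, at or above max everything is; the
       * linear ramp in between is the textbook form and is an assumption
       * here, not something the selftest verifies.
       */
      static double red_prob(double qlen, double min, double max, double max_p)
      {
          if (qlen < min)
              return 0.0;
          if (qlen >= max)
              return 1.0;
          return max_p * (qlen - min) / (max - min);
      }
      ```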
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib.sh: Add start_tcp_traffic · 4113b048
      Petr Machata authored
      Extract a helper __start_traffic() configurable by protocol type. Allow
      passing through extra mausezahn arguments. Add a wrapper,
      start_tcp_traffic().
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'VLANs-DSA-switches-and-multiple-bridges' · 2b99e54b
      David S. Miller authored
      Russell King says:
      
      ====================
      VLANs, DSA switches and multiple bridges
      
      This is a repost of the previously posted RFC back in December, which
      did not get fully reviewed.  I've dropped the RFC tag this time as no
      one really found anything too problematical in the RFC posting.
      
      I've been trying to configure DSA for VLANs and not having much success.
      The setup is quite simple:
      
      - The main network is untagged
      - The wifi network is a vlan tagged with id $VN running over the main
        network.
      
      I have an Armada 388 Clearfog with a PCIe wifi card which I'm trying to
      set up to provide wifi access to the vlan $VN network, while the switch
      is also part of the main network.
      
      However, I'm encountering problems:
      
      1) vlan support in DSA has a different behaviour from the Linux
         software bridge implementation.
      
          # bridge vlan
          port    vlan ids
          lan1     1 PVID Egress Untagged
          ...
      
         shows the default setup - the bridge ports are all configured for
         vlan 1, untagged egress, and vlan 1 as the port vid.  Issuing:
      
          # ip li set dev br0 type bridge vlan_filtering 1
      
         with no other vlan configuration commands on a Linux software bridge
         continues to allow untagged traffic to flow across the bridge.
      
         This difference in behaviour is because the MV88E6xxx VTU is
         completely empty - because net/dsa ignores all vlan settings for
         a port if br_vlan_enabled(dp->bridge_dev) is false - this reflects
         the vlan filtering state of the bridge, not whether the bridge is
         vlan aware.
      
         What this means is that attempting to configure the bridge port
         vlans before enabling vlan filtering works for Linux software
         bridges, but fails for DSA bridges.
      
      2) Assuming the above is sorted, we move on to the next issue, which
         is altogether more weird.  Let's take a setup where we have a
         DSA bridge with lan1..6 in a bridge device, br0, with vlan
         filtering enabled.  lan1 is the upstream port, lan2 is a downstream
         port that also wants to see traffic on vlan id $VN.
      
         Both lan1 and lan2 are configured for that:
      
           # bridge vlan add vid $VN dev lan1
           # bridge vlan add vid $VN dev lan2
           # ip li set br0 type bridge vlan_filtering 1
      
         Untagged traffic can now pass between all the six lan ports, and
         vlan $VN between lan1 and lan2 only.  The MV88E6xxx 8021q_mode
         debugfs file shows all lan ports are in mode "secure" - this is
         important!  /sys/class/net/br0/bridge/vlan_filtering contains 1.
      
         tcpdumping from another machine on lan4 shows that no $VN traffic
         reaches it.  Everything seems to be working correctly...
      
         In order to further bridge vlan $VN traffic to hostapd's wifi
         interface, things get a little more complex - we can't add hostapd's
         wifi interface to br0 directly, because hostapd will bring up the
         wifi interface and leak the main, untagged traffic onto the wifi.
         (hostapd does have vlan support, but only as a dynamic per-client
         thing, and there's no hooks I can see to allow script-based config
         of the network setup before hostapd up's the wifi interface.)
      
         So, what I tried was:
      
           # ip li add link br0 name br0.$VN type vlan id $VN
           # bridge vlan add vid $VN dev br0 self
           # ip li set dev br0.$VN up
      
         So far so good, we get a vlan interface on top of the bridge, and
         tcpdumping it shows we get traffic.  The 8021q_mode file has not
         changed state.  Everything still seems to be correct.
      
           # bridge addbr br1
      
         Still nothing has changed.
      
           # bridge addif br1 br0.$VN
      
         And now the 8021q_mode debugfs file shows that all ports are now in
         "disabled" mode, but /sys/class/net/br0/bridge/vlan_filtering still
         contains '1'.  In other words, br0 still thinks vlan filtering is
         enabled, but the hardware has had vlan filtering disabled.
      
         Adding some stack traces to an appropriate point indicates that this
         is because __switchdev_handle_port_attr_set() recurses down through
         the tree of interfaces, skipping over the vlan interface, applying
         br1's configuration to br0's ports.
      
         This surely can not be right - surely
         __switchdev_handle_port_attr_set() and similar should stop recursing
         down through another master bridge device?  There are probably other
         network device classes that switchdev shouldn't recurse down too.
      
         I've considered whether switchdev is the right level to do it, and
         I think it is - as we want the check/set callbacks to be called for
         the top level device even if it is a master bridge device, but we
         don't want to recurse through a lower master bridge device.
      
      v2: dropped patch 3, since that has an outstanding issue, and my
      question on it has not been answered.  Otherwise, these are the
      same patches.  Maybe we can move forward with just these two?
      
      v3: include DSA ports in patch 2
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: fix duplicate vlan warning · 933b4425
      Russell King authored
      When setting VLANs on DSA switches, the VLAN is added to both the port
      concerned as well as the CPU port by dsa_slave_vlan_add(), as well as
      any DSA ports.  If multiple ports are configured with the same VLAN ID,
      this triggers a warning on the CPU and DSA ports.
      
      Avoid this warning for CPU and DSA ports.
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: switchdev: do not propagate bridge updates across bridges · 07c6f980
      Russell King authored
      When configuring a tree of independent bridges, propagating changes
      from the upper bridge across a bridge master to the lower bridge
      ports brings surprises.
      
      For example, a lower bridge may have vlan filtering enabled.  It
      may have a vlan interface attached to the bridge master, which may
      then be incorporated into another bridge.  As soon as the lower
      bridge vlan interface is attached to the upper bridge, the lower
      bridge has vlan filtering disabled.
      
      This occurs because switchdev recursively applies its changes to
      all lower devices no matter what.
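
      The idea of stopping the recursion at a lower bridge master can be
      illustrated with a small userspace model. This is a sketch of the
      concept only, not the kernel code: struct dev, its fields and
      attr_set() are hypothetical stand-ins for net_device and the switchdev
      attribute handlers.

      ```c
      #include <stdbool.h>
      #include <stddef.h>

      #define MAX_LOWER 4

      /* Hypothetical, simplified stand-in for a network device. */
      struct dev {
          const char *name;
          bool is_bridge_master;
          bool is_port;            /* driver-owned switch port */
          int vlan_filtering;      /* attribute applied to ports */
          struct dev *lower[MAX_LOWER];
          size_t n_lower;
      };

      /*
       * Apply an attribute to every port below dev, but do not descend
       * through a lower device that is itself a bridge master: that bridge
       * owns the configuration of its own ports.
       */
      static void attr_set(struct dev *dev, int value)
      {
          size_t i;

          if (dev->is_port) {
              dev->vlan_filtering = value;
              return;
          }
          for (i = 0; i < dev->n_lower; i++) {
              if (dev->lower[i]->is_bridge_master)
                  continue;
              attr_set(dev->lower[i], value);
          }
      }
      ```

      With a stack of upper bridge, vlan interface, lower bridge and port,
      calling attr_set() on the upper bridge leaves the port untouched,
      while calling it on the lower bridge itself still reaches the port.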
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Tested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: qrtr: Fix error pointer vs NULL bugs · 9baeea50
      Dan Carpenter authored
      The callers only expect NULL pointers, so returning an error pointer
      will lead to an Oops.
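
      Why an error pointer slips past a NULL check can be shown with a
      userspace imitation of the kernel's error-pointer convention. The
      helpers err_ptr() and is_err() below only mimic ERR_PTR() and
      IS_ERR(); struct node and the lookup functions are hypothetical and do
      not correspond to the qrtr code.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /* Userspace imitations of the kernel's ERR_PTR()/IS_ERR() helpers. */
      #define MAX_ERRNO 4095

      static void *err_ptr(long err)
      {
          return (void *)(intptr_t)err;
      }

      static int is_err(const void *p)
      {
          return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
      }

      struct node { int id; };

      /* Buggy variant: reports failure as an error pointer (-12 is -ENOMEM). */
      static struct node *lookup_buggy(int id)
      {
          (void)id;
          return err_ptr(-12);
      }

      /* Fixed variant: reports failure as NULL, which is what callers test. */
      static struct node *lookup_fixed(int id)
      {
          (void)id;
          return NULL;
      }

      /* A caller that only checks for NULL before dereferencing. */
      static int caller_would_deref(const struct node *n)
      {
          return n != NULL;
      }
      ```

      The error pointer is non-NULL, so such a caller goes on to dereference
      an invalid address; returning NULL matches the check the caller makes.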
      
      Fixes: 0c2204a4 ("net: qrtr: Migrate nameservice to kernel from userspace")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: phy: mscc: add missing shift for media operation mode selection · 1ac7b090
      Antoine Tenart authored
      This patch adds a missing shift for the media operation mode selection.
      This does not fix the driver as the current operation mode (copper) has
      a value of 0, but this wouldn't work for other modes.
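
      The effect of a missing shift on a register field can be sketched as
      below. The field position, mask and mode values are assumptions chosen
      for illustration, not the actual register layout of the PHY; only the
      observation that a mode value of 0 hides the bug comes from the
      message above.

      ```c
      #include <stdint.h>

      /* Hypothetical register layout: a 3-bit mode field at bits 10:8. */
      #define MODE_POS    8
      #define MODE_MASK   (0x7u << MODE_POS)
      #define MODE_COPPER 0u
      #define MODE_OTHER  2u   /* illustrative non-zero mode */

      /* Without the shift the value lands in bits 2:0, outside the field. */
      static uint16_t set_mode_unshifted(uint16_t reg, uint16_t mode)
      {
          return (uint16_t)((reg & ~MODE_MASK) | mode);
      }

      /* With the shift the value is placed inside the mode field. */
      static uint16_t set_mode_shifted(uint16_t reg, uint16_t mode)
      {
          return (uint16_t)((reg & ~MODE_MASK) | ((unsigned)mode << MODE_POS));
      }
      ```

      For a mode of 0 both variants produce the same register value, which
      is why the existing copper configuration was unaffected.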
      Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ena: fix broken interface between ENA driver and FW · 92040c6d
      Arthur Kiyanovski authored
      In this commit we revert the part of
      commit 1a63443a ("net/amazon: Ensure that driver version is aligned to the linux kernel"),
      which breaks the interface between the ENA driver and FW.
      
      We also replace the use of DRIVER_VERSION with DRIVER_GENERATION
      when we bring back the deleted constants that are used in the interface
      with the ENA device FW.
      
      This commit does not change the driver version reported to the user via
      ethtool, which remains the kernel version.
      
      Fixes: 1a63443a ("net/amazon: Ensure that driver version is aligned to the linux kernel")
      Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mptcp-update-mptcp-ack-sequence-outside-of-recv-path' · 621135a0
      David S. Miller authored
      Florian Westphal says:
      
      ====================
      mptcp: update mptcp ack sequence outside of recv path
      
      This series moves mptcp-level ack sequence update outside of the recvmsg path.
      Current approach has two problems:
      
      1. There is a delay between arrival of new data and the time we can ack
         this data.
      2. If userspace doesn't call recv for some time, mptcp ack_seq is not
         updated at all, even if this data is queued in the subflow socket
         receive queue.
      
      Move skbs from the subflow socket receive queue to the mptcp-level
      receive queue, updating the mptcp-level ack sequence and have recv
      take skbs from the mptcp-level receive queue.
      
      The first place where we will attempt to update the mptcp level acks
      is from the subflows' data_ready callback, even before we make userspace
      aware of new data.
      
      Because of a possible deadlock (we need to take the mptcp socket lock
      while already holding the subflow socket's lock), we may still need to
      defer the mptcp-level ack update.  In such a case, this work will be
      done either from the work queue or the recv path, depending on which
      runs sooner.
      
      In order to avoid pointless scheduling of the work queue, work
      will be queued from the mptcp socket's lock release callback.
      This makes it possible to detect when the socket owner did drain the
      subflow socket receive queue.
      
      Please see individual patches for more information.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: defer work schedule until mptcp lock is released · 14c441b5
      Paolo Abeni authored
      Don't schedule the work queue right away, instead defer this
      to the lock release callback.
      
      This has the advantage that it will give the recv path a chance to
      complete -- this might have moved all pending packets from the
      subflow to the mptcp receive queue, which makes it possible to
      avoid the schedule_work().
      Co-developed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: avoid work queue scheduling if possible · 2e52213c
      Florian Westphal authored
      We can't lock_sock() the mptcp socket from the subflow data_ready callback:
      it would result in an ABBA deadlock with the subflow socket lock.
      
      We can however grab the spinlock: if that succeeds and the mptcp socket
      is not owned at the moment, we can process the new skbs right away
      without deferring this to the work queue.
      
      This avoids the schedule_work and hence the small delay until the
      work item is processed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: remove mptcp_read_actor · bfae9dae
      Florian Westphal authored
      Only used to discard stale data from the subflow, so move
      it where needed.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add rmem queue accounting · 600911ff
      Florian Westphal authored
      If userspace never drains the receive buffers we must stop draining
      the subflow socket(s) at some point.
      
      This adds the needed rmem accounting for this.
      If the threshold is reached, we stop draining the subflows.
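
      The accounting idea (charge each moved packet against a limit and stop
      draining once the limit is exceeded) can be sketched as follows. The
      structure, field names and the exact comparison are assumptions made
      for illustration; the kernel's actual bookkeeping differs in detail.

      ```c
      #include <stddef.h>

      /* Hypothetical receive-side state. */
      struct rx_state {
          size_t rmem_alloc;   /* bytes currently charged to the receive queue */
          size_t rcvbuf;       /* threshold after which draining stops */
      };

      /* Move packets from a subflow while the threshold allows; return the count. */
      static size_t drain_subflow(struct rx_state *rx, const size_t *pkt_len, size_t n)
      {
          size_t moved = 0;

          while (moved < n && rx->rmem_alloc <= rx->rcvbuf) {
              rx->rmem_alloc += pkt_len[moved];
              moved++;
          }
          return moved;
      }

      /* A userspace read releases the charge again, so draining can resume. */
      static void consume(struct rx_state *rx, size_t bytes)
      {
          rx->rmem_alloc -= bytes < rx->rmem_alloc ? bytes : rx->rmem_alloc;
      }
      ```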
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: update mptcp ack sequence from work queue · 6771bfd9
      Florian Westphal authored
      If userspace is not reading data, all the mptcp-level acks contain the
      ack_seq from the last time userspace read data rather than the most
      recent in-sequence value.
      
      This causes pointless retransmissions for data that is already queued.
      
      The reason for this is that all the mptcp protocol level processing
      happens at mptcp_recv time.
      
      This adds a work queue to move skbs from the subflow sockets' receive
      queue to the mptcp socket receive queue (which was not used so far).
      
      This allows us to announce the correct mptcp ack sequence in a timely
      fashion, even when the application does not call recv() on the mptcp socket
      for some time.
      
      We still wake userspace tasks waiting for POLLIN immediately:
      If the mptcp level receive queue is empty (because the work queue is
      still pending) it can be filled from in-sequence subflow sockets at
      recv time without a need to wait for the worker.
      
      The skb_orphan when moving skbs from subflow to mptcp level is needed,
      because the destructor (sock_rfree) relies on skb->sk (ssk!) lock
      being taken.
      
      A followup patch will add the needed rmem accounting for the moved skbs.
      
      Other problem: if the application behaves as expected and calls
      recv() as soon as the mptcp socket becomes readable, the work queue will
      only waste cpu cycles.  This will also be addressed in followup patches.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add work queue skeleton · 80992017
      Paolo Abeni authored
      Will be extended with functionality in followup patches.
      Initial user is moving skbs from subflows receive queue to
      the mptcp-level receive queue.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mptcp: add and use mptcp_data_ready helper · 101f6f85
      Florian Westphal authored
      This allows us to schedule the work queue to drain the ssk receive queue in
      a followup patch.
      
      This is needed to avoid sending all-too-pessimistic mptcp-level
      acknowledgements.  At this time, the ack_seq is what was last read by
      userspace instead of the highest in-sequence number queued for reading.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-Small-driver-update' · 5cd129dd
      David S. Miller authored
      Jiri Pirko says:
      
      ====================
      mlxsw: Small driver update
      
      This patchset contains a couple of patches not related to each other. They
      are small optimization and extension changes to the driver.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Add mlxsw_sp_span_ops.buffsize_get for Spectrum-3 · 3b909c55
      Petr Machata authored
      The buffer factor on Spectrum-3 is larger than on Spectrum-2. Add a new
      callback and use it for mlxsw_sp->span_ops on Spectrum-3.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Initialize advertised speeds to supported speeds · b401ff85
      Ido Schimmel authored
      During port initialization the driver instructs the device to only
      advertise speeds that can be supported by the port's current width.
      
      Since the device now returns the supported speeds based on the port's
      current width, the driver no longer needs to compute the speeds that can
      be advertised.
      
      Simplify port initialization by setting the advertised speeds to the
      queried supported speeds.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Move the ECN-marked packet counter to ethtool · 8a29581e
      Petr Machata authored
      Spectrum-1 and Spectrum-2 do not have a per-TC counter of number of packets
      marked by ECN. The value reported currently is the total number of marked
      packets. Showing this value at individual TC Qdiscs is misleading.
      
      Move the counter to ethtool instead.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_switchdev: Optimize SFN records processing · 648e53ca
      Jiri Pirko authored
      Currently, only one SFN query is done from repetitive work at a time,
      processing 64 entries. Another work iteration is scheduled in 100ms,
      which means that the max rate of learned FDB entries is limited to 640/s.
      That is slow. Fix this by doing 2 optimizations:
      1) Run 10 SFN queries at a time.
      2) In case the SFN is not drained, schedule work with 0 delay so that
         processing of the rest of the records continues immediately.
      
      On a testing setup with 500K entries the time to process decreased
      from 870secs to 10secs.
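
      The throughput bound of the old polling scheme follows directly from
      the figures quoted above (64 entries per query, one query per 100 ms
      work iteration). A small sketch of that arithmetic; the function names
      are illustrative:

      ```c
      /* Upper bound on entries processed per second by a fixed-interval poller. */
      static unsigned long max_rate(unsigned long entries_per_query,
                                    unsigned long queries_per_run,
                                    unsigned long interval_ms)
      {
          return entries_per_query * queries_per_run * 1000 / interval_ms;
      }

      /* Seconds needed to work through a backlog at a given rate, rounded up. */
      static unsigned long secs_for(unsigned long backlog, unsigned long rate)
      {
          return (backlog + rate - 1) / rate;
      }
      ```

      One query of 64 entries every 100 ms gives 640 entries per second, so
      500K entries need at least about 780 seconds, which is the same order
      as the 870 seconds measured before the change.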
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Tested-by: Alex Kushnarov <alexanderk@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • af_llc: fix if-statement empty body warning · c535f920
      Randy Dunlap authored
      When debugging via dprintk() is not enabled, make the dprintk()
      macro be an empty do-while loop, as is done in
      <linux/sunrpc/debug.h>.
      
      This fixes a gcc warning when -Wextra is set:
      ../net/llc/af_llc.c:974:51: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]
      
      I have verified that there is no object code change (with gcc 7.5.0).
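
      The macro pattern referred to here can be sketched in plain C. The
      DEBUG_LLC switch and the check() function are hypothetical; the point
      is only that the disabled variant still expands to a complete
      statement, so an `if` whose body is a lone dprintk() call is never
      left with an empty body.

      ```c
      #include <stdio.h>

      #ifdef DEBUG_LLC   /* hypothetical switch, standing in for the real one */
      #define dprintk(...) printf(__VA_ARGS__)
      #else
      /* Empty do-while: a real statement that compiles to nothing. */
      #define dprintk(...) do { } while (0)
      #endif

      /* An if/else whose first branch consists only of a debug print. */
      static int check(int rc)
      {
          if (rc)
              dprintk("failed: %d\n", rc);
          else
              rc = 1;
          return rc;
      }
      ```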
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: netdev@vger.kernel.org
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-updates-2020-02-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 165b94ff
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2020-02-25
      
      The following series provides some misc updates to mlx5 driver:
      
      1) From Maxim, Refactoring for mlx5e netdev channels recreation flow.
        - Add error handling
        - Add context to the preactivate hook
        - Use preactivate hook with context where it can be used
          and subsequently unify channel recreation flow everywhere.
        - Fix XPS cpumask to not reset upon channel recreation.
      
      2) From Tariq:
        - Use indirect calls wrapper on RX.
        - Check LRO capability bit
      
      3) Multiple small cleanups
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'net-smc-improve-peer-ID-in-CLC-decline' · 06baf4be
      David S. Miller authored
      Hans Wippel says:
      
      ====================
      net/smc: improve peer ID in CLC decline
      
      The following two patches improve the peer ID in CLC decline messages if
      RoCE devices are present in the host but no suitable device is found for
      a connection. The first patch reworks the peer ID initialization. The
      second patch contains the actual changes of the CLC decline messages.
      
      Changes v1 -> v2:
      * make smc_ib_is_valid_local_systemid() static in first patch
      * changed if in smc_clc_send_decline() to remove curly braces
      
      Changes RFC -> v1:
      * split the patch into two parts
      * removed zero assignment to global variable (thanks Leon)
      
      Thanks to Leon Romanovsky and Karsten Graul for the feedback!
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/smc: improve peer ID in CLC decline for SMC-R · a082ec89
      Hans Wippel authored
      According to RFC 7609, all CLC messages contain a peer ID that consists
      of a unique instance ID and the MAC address of one of the host's RoCE
      devices. But if a SMC-R connection cannot be established, e.g., because
      no matching pnet table entry is found, the current implementation uses a
      zero value in the CLC decline message although the host's peer ID is set
      to a proper value.
      
      If no RoCE and no ISM device is usable for a connection, there is no LGR
      and the LGR check in smc_clc_send_decline() prevents the peer ID from being
      copied into the CLC decline message for both SMC-D and SMC-R. So, this
      patch modifies the check to also accept the case of no LGR. Also, only a
      valid peer ID is copied into the decline message.
      Signed-off-by: Hans Wippel <ndev@hwipl.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a082ec89
    • Hans Wippel's avatar
      net/smc: rework peer ID handling · 366bb249
      Hans Wippel authored
      This patch initializes the peer ID to a random instance ID and a zero
      MAC address. If a RoCE device is in the host, the MAC address part of
      the peer ID is overwritten with the respective address. Also, a function
      for checking if the peer ID is valid is added. A peer ID is considered
      valid if the MAC address part contains a non-zero MAC address.
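
      The scheme described above can be sketched in a few lines of Python.
      This is only an illustration, not the kernel code: the 2-byte
      instance ID and 6-byte MAC layout is an assumption here, and the
      helper names new_peer_id(), set_mac() and is_valid() are hypothetical.

```python
import os

INSTANCE_ID_LEN = 2  # assumption: the instance ID occupies the first 2 bytes
MAC_LEN = 6          # Ethernet MAC address length


def new_peer_id() -> bytearray:
    """Random instance ID followed by an all-zero MAC address."""
    return bytearray(os.urandom(INSTANCE_ID_LEN) + bytes(MAC_LEN))


def set_mac(peer_id: bytearray, mac: bytes) -> None:
    """Overwrite the MAC part of the peer ID with a device's address."""
    if len(mac) != MAC_LEN:
        raise ValueError("MAC address must be 6 bytes")
    peer_id[INSTANCE_ID_LEN:] = mac


def is_valid(peer_id: bytes) -> bool:
    """A peer ID counts as valid once its MAC part is non-zero."""
    return any(peer_id[INSTANCE_ID_LEN:])
```

      With this sketch, a freshly created peer ID is not valid until
      set_mac() has been called with a non-zero address.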
      Signed-off-by: Hans Wippel <ndev@hwipl.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      366bb249
    • Arjun Roy's avatar
      tcp-zerocopy: Update returned getsockopt() optlen. · 0b7f41f6
      Arjun Roy authored
      TCP receive zerocopy currently does not update the returned optlen for
      getsockopt() if the user passed in a larger than expected value.
      Thus, userspace cannot properly determine if all the fields are set in
      the passed-in struct. This patch sets the optlen for this case before
      returning, in keeping with the expected operation of getsockopt().
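
      To illustrate why the returned optlen matters, here is a toy model
      of the value-result length in Python. It does not call the real
      socket API; the four-field layout and the names fake_getsockopt()
      and fields_set() are assumptions made up for this sketch.

```python
import struct

# Hypothetical option layout for illustration: four unsigned 32-bit fields.
KERNEL_STRUCT = struct.Struct("=IIII")


def fake_getsockopt(kernel_values, user_len):
    """Copy at most the kernel's struct size and report the bytes written."""
    data = KERNEL_STRUCT.pack(*kernel_values)
    written = min(user_len, len(data))
    return data[:written], written


def fields_set(returned_len, field_size=4):
    """Number of whole fields the caller may treat as filled in."""
    return returned_len // field_size
```

      A caller that passes a buffer larger than the kernel's struct can
      then use the returned length, rather than the length it passed in,
      to decide which fields are valid.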
      
      Fixes: c8856c05 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.")
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b7f41f6
    • David S. Miller's avatar
      Merge branch 'net-fix-sysfs-permssions-when-device-changes-network' · ebb4a4bf
      David S. Miller authored
      Christian Brauner says:
      
      ====================
      net: fix sysfs permssions when device changes network
      
      /* v7 */
      This is v7 with a build warning fixup that slipped past me the last
      time. It removes two unused variables in sysfs_group_change_owner(). I
      observed no warning when building just now.
      
      /* v6 */
      This is v6 with two small fixups. I missed adapting the commit message
      to reflect the renamed helper for changing the owner of sysfs files and
      I also forgot to make the new dpm helper static inline.
      
      /* v5 */
      This is v5 with a small fixup requested by Rafael.
      
      /* v4 */
      This is v4 with more documentation and other fixes that Greg requested.
      
      /* v3 */
      This is v3 with explicit uid and gid parameters added to functions that
      change sysfs object ownership as Greg requested.
      
      (I've tagged this with net-next since it's triggered by a bug for
       network device files but it also touches driver core aspects so it's
       not clear-cut. I can of course split this series into separate
       patchsets.)

      We have been struggling with a bug, reported by multiple users,
      surrounding the ownership of network device sysfs files when moving
      network devices between network namespaces owned by different user
      namespaces.
      
      Currently, when moving network devices between network namespaces the
      ownership of the corresponding sysfs entries is not changed. This leads
      to problems when tools try to operate on the corresponding sysfs files.
      
      It also causes a bug when creating a network device in a network
      namespace owned by a user namespace and moving that network device back
      to the host network namespace. When a network device is created in a
      network namespace, it is owned by the root user of the owning user
      namespace, and all its associated sysfs files are also owned by that
      root user.

      If such a network device is moved back to the host network namespace,
      the ownership will still be set to the root user of the user namespace
      that owns the originating network namespace. This means unprivileged
      users can e.g. re-trigger uevents for such incorrectly owned devices on
      the host or in other network namespaces. They can also modify the
      settings of the device itself through sysfs when they wouldn't be able
      to do the same through netlink. Both of these things are unwanted.
      
      For example, quite a few workloads will create network devices in the
      host network namespace. Other tools will then proceed to move such
      devices between network namespaces owned by other user namespaces. While
      the ownership of the device itself is updated in
      net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs
      entry for the device is not. Below is the result of moving a network
      device (here a veth device) from a network namespace into another
      network namespace owned by a different user namespace with a different
      id mapping. As you can see, the ownership is wrong even though the
      device is owned by the userns root user after it has been moved and can
      be interacted with through netlink:
      
      drwxr-xr-x 5 nobody nobody    0 Jan 25 18:08 .
      drwxr-xr-x 9 nobody nobody    0 Jan 25 18:08 ..
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id
      drwxr-xr-x 2 nobody nobody    0 Jan 25 18:09 power
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down
      drwxr-xr-x 4 nobody nobody    0 Jan 25 18:09 queues
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed
      drwxr-xr-x 2 nobody nobody    0 Jan 25 18:09 statistics
      lrwxrwxrwx 1 nobody nobody    0 Jan 25 18:08 subsystem -> ../../../../class/net
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len
      -r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type
      -rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent
      
      Contrast this with creating a device of the same type in the network
      namespace directly. In this case the device's sysfs permissions will be
      correctly updated.

      (Please also note that in a lot of workloads this strategy of creating
       the network device directly in the network namespace to work around
       this issue cannot be used, either because the network device is
       dedicated after it has been created or because it is used by a process
       that is heavily sandboxed and couldn't create network devices itself.):
      
      drwxr-xr-x 5 root   root      0 Jan 25 18:12 .
      drwxr-xr-x 9 nobody nobody    0 Jan 25 18:08 ..
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 addr_assign_type
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 addr_len
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 address
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 broadcast
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 carrier
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_changes
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_down_count
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_up_count
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 dev_id
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 dev_port
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 dormant
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 duplex
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 flags
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 gro_flush_timeout
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 ifalias
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 ifindex
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 iflink
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 link_mode
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 mtu
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 name_assign_type
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 netdev_group
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 operstate
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_port_id
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_port_name
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_switch_id
      drwxr-xr-x 2 root   root      0 Jan 25 18:12 power
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 proto_down
      drwxr-xr-x 4 root   root      0 Jan 25 18:12 queues
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 speed
      drwxr-xr-x 2 root   root      0 Jan 25 18:12 statistics
      lrwxrwxrwx 1 nobody nobody    0 Jan 25 18:12 subsystem -> ../../../../class/net
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 tx_queue_len
      -r--r--r-- 1 root   root   4096 Jan 25 18:12 type
      -rw-r--r-- 1 root   root   4096 Jan 25 18:12 uevent
      
      Now, when creating a network device in a network namespace owned by a
      user namespace and moving it to the host the permissions will be set to
      the id that the user namespace root user has been mapped to on the host
      leading to all sorts of permission issues mentioned above:
      
      458752
      drwxr-xr-x 5 458752 458752      0 Jan 25 18:12 .
      drwxr-xr-x 9 root   root        0 Jan 25 18:08 ..
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 addr_assign_type
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 addr_len
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 address
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 broadcast
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_changes
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_down_count
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_up_count
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dev_id
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dev_port
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dormant
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 duplex
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 flags
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 gro_flush_timeout
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 ifalias
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 ifindex
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 iflink
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 link_mode
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 mtu
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 name_assign_type
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 netdev_group
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 operstate
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_port_id
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_port_name
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_switch_id
      drwxr-xr-x 2 458752 458752      0 Jan 25 18:12 power
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 proto_down
      drwxr-xr-x 4 458752 458752      0 Jan 25 18:12 queues
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 speed
      drwxr-xr-x 2 458752 458752      0 Jan 25 18:12 statistics
      lrwxrwxrwx 1 root   root        0 Jan 25 18:12 subsystem -> ../../../../class/net
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 tx_queue_len
      -r--r--r-- 1 458752 458752   4096 Jan 25 18:12 type
      -rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 uevent
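
      Listings like the three above can also be checked programmatically.
      The following Python sketch collects the owner of every entry in a
      directory without following symlinks and reports those that do not
      match an expected uid/gid. The helper names owners() and mismatched()
      are hypothetical and not part of the patchset.

```python
import os


def owners(directory):
    """Return {name: (uid, gid)} for each entry, not following symlinks."""
    result = {}
    with os.scandir(directory) as it:
        for entry in it:
            st = entry.stat(follow_symlinks=False)
            result[entry.name] = (st.st_uid, st.st_gid)
    return result


def mismatched(directory, uid, gid):
    """Sorted names of entries not owned by the expected uid and gid."""
    return sorted(
        name for name, owner in owners(directory).items() if owner != (uid, gid)
    )
```

      For instance, mismatched("/sys/class/net/veth0", 0, 0) would list the
      entries not owned by root, assuming a device named veth0 exists.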
      
      Fix this by changing the basic sysfs files associated with network
      devices when moving them between network namespaces. To this end we add
      some infrastructure to sysfs.
      
      The patchset takes care to only do this when the owning user namespace
      changes and the kids (kernel ids) differ. So there's only a performance
      overhead when the owning user namespace of the network namespace is
      different __and__ the kid mappings for the root user are different for
      the two user namespaces.

      Assume we have a netdev eth0 which we create in netns1 owned by userns1.
      userns1 has an id mapping of 0 100000 100000. Now we move eth0 into
      netns2 which is owned by userns2 which also defines an id mapping of 0
      100000 100000. In this case sysfs doesn't need updating. The patch will
      handle this case and not do any needless work. Now assume eth0 is moved
      into netns3 which is owned by userns3 which defines an id mapping of 0
      123456 65536. In this case the root user in each namespace corresponds
      to a different kid and sysfs needs updating.
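
      The eth0 example can be worked through with a small Python sketch.
      It assumes each mapping is written as "ns_start host_start count",
      as in the example, and it only models a single id mapping rather
      than separate uid and gid mappings. The names map_id() and
      needs_sysfs_update() are hypothetical.

```python
def map_id(ns_id, mapping):
    """Translate an id inside the namespace to the id it maps to outside."""
    for line in mapping.strip().splitlines():
        ns_start, host_start, count = (int(field) for field in line.split())
        if ns_start <= ns_id < ns_start + count:
            return host_start + (ns_id - ns_start)
    return None  # the id is not mapped in this namespace


def needs_sysfs_update(old_mapping, new_mapping):
    """Ownership only has to change if root maps to a different id."""
    return map_id(0, old_mapping) != map_id(0, new_mapping)


# userns1 -> userns2: identical mappings, so nothing to do.
same = needs_sysfs_update("0 100000 100000", "0 100000 100000")
# userns1 -> userns3: root maps to 123456 instead of 100000.
differs = needs_sysfs_update("0 100000 100000", "0 123456 65536")
```

      Here same evaluates to False and differs to True, matching the two
      cases described above.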
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ebb4a4bf
    • Christian Brauner's avatar
      net: fix sysfs permssions when device changes network namespace · ef6a4c88
      Christian Brauner authored
      Now that all the helpers are in place, make use of netdev_change_owner()
      to fix up the permissions when moving network devices between network
      namespaces.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ef6a4c88
    • Christian Brauner's avatar
      net-sysfs: add queue_change_owner() · d755407d
      Christian Brauner authored
      Add a function to change the owner of the queue entries for a network device
      when it is moved between network namespaces.
      
      Currently, when moving network devices between network namespaces the
      ownership of the corresponding queue sysfs entries is not changed. This
      leads to problems when tools try to operate on the corresponding sysfs
      files. Fix this.
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d755407d