1. 24 Feb, 2022 4 commits
  2. 23 Feb, 2022 36 commits
    • Revert "vlan: move dev_put into vlan_dev_uninit" · 6a47cdc3
      Xin Long authored
      This reverts commit d6ff94af.
      
      Since commit faab39f6 ("net: allow out-of-order netdev unregistration")
      fixed the issue in a better way, this patch is to revert the previous fix,
      as it might bring back the old problem fixed by commit 563bcbae ("net:
      vlan: fix a UAF in vlan_dev_real_dev()").
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Link: https://lore.kernel.org/r/563c0a6e48510ccbff9ef4715de37209695e9fc4.1645592097.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: Correct wrong BH disable in hard-interrupt. · 167053f8
      Sebastian Andrzej Siewior authored
      I missed the obvious case where netif_rx() is invoked from hard-IRQ
      context.
      
      Disabling bottom halves is only needed in process context. This ensures
      that the code remains on the current CPU and that the soft-interrupts
      are processed at local_bh_enable() time.
      In hard- and soft-interrupt context this is already the case and the
      soft-interrupts will be processed once the context is left (at irq-exit
      time).
      
      Disable bottom halves if neither hard-interrupts nor soft-interrupts are
      disabled. Update the kernel-doc, mention that interrupts must be enabled
      if invoked from process context.
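
The context check described above can be modeled in a few lines. This is a minimal userspace sketch in Python, not the kernel code: `CpuContext`, `in_interrupt` and `netif_rx_model` are illustrative names, and the two booleans are an assumed simplification of the kernel's hard-IRQ and soft-IRQ state.

```python
from dataclasses import dataclass, field


@dataclass
class CpuContext:
    """Toy stand-in for the execution context (hypothetical, not a kernel type)."""
    in_hardirq: bool = False
    in_softirq: bool = False
    log: list = field(default_factory=list)


def in_interrupt(ctx):
    # True in hard- or soft-interrupt context, where pending soft-interrupts
    # are processed anyway once the context is left.
    return ctx.in_hardirq or ctx.in_softirq


def netif_rx_model(ctx, packet):
    """Enqueue a packet; disable bottom halves only in process context."""
    need_bh_off = not in_interrupt(ctx)
    if need_bh_off:
        ctx.log.append("local_bh_disable")
    ctx.log.append(f"enqueue {packet}")
    if need_bh_off:
        # In the real kernel, soft-interrupts would be processed at this point.
        ctx.log.append("local_bh_enable")
    return need_bh_off
```

In this model a call from process context brackets the enqueue with the disable/enable pair, while a call from hard-IRQ context only enqueues.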
      
      Fixes: baebdf48 ("net: dev: Makes sure netif_rx() can be invoked in any context.")
      Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Link: https://lore.kernel.org/r/Yg05duINKBqvnxUc@linutronix.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'locked-bridge-ports' · 6ce71687
      David S. Miller authored
      Hans Schultz says:
      
      ====================
      Add support for locked bridge ports (for 802.1X)
      
      This series starts by adding support for SA filtering to the bridge,
      which is then allowed to be offloaded to switchdev devices. Furthermore
      an offloading implementation is supplied for the mv88e6xxx driver.
      
      Public Local Area Networks are often deployed such that there is a
      risk of unauthorized or unattended clients getting access to the LAN.
      To prevent such access we introduce SA filtering, such that ports
      designated as secure ports are set in locked mode, so that only
      authorized source MAC addresses are given access by adding them to
      the bridge's forwarding database. Incoming packets with source MAC
      addresses that are not in the forwarding database of the bridge are
      discarded. It is then the task of user space daemons to populate the
      bridge's forwarding database with static entries of authorized entities.
      
      The most common approach is to use the IEEE 802.1X protocol to take
      care of the authorization of allowed users to gain access by opening
      for the source address of the authorized host.
      
      With the current use of the bridge parameter in hostapd, there is
      a limitation in using this for IEEE 802.1X port authentication. It
      depends on hostapd attaching the port on which it has a successful
      authentication to the bridge, but that only allows for a single
      authentication per port. This patch set allows for the use of
      IEEE 802.1X port authentication in a more general network context with
      multiple 802.1X aware hosts behind a single port as depicted, which is
      a commonly used commercial use-case, as it is only the number of
      available entries in the forwarding database that limits the number of
      authenticated clients.
      
            +--------------------------------+
            |                                |
            |      Bridge/Authenticator      |
            |                                |
            +-------------+------------------+
             802.1X port  |
                          |
                          |
                   +------+-------+
                   |              |
                   |  Hub/Switch  |
                   |              |
                   +-+----------+-+
                     |          |
                  +--+--+    +--+--+
                  |     |    |     |
          Hosts   |  a  |    |  b  |   . . .
                  |     |    |     |
                  +-----+    +-----+
      
      The 802.1X standard involves three different components, a Supplicant
      (Host), an Authenticator (Network Access Point) and an Authentication
      Server which is typically a Radius server. This patch set thus enables
      the bridge module together with an authenticator application to serve
      as an Authenticator on designated ports.
      
      For the bridge to become an IEEE 802.1X Authenticator, a solution using
      hostapd with the bridge driver can be found at
      https://github.com/westermo/hostapd/tree/bridge_driver .
      
      The relevant components work transparently regardless of whether the
      bridge module or the offloaded switchcore is in use.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: tests of locked port feature · b2b681a4
      Hans Schultz authored
      These tests check that the basic locked port feature works, so that
      no 'host' can communicate (ping) through a locked port unless the
      MAC address of the 'host' interface is in the forwarding database of
      the bridge.
      Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
      Acked-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: mv88e6xxx: Add support for bridge port locked mode · 34ea415f
      Hans Schultz authored
      Supporting bridge ports in locked mode using the drop on lock
      feature in Marvell mv88e6xxx switchcores is described in the
      '88E6096/88E6097/88E6097F Datasheet', sections 4.4.6, 4.4.7 and
      5.1.2.1 (Drop on Lock).
      
      This feature is implemented here facilitated by the locked port flag.
      Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Include BR_PORT_LOCKED in the list of synced brport flags · b9e8b58f
      Hans Schultz authored
      Ensures that the DSA switch driver gets notified of changes to the
      BR_PORT_LOCKED flag as well, for the case when a DSA port joins or
      leaves a LAG that is a bridge port.
      Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Add support for offloading of locked port flag · fa1c8334
      Hans Schultz authored
      Various switchcores support setting ports in locked mode, so that
      clients behind locked ports cannot send traffic through the port
      unless an fdb entry is added with the client's MAC address.
      Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bridge: Add support for bridge port in locked mode · a21d9a67
      Hans Schultz authored
      In an 802.1X scenario, clients connected to a bridge port shall not
      be allowed to have traffic forwarded until fully authenticated.
      A static fdb entry of the client's MAC address for the bridge port
      unlocks the client and allows bidirectional communication.
      
      This scenario is facilitated with setting the bridge port in locked
      mode, which is also supported by various switchcore chipsets.
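
The admission rule described above can be sketched as a small userspace model. This is an illustration in Python under stated assumptions, not the bridge implementation: the `Bridge` class, its method names, and the `(mac, vid)` key layout are all hypothetical.

```python
class Bridge:
    """Toy model of a bridge with lockable ports (hypothetical names)."""

    def __init__(self):
        self.fdb = {}        # (source MAC, VLAN id) -> port the entry points to
        self.locked = set()  # ports currently in locked mode

    def set_locked(self, port, on=True):
        if on:
            self.locked.add(port)
        else:
            self.locked.discard(port)

    def fdb_add_static(self, mac, port, vid=0):
        # Stands in for the user space daemon adding a static entry
        # after a successful 802.1X authentication.
        self.fdb[(mac, vid)] = port

    def ingress_allowed(self, port, src_mac, vid=0):
        """Return True if a frame arriving on `port` may be forwarded."""
        if port not in self.locked:
            return True
        # On a locked port the source MAC must have an fdb entry
        # that points at this very port.
        return self.fdb.get((src_mac, vid)) == port
```

A host behind a locked port is dropped until its MAC address is added for that port; ports that are not locked behave as before.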
      Signed-off-by: Hans Schultz <schultz.hans+netdev@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drop_monitor: remove quadratic behavior · b26ef81c
      Eric Dumazet authored
      drop_monitor is using a single list on which all netdevices in
      the host have an element, regardless of their netns.
      
      This scales poorly, not only at device unregister time (what I
      caught during my netns dismantle stress tests), but also at packet
      processing time whenever trace_napi_poll_hit() is called.
      
      If the intent was to avoid adding one pointer in 'struct net_device'
      then surely we prefer O(1) behavior.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mlxsw-next' · 503310a5
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Various updates
      
      This patchset contains miscellaneous updates to mlxsw gathered over
      time.
      
      Patches #1-#2 fix recent regressions present in net-next.
      
      Patches #3-#11 are small cleanups performed while adding line card
      support in mlxsw.
      
      Patch #12 adds the SFF-8024 Identifier Value of OSFP transceiver in
      order to be able to dump their EEPROM contents over the ethtool IOCTL
      interface.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Add support for OSFP transceiver modules · f881c4ab
      Danielle Ratson authored
      The driver can already dump the EEPROM contents of QSFP-DD transceiver
      modules via its ethtool_ops::get_module_info() and
      ethtool_ops::get_module_eeprom() callbacks.
      
      Add support for OSFP transceiver modules by adding their SFF-8024
      Identifier Value (0x19).
      
      This is required for future NVIDIA Spectrum-4 based systems that will be
      equipped with OSFP transceivers.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: Remove resource query check · cc4d3de9
      Ido Schimmel authored
      Since SwitchX-2 support was removed in commit b0d80c01 ("mlxsw:
      Remove Mellanox SwitchX-2 ASIC support"), all the ASICs supported by
      mlxsw support the resource query command.
      
      Therefore, remove the resource query check and always query resources
      from the device.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Unify method of trap support validation · 902992d1
      Vadim Pasternak authored
      Currently there are several different features defined in 'mlxsw_driver'
      for trap support validation. There is no reason to have dedicated
      features for specific traps. Perform validation of all of them by
      testing feature 'MLXSW_BUS_F_TXRX'.
      
      Remove trap capability validation from 'core_env.c' which is redundant
      after validation has been added to mlxsw_core_trap_register().
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Remove SP{1,2,3} defines for FW minor and subminor · 8b5f555b
      Jiri Pirko authored
      The FW minor and subminor versions are the same for all generations of
      Spectrum ASICs. Unify them into a single set of defines.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Remove unnecessary asserts · af9911c5
      Vadim Pasternak authored
      Remove unnecessary asserts for module index validation. Leave only one
      that is actually necessary in mlxsw_env_pmpe_listener_func() where the
      module index is directly read from the firmware event.
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: reg: Add "mgpir_" prefix to MGPIR fields comments · 719fc066
      Vadim Pasternak authored
      Do the same as for other registers and have "mgpir_" prefix for the
      MGPIR fields.
      Signed-off-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_thermal: Remove obsolete API for query resource · bfb82c9c
      Vadim Pasternak authored
      Remove obsolete API mlxsw_core_res_query_enabled(), which is only
      relevant for end-of-life SwitchX-2 ASICs. Support for these ASICs was
      removed in commit b0d80c01 ("mlxsw: Remove Mellanox SwitchX-2 ASIC
      support").
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_thermal: Rename labels according to naming convention · 009da9fa
      Vadim Pasternak authored
      Rename labels for error flow handling in order to align with the naming
      convention used in the rest of the 'mlxsw' code.
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_hwmon: Fix variable names for hwmon attributes · bed8f419
      Vadim Pasternak authored
      Replace all local variables 'mlwsw_hwmon_attr' by 'mlxsw_hwmon_attr'.
      All variable prefixes should start with 'mlxsw' according to the naming
      convention, so 'mlwsw' is changed to 'mlxsw'.
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core_thermal: Avoid creation of virtual hwmon objects by thermal module · f8a36880
      Vadim Pasternak authored
      The driver registers with both the hwmon and thermal subsystems.
      Therefore, there is no need for the thermal subsystem to automatically
      create hwmon entries upon registration of a thermal zone, as this
      results in duplicate information.
      
      Avoid creation of virtual hwmon objects by thermal subsystem by
      registering a thermal zone with 'no_hwmon' set to 'true'.
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_span: Ignore VLAN entries not used by the bridge in mirroring · 42c9135f
      Ido Schimmel authored
      Only VLAN entries installed on the bridge device itself should be
      considered when checking whether a packet with a specific VLAN can be
      mirrored via a bridge device. VLAN entries only used to keep context
      (i.e., entries with 'BRIDGE_VLAN_INFO_BRENTRY' unset) should be ignored.
      
      Fix this by preventing mirroring when the VLAN entry does not have the
      'BRIDGE_VLAN_INFO_BRENTRY' flag set.
      
      Fixes: ddaff504 ("mlxsw: spectrum: remove guards against !BRIDGE_VLAN_INFO_BRENTRY")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: core: Prevent trap group setting if driver does not support EMAD · c035ea76
      Vadim Pasternak authored
      Avoid trap group setting if driver is not capable of EMAD support.
      For example, the "mlxsw_minimal" driver works over the I2C bus, over
      which EMADs cannot be sent.
      Validation is performed by testing feature 'MLXSW_BUS_F_TXRX'.
      
      Fixes: 74e0494d ("mlxsw: core: Move basic_trap_groups_set() call out of EMAD init code")
      Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mctp: Fix warnings reported by clang-analyzer · 8d783197
      Matt Johnston authored
      net/mctp/device.c:140:11: warning: Assigned value is garbage or undefined
      [clang-analyzer-core.uninitialized.Assign]
              mcb->idx = idx;
      
      - Not a real problem due to how the callback runs, fix the warning.
      
      net/mctp/route.c:458:4: warning: Value stored to 'msk' is never read
      [clang-analyzer-deadcode.DeadStores]
              msk = container_of(key->sk, struct mctp_sock, sk);
      
      - 'msk' dead assignment can be removed here.
      Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'mctp-incorrect-addr-refs' · 3185485c
      David S. Miller authored
      Matt Johnston says:
      
      ====================
      mctp: Fix incorrect refs for extended addr
      
      This fixes an incorrect netdev unref and also addresses the race
      condition identified by Jakub in v2. Thanks for the review.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mctp: Fix incorrect netdev unref for extended addr · e297db3e
      Matt Johnston authored
      In the extended addressing local route output codepath
      dev_get_by_index_rcu() doesn't take a dev_hold() so we shouldn't
      dev_put().
      Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mctp: make __mctp_dev_get() take a refcount hold · dc121c00
      Matt Johnston authored
      Previously there was a race that could allow the mctp_dev refcount
      to hit zero:
      
      rcu_read_lock();
      mdev = __mctp_dev_get(dev);
      // mctp_unregister() happens here, mdev->refs hits zero
      mctp_dev_hold(mdev);
      rcu_read_unlock();
      
      Now we make __mctp_dev_get() take the hold itself. It is safe to test
      against the zero refcount because __mctp_dev_get() is called holding
      rcu_read_lock and mctp_dev uses kfree_rcu().
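
The "take the hold inside the lookup, unless the count is already zero" pattern can be sketched as follows. This is a single-threaded Python model with hypothetical names (`MctpDevModel`, `hold_unless_zero`, `dev_get`); it assumes the test-and-increment is atomic, which the kernel gets from its refcount primitives and which plain Python does not provide.

```python
class MctpDevModel:
    """Toy refcounted object (hypothetical stand-in for an mctp_dev)."""

    def __init__(self):
        self.refs = 1  # the reference owned by the registration

    def hold_unless_zero(self):
        # Refuse to resurrect an object whose last reference is gone.
        # Assumed atomic in the real implementation.
        if self.refs == 0:
            return False
        self.refs += 1
        return True

    def put(self):
        assert self.refs > 0, "unbalanced put"
        self.refs -= 1


def dev_get(mdev):
    """Lookup that returns the object with a reference already held.

    Returns None if there is no object or it is already being torn down,
    so the caller never sees a pointer it still has to hold separately.
    """
    if mdev is None or not mdev.hold_unless_zero():
        return None
    return mdev
```

Because the lookup either returns a held object or nothing, there is no window between lookup and hold in which an unregister could drop the count to zero.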
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Matt Johnston <matt@codeconstruct.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-realtek-phy-read-corruption' · 4767b7e2
      David S. Miller authored
      Alvin Šipraga says:
      
      ====================
      net: dsa: realtek: fix PHY register read corruption
      
      These two patches fix the issue reported by Arınç where PHY register
      reads sometimes return garbage data.
      
      v1 -> v2:
      
      - no code changes
      - just update the commit message of patch 2 to reflect the conclusion
        of further investigation requested by Vladimir
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: realtek: rtl8365mb: serialize indirect PHY register access · 27967284
      Alvin Šipraga authored
      Realtek switches in the rtl8365mb family can access the PHY registers of
      the internal PHYs via the switch registers. This method is called
      indirect access. At a high level, the indirect PHY register access
      method involves reading and writing some special switch registers in a
      particular sequence. This works for both SMI and MDIO connected
      switches.
      
      Currently the rtl8365mb driver does not take any care to serialize the
      aforementioned access to the switch registers. In particular, it is
      permitted for other driver code to access other switch registers while
      the indirect PHY register access is ongoing. Locking is only done at the
      regmap level. This, however, is a bug: concurrent register access, even
      to unrelated switch registers, risks corrupting the PHY register value
      read back via the indirect access method described above.
      
      Arınç reported that the switch sometimes returns nonsense data when
      reading the PHY registers. In particular, a value of 0 causes the
      kernel's PHY subsystem to think that the link is down, but since most
      reads return correct data, the link then flip-flops between up and down
      over a period of time.
      
      The aforementioned bug can be readily observed by:
      
       1. Enabling ftrace events for regmap and mdio
       2. Polling BMSR PHY register for a connected port;
          it should always read the same (e.g. 0x79ed)
       3. Wait for step 2 to give a different value
      
      Example command for step 2:
      
          while true; do phytool read swp2/2/0x01; done
      
      On my i.MX8MM, the above steps will yield a bogus value for the BMSR PHY
      register within a matter of seconds. The interleaved register access is
      then evident in the trace log:
      
       kworker/3:4-70      [003] .......  1927.139849: regmap_reg_write: ethernet-switch reg=1004 val=bd
           phytool-16816   [002] .......  1927.139979: regmap_reg_read: ethernet-switch reg=1f01 val=0
       kworker/3:4-70      [003] .......  1927.140381: regmap_reg_read: ethernet-switch reg=1005 val=0
           phytool-16816   [002] .......  1927.140468: regmap_reg_read: ethernet-switch reg=1d15 val=a69
       kworker/3:4-70      [003] .......  1927.140864: regmap_reg_read: ethernet-switch reg=1003 val=0
           phytool-16816   [002] .......  1927.140955: regmap_reg_write: ethernet-switch reg=1f02 val=2041
       kworker/3:4-70      [003] .......  1927.141390: regmap_reg_read: ethernet-switch reg=1002 val=0
           phytool-16816   [002] .......  1927.141479: regmap_reg_write: ethernet-switch reg=1f00 val=1
       kworker/3:4-70      [003] .......  1927.142311: regmap_reg_write: ethernet-switch reg=1004 val=be
           phytool-16816   [002] .......  1927.142410: regmap_reg_read: ethernet-switch reg=1f01 val=0
       kworker/3:4-70      [003] .......  1927.142534: regmap_reg_read: ethernet-switch reg=1005 val=0
           phytool-16816   [002] .......  1927.142618: regmap_reg_read: ethernet-switch reg=1f04 val=0
           phytool-16816   [002] .......  1927.142641: mdio_access: SMI-0 read  phy:0x02 reg:0x01 val:0x0000 <- ?!
       kworker/3:4-70      [003] .......  1927.143037: regmap_reg_read: ethernet-switch reg=1001 val=0
       kworker/3:4-70      [003] .......  1927.143133: regmap_reg_read: ethernet-switch reg=1000 val=2d89
       kworker/3:4-70      [003] .......  1927.143213: regmap_reg_write: ethernet-switch reg=1004 val=be
       kworker/3:4-70      [003] .......  1927.143291: regmap_reg_read: ethernet-switch reg=1005 val=0
       kworker/3:4-70      [003] .......  1927.143368: regmap_reg_read: ethernet-switch reg=1003 val=0
       kworker/3:4-70      [003] .......  1927.143443: regmap_reg_read: ethernet-switch reg=1002 val=6
      
      The kworker here is polling MIB counters for stats, as evidenced by the
      register 0x1004 that we are writing to (RTL8365MB_MIB_ADDRESS_REG). This
      polling is performed every 3 seconds, but is just one example of such
      unsynchronized access. In Arınç's case, the driver was not using the
      switch IRQ, so the PHY subsystem was itself doing polling analogous to
      phytool in the above example.
      
      A test module was created [see second Link] to simulate such spurious
      switch register accesses while performing indirect PHY register reads
      and writes. Realtek was also consulted to confirm whether this is a
      known issue or not. The conclusion of these lines of inquiry is as
      follows:
      
      1. Reading of PHY registers via indirect access will be aborted if,
         after executing the read operation (via a write to the
         INDIRECT_ACCESS_CTRL_REG), any register is accessed, other than
         INDIRECT_ACCESS_STATUS_REG.
      
      2. The PHY register indirect read is only complete when
         INDIRECT_ACCESS_STATUS_REG reads zero.
      
      3. The INDIRECT_ACCESS_DATA_REG, which is read to get the result of the
         PHY read, will contain the result of the last successful read
         operation. If there was spurious register access and the indirect
         read was aborted, then this register is not guaranteed to hold
         anything meaningful and the PHY read will silently fail.
      
      4. PHY writes do not appear to be affected by this mechanism.
      
      5. Other similar access routines, such as for MIB counters, although
         similar to the PHY indirect access method, are actually table access.
         Table access is not affected by spurious reads or writes of other
         registers. However, concurrent table access is not allowed. Currently
         this is protected via mib_lock, so there is nothing to fix.
      
      The above statements are corroborated both via the test module and
      through consultation with Realtek. In particular, Realtek states that
      this is simply a property of the hardware design and is not a hardware
      bug.
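
The findings above can be reproduced with a toy device model. This is a hedged Python sketch, not driver code: `SwitchModel` and `phy_read` are hypothetical, and the register numbers are assumptions inferred from the order of accesses in the trace (0x1f02 address, 0x1f00 control, 0x1f01 status, 0x1f04 data).

```python
ADDR_REG, CTRL_REG, STATUS_REG, DATA_REG = 0x1F02, 0x1F00, 0x1F01, 0x1F04


class SwitchModel:
    """Toy switch whose indirect PHY read is aborted by unrelated accesses."""

    def __init__(self, phy_value):
        self.phy_value = phy_value  # what the PHY register really contains
        self.data = 0               # data register: last successful result
        self.read_pending = False

    def write(self, reg, val):
        # Only a write to the control register starts an indirect read;
        # a write to any other register aborts one in flight.
        self.read_pending = (reg == CTRL_REG)

    def read(self, reg):
        if reg == STATUS_REG:
            if self.read_pending:
                self.data = self.phy_value  # the read completes
                self.read_pending = False
            return 0                        # zero means "not busy"
        self.read_pending = False           # any other access aborts
        return self.data if reg == DATA_REG else 0


def phy_read(sw, interfere=None):
    """Indirect read sequence; `interfere` models a concurrent accessor."""
    sw.write(ADDR_REG, 0x2041)
    sw.write(CTRL_REG, 1)
    if interfere is not None:
        interfere(sw)
    while sw.read(STATUS_REG) != 0:
        pass
    return sw.read(DATA_REG)
```

With no interference the model returns the PHY value. With a single unrelated register read injected after the control write, it silently returns whatever the data register held before, which is zero on a fresh device.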
      
      To fix this problem, one must guard against regmap access while the
      PHY indirect register read is executing. Fix this by using the newly
      introduced "nolock" regmap in all PHY-related functions, and by acquiring
      the regmap mutex at the top level of the PHY register access callbacks.
      Although no issue has been observed with PHY register _writes_, this
      change also serializes the indirect access method there. This is done
      purely as a matter of convenience and for reasons of symmetry.
      
      Fixes: 4af2950c ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC")
      Link: https://lore.kernel.org/netdev/CAJq09z5FCgG-+jVT7uxh1a-0CiiFsoKoHYsAWJtiKwv7LXKofQ@mail.gmail.com/
      Link: https://lore.kernel.org/netdev/871qzwjmtv.fsf@bang-olufsen.dk/
      Reported-by: Arınç ÜNAL <arinc.unal@arinc9.com>
      Reported-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Reviewed-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: realtek: allow subdrivers to externally lock regmap · 907e772f
      Alvin Šipraga authored
      Currently there is no way for Realtek DSA subdrivers to serialize
      consecutive regmap accesses. In preparation for a bugfix relating to
      indirect PHY register access - which involves a series of regmap
      reads and writes - add a facility for subdrivers to serialize their
      regmap access.
      
      Specifically, a mutex is added to the driver private data structure and
      the standard regmap is initialized with custom lock/unlock ops which use
      this mutex. Then, a "nolock" variant of the regmap is added, which is
      functionally equivalent to the existing regmap except that regmap
      locking is disabled. Functions that wish to serialize a sequence of
      regmap accesses may then lock the newly introduced driver-owned mutex
      before using the nolock regmap.
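
The arrangement described above, two views of the same registers sharing one driver-owned mutex, can be sketched in userspace. This is a Python illustration under stated assumptions, not the regmap API: the class names, the `map`, `map_nolock` and `map_lock` attribute names, and the register numbers are chosen for the example only.

```python
import threading


class NoLockRegmap:
    """Register access with no locking of its own."""

    def __init__(self, regs):
        self.regs = regs

    def read(self, reg):
        return self.regs.get(reg, 0)

    def write(self, reg, val):
        self.regs[reg] = val


class LockedRegmap:
    """Same registers, but each single access takes the shared mutex."""

    def __init__(self, nolock, lock):
        self.nolock = nolock
        self.lock = lock

    def read(self, reg):
        with self.lock:
            return self.nolock.read(reg)

    def write(self, reg, val):
        with self.lock:
            self.nolock.write(reg, val)


class Priv:
    """Driver private data: one mutex, two regmap views."""

    def __init__(self):
        self.map_lock = threading.Lock()
        self.map_nolock = NoLockRegmap({})
        self.map = LockedRegmap(self.map_nolock, self.map_lock)


def serialized_sequence(priv, addr):
    """Run a multi-step access as one unit by holding the mutex throughout.

    The nolock view must be used inside: the lock is not reentrant, so
    going through priv.map here would deadlock.
    """
    with priv.map_lock:
        priv.map_nolock.write(0x1F02, addr)
        priv.map_nolock.write(0x1F00, 1)
        priv.map_nolock.read(0x1F01)
        return priv.map_nolock.read(0x1F04)
```

Code that issues isolated reads and writes keeps using `priv.map` unchanged, and those accesses simply wait while a serialized sequence holds the mutex.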
      
      Doing things this way means that subdriver code that doesn't care about
      serialized register access - i.e. the vast majority of code - needn't
      worry about synchronizing register access with an external lock: it can
      just continue to use the original regmap.
      
      Another advantage of this design is that, while regmaps with locking
      disabled do not expose a debugfs interface for obvious reasons, there
      still exists the original regmap which does expose this interface. This
      interface remains safe to use even combined with driver codepaths that
      use the nolock regmap, because said codepaths will use the same mutex
      to synchronize access.
      
      With respect to disadvantages, it can be argued that having
      near-duplicate regmaps is confusing. However, the naming is rather
      explicit, and examples will abound.
      
      Finally, while we are at it, rename realtek_smi_mdio_regmap_config to
      realtek_smi_regmap_config. This makes it consistent with the naming
      realtek_mdio_regmap_config in realtek-mdio.c.
      Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: switchdev: avoid infinite recursion from LAG to bridge with port object handler · acd8df58
      Vladimir Oltean authored
      The logic from switchdev_handle_port_obj_add_foreign() is directly
      adapted from switchdev_handle_fdb_event_to_device(), which already
      detects events on foreign interfaces and reoffloads them towards the
      switchdev neighbors.
      
      However, when we have a simple br0 <-> bond0 <-> swp0 topology and
      switchdev_handle_port_obj_add_foreign() gets called on bond0, we get
      stuck in an infinite recursion:
      
      1. bond0 does not pass check_cb(), so we attempt to find switchdev
         neighbor interfaces. For that, we recursively call
         __switchdev_handle_port_obj_add() for bond0's bridge, br0.
      
      2. __switchdev_handle_port_obj_add() recurses through br0's lowers,
         essentially calling __switchdev_handle_port_obj_add() for bond0.
      
      3. Go to step 1.
      
      This happens because switchdev_handle_fdb_event_to_device() and
      switchdev_handle_port_obj_add_foreign() are not exactly the same.
      The FDB event helper special-cases LAG interfaces with its lag_mod_cb(),
      which is why it does not end up in an infinite loop: it doesn't
      attempt to treat LAG interfaces as potentially foreign bridge ports.
      
      The problem is solved by looking ahead through the bridge's lowers to
      see whether there is any switchdev interface that is foreign to the @dev
      we are currently processing. This stops the recursion described above at
      step 1: __switchdev_handle_port_obj_add(bond0) will not create another
      call to __switchdev_handle_port_obj_add(br0). Going one level up
      should only happen when we're starting from a bridge port that has been
      determined to be "foreign" to the switchdev driver that passes the
      foreign_dev_check_cb().
      
      Fixes: c4076cdd ("net: switchdev: introduce switchdev_handle_port_obj_{add,del} for foreign interfaces")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      acd8df58
    • Shannon Nelson's avatar
      ionic: use vmalloc include · 922ea87f
      Shannon Nelson authored
      The ever-vigilant Linux kernel test robot reminded us that
      we need to use the correct include files to be sure that
      all the build variations will work correctly.  Adding the
      vmalloc.h include takes care of declaring our use of vzalloc()
      and vfree().
      
      drivers/net/ethernet/pensando/ionic/ionic_lif.c:396:17: error: implicit
      declaration of function 'vfree'; did you mean 'kvfree'?
      
      drivers/net/ethernet/pensando/ionic/ionic_lif.c:531:21: warning:
      assignment to 'struct ionic_desc_info *' from 'int' makes pointer from
      integer without a cast
      
      Fixes: 116dce0f ("ionic: Use vzalloc for large per-queue related buffers")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Link: https://lore.kernel.org/r/20220223015731.22025-1-snelson@pensando.io
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      922ea87f
    • Jakub Kicinski's avatar
      Merge branch 'tcp-take-care-of-another-syzbot-issue' · fa4fad40
      Jakub Kicinski authored
      Eric Dumazet says:
      
      ====================
      tcp: take care of another syzbot issue
      
      This is a minor issue: It took months for syzbot to find a C repro,
      and even with it, I had to spend a lot of time to understand KFENCE
      was a prereq. With the default kfence 500ms interval, I had to be
      very patient to trigger the kernel warning and perform my analysis.
      
      This series targets net-next tree, because I added a new generic helper
      in the first patch, then fixed the issue in the second one.
      They can be backported once proven solid.
      ====================
      
      Link: https://lore.kernel.org/r/20220222032113.4005821-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      fa4fad40
    • Eric Dumazet's avatar
      net: preserve skb_end_offset() in skb_unclone_keeptruesize() · 2b88cba5
      Eric Dumazet authored
      syzbot found another way to trigger the infamous WARN_ON_ONCE(delta < len)
      in skb_try_coalesce() [1]
      
      I was able to root cause the issue to kfence.
      
      When kfence is in action, the following assertion is no longer true:
      
      int size = xxxx;
      void *ptr1 = kmalloc(size, gfp);
      void *ptr2 = kmalloc(size, gfp);
      
      if (ptr1 && ptr2)
      	ASSERT(ksize(ptr1) == ksize(ptr2));
      
      We attempted to fix these issues in the blamed commits, but forgot
      that TCP was possibly shifting data after skb_unclone_keeptruesize()
      has been used, notably from tcp_retrans_try_collapse().
      
      So we not only need to keep the same skb->truesize value,
      we also need to make sure TCP won't fill the new tailroom
      that pskb_expand_head() was able to get from an
      addr = kmalloc(...) followed by ksize(addr).
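
      The idea of keeping both values can be sketched with a toy model in
      plain C. This is not the kernel implementation: struct toy_buf, its
      fields and the toy_* helpers are hypothetical, and the new_usable
      argument merely stands in for what ksize() would report for the new
      allocation.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for an skb head. */
struct toy_buf {
	size_t len;		/* payload bytes in use */
	size_t end;		/* usable bytes in the data area (cf. skb_end_offset()) */
	size_t truesize;	/* memory accounted for this buffer */
};

static size_t toy_tailroom(const struct toy_buf *b)
{
	return b->end - b->len;
}

/* Models a reallocation of the head: the allocator may report a different
 * usable size than before, which changes both end and truesize. */
static void toy_realloc_head(struct toy_buf *b, size_t new_usable)
{
	b->truesize = b->truesize - b->end + new_usable;
	b->end = new_usable;
}

/* Unclone-like step: reallocate, then restore the saved truesize and
 * refuse to expose any tailroom beyond the saved end offset. */
static void toy_unclone_keep(struct toy_buf *b, size_t new_usable)
{
	size_t saved_end = b->end;
	size_t saved_truesize = b->truesize;

	toy_realloc_head(b, new_usable);
	b->truesize = saved_truesize;
	if (b->end > saved_end)
		b->end = saved_end;
}
```

      With len = 100, end = 128 and truesize = 512, a plain reallocation to
      192 usable bytes grows the tailroom from 28 to 92 bytes, whereas
      toy_unclone_keep() leaves end, truesize and tailroom unchanged, so a
      later data shift cannot fill space that was never accounted for.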
      
      Split skb_unclone_keeptruesize() into two parts:
      
      1) Inline skb_unclone_keeptruesize() for the common case,
         when skb is not cloned.
      
      2) Out of line __skb_unclone_keeptruesize() for the 'slow path'.
      
      WARNING: CPU: 1 PID: 6490 at net/core/skbuff.c:5295 skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
      Modules linked in:
      CPU: 1 PID: 6490 Comm: syz-executor161 Not tainted 5.17.0-rc4-syzkaller-00229-g4f12b742 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:skb_try_coalesce+0x1235/0x1560 net/core/skbuff.c:5295
      Code: bf 01 00 00 00 0f b7 c0 89 c6 89 44 24 20 e8 62 24 4e fa 8b 44 24 20 83 e8 01 0f 85 e5 f0 ff ff e9 87 f4 ff ff e8 cb 20 4e fa <0f> 0b e9 06 f9 ff ff e8 af b2 95 fa e9 69 f0 ff ff e8 95 b2 95 fa
      RSP: 0018:ffffc900063af268 EFLAGS: 00010293
      RAX: 0000000000000000 RBX: 00000000ffffffd5 RCX: 0000000000000000
      RDX: ffff88806fc05700 RSI: ffffffff872abd55 RDI: 0000000000000003
      RBP: ffff88806e675500 R08: 00000000ffffffd5 R09: 0000000000000000
      R10: ffffffff872ab659 R11: 0000000000000000 R12: ffff88806dd554e8
      R13: ffff88806dd9bac0 R14: ffff88806dd9a2c0 R15: 0000000000000155
      FS:  00007f18014f9700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020002000 CR3: 000000006be7a000 CR4: 00000000003506f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_try_coalesce net/ipv4/tcp_input.c:4651 [inline]
       tcp_try_coalesce+0x393/0x920 net/ipv4/tcp_input.c:4630
       tcp_queue_rcv+0x8a/0x6e0 net/ipv4/tcp_input.c:4914
       tcp_data_queue+0x11fd/0x4bb0 net/ipv4/tcp_input.c:5025
       tcp_rcv_established+0x81e/0x1ff0 net/ipv4/tcp_input.c:5947
       tcp_v4_do_rcv+0x65e/0x980 net/ipv4/tcp_ipv4.c:1719
       sk_backlog_rcv include/net/sock.h:1037 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2779
       release_sock+0x54/0x1b0 net/core/sock.c:3311
       sk_wait_data+0x177/0x450 net/core/sock.c:2821
       tcp_recvmsg_locked+0xe28/0x1fd0 net/ipv4/tcp.c:2457
       tcp_recvmsg+0x137/0x610 net/ipv4/tcp.c:2572
       inet_recvmsg+0x11b/0x5e0 net/ipv4/af_inet.c:850
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c4777efa ("net: add and use skb_unclone_keeptruesize() helper")
      Fixes: 097b9146 ("net: fix up truesize of cloned skb in skb_prepare_for_shift()")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      2b88cba5
    • Eric Dumazet's avatar
      net: add skb_set_end_offset() helper · 763087da
      Eric Dumazet authored
      We have multiple places where this helper is convenient,
      and we plan to use it in the following patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      763087da
    • Eric Dumazet's avatar
      ipv6: tcp: consistently use MAX_TCP_HEADER · 0ebea8f9
      Eric Dumazet authored
      All other skbs allocated for TCP tx are using MAX_TCP_HEADER already.
      
      MAX_HEADER can be too small for some cases (like eBPF based encapsulation),
      so this can avoid extra pskb_expand_head() in lower stacks.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220222031115.4005060-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      0ebea8f9
    • Maciek Machnikowski's avatar
      testptp: add option to shift clock by nanoseconds · f64ae40d
      Maciek Machnikowski authored
      Add option to shift the clock by a specified number of nanoseconds.
      
      The new argument -n will specify the number of nanoseconds to add to the
      ptp clock. Since the API doesn't support negative shifts, those need to
      be calculated by subtracting full seconds and adding a nanosecond offset.
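
      The arithmetic for a negative shift can be sketched as below. This is
      a standalone illustration rather than the testptp code: struct
      toy_shift and toy_normalize_shift() are hypothetical names, and the
      sketch assumes the adjustment interface wants a nanosecond part in the
      range [0, 1000000000).

```c
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000LL

/* Hypothetical result type: whole seconds plus a non-negative nanosecond part. */
struct toy_shift {
	int64_t sec;
	int64_t nsec;	/* in [0, TOY_NSEC_PER_SEC) after normalization */
};

static struct toy_shift toy_normalize_shift(int64_t sec, int64_t nsec)
{
	struct toy_shift s;

	/* C division truncates toward zero, so the remainder may be negative. */
	s.sec = sec + nsec / TOY_NSEC_PER_SEC;
	s.nsec = nsec % TOY_NSEC_PER_SEC;

	/* Borrow one full second to make the nanosecond part non-negative. */
	if (s.nsec < 0) {
		s.sec -= 1;
		s.nsec += TOY_NSEC_PER_SEC;
	}
	return s;
}
```

      For a requested shift of -1 ns, this yields sec = -1 and
      nsec = 999999999, which adds up to the same offset.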
      Signed-off-by: Maciek Machnikowski <maciek@machnikowski.net>
      Acked-by: Richard Cochran <richardcochran@gmail.com>
      Link: https://lore.kernel.org/r/20220221200637.125595-1-maciek@machnikowski.net
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      f64ae40d