1. 07 Oct, 2021 7 commits
    • Merge branch 'ethtool-add-ability-to-control-transceiver-modules-power-mode' · 4c827082
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      ethtool: Add ability to control transceiver modules' power mode
      
      This patchset extends the ethtool netlink API to allow user space to
      control transceiver modules. Two specific APIs are added, but the plan
      is to extend the interface with more APIs in the future (see "Future
      plans").
      
      This submission is a complete rework of a previous submission [1] that
      tried to achieve the same goal by allowing user space to write to the
      EEPROMs of these modules. It was rejected as it could have enabled user
      space binary blob drivers.
      
      However, the main issue is that by directly writing to some pages of
      these EEPROMs, we are interfering with the entity that is controlling
      the modules (kernel / device firmware). In addition, some functionality
      cannot be implemented solely by writing to the EEPROM, as it requires
      the assertion / de-assertion of hardware signals (e.g., "ResetL" pin in
      SFF-8636).
      
      Motivation
      ==========
      
      The kernel can currently dump the contents of module EEPROMs to user
      space via the ethtool legacy ioctl API or the new netlink API. These
      dumps can then be parsed by ethtool(8) according to the specification
      that defines the memory map of the EEPROM, for example SFF-8636 [2] for
      QSFP and CMIS [3] for QSFP-DD.
      
      In addition to read-only elements, these specifications also define
      writeable elements that can be used to control the behavior of the
      module, for example whether the module is put in low or high power
      mode to limit its power consumption.
      
      The CMIS specification even defines a message exchange mechanism (CDB,
      Command Data Block) on top of the module's memory map. This allows the
      host to send various commands to the module, for example to update its
      firmware.
      
      Implementation
      ==============
      
      The ethtool netlink API is extended with two new messages,
      'ETHTOOL_MSG_MODULE_SET' and 'ETHTOOL_MSG_MODULE_GET', that allow user
      space to set and get transceiver module parameters. Specifically, the
      'ETHTOOL_A_MODULE_POWER_MODE_POLICY' attribute allows user space to
      control the power mode policy of the module in order to limit its power
      consumption. See detailed description in patch #1.
      
      The user API is designed to be generic enough so that it could be used
      for modules with different memory maps (e.g., SFF-8636, CMIS).
      
      The only implementation of the device driver API in this series is for a
      MAC driver (mlxsw) where the module is controlled by the device's
      firmware, but it is designed to be generic enough so that it could also
      be used by implementations where the module is controlled by the kernel.
      
      Testing and introspection
      =========================
      
      See detailed description in patches #1 and #5.
      
      Patchset overview
      =================
      
      Patch #1 adds the initial infrastructure in ethtool along with the
      ability to control transceiver modules' power mode.
      
      Patches #2-#3 add required device registers in mlxsw.
      
      Patch #4 implements in mlxsw the ethtool operations added in patch #1.
      
      Patch #5 adds extended link states in order to allow user space to
      troubleshoot link down issues related to transceiver modules.
      
      Patch #6 adds support for these extended states in mlxsw.
      
      Future plans
      ============
      
      * Extend 'ETHTOOL_MSG_MODULE_SET' to control Tx output among other
      attributes.
      
      * Add new ethtool message(s) to update firmware on transceiver modules.
      
      * Extend ethtool(8) to parse more diagnostic information from CMIS
      modules. No kernel changes required.
      
      [1] https://lore.kernel.org/netdev/20210623075925.2610908-1-idosch@idosch.org/
      [2] https://members.snia.org/document/dl/26418
      [3] http://www.qsfp-dd.com/wp-content/uploads/2021/05/CMIS5p0.pdf
      
      Previous versions:
      [4] https://lore.kernel.org/netdev/20211003073219.1631064-1-idosch@idosch.org/
      [5] https://lore.kernel.org/netdev/20210824130344.1828076-1-idosch@idosch.org/
      [6] https://lore.kernel.org/netdev/20210818155202.1278177-1-idosch@idosch.org/
      [7] https://lore.kernel.org/netdev/20210809102152.719961-1-idosch@idosch.org/
      ====================
      
      Link: https://lore.kernel.org/r/20211006104647.2357115-1-idosch@idosch.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: Add support for transceiver module extended state · 235dbbec
      Ido Schimmel authored
      Add support for the transceiver module extended state and sub-state
      added in the previous patch. The extended state is meant to describe
      link issues related to transceiver modules.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ethtool: Add transceiver module extended state · 3dfb5112
      Ido Schimmel authored
      Add an extended state and sub-state to describe link issues related to
      transceiver modules.
      
      The 'ETHTOOL_LINK_EXT_SUBSTATE_MODULE_CMIS_NOT_READY' extended sub-state
      tells user space that the port is unable to gain a carrier because the
      CMIS Module State Machine did not reach the ModuleReady (Fully
      Operational) state, for example because the module is stuck in the
      ModuleLowPwr or ModuleFault state. In the latter case, user space can
      read the fault reason from the module's EEPROM and potentially reset it.
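The sub-state described above can be illustrated with a small user-space sketch that maps a CMIS Module State value to a troubleshooting hint. The values 0x01 (ModuleLowPwr) and 0x03 (ModuleReady) appear in the 'ethtool -m' dumps in this series; the other encodings are assumptions taken from the CMIS state table and should be checked against the specification. The helper name `describe_no_carrier` is hypothetical and is not part of ethtool or the kernel.

```python
from enum import IntEnum


class ModuleState(IntEnum):
    """CMIS Module State encodings (0x02, 0x04, 0x05 are assumed)."""
    LOW_PWR = 0x01   # ModuleLowPwr
    PWR_UP = 0x02    # ModulePwrUp
    READY = 0x03     # ModuleReady (Fully Operational)
    PWR_DN = 0x04    # ModulePwrDn
    FAULT = 0x05     # ModuleFault


def describe_no_carrier(state: ModuleState) -> str:
    """Hypothetical helper: hint for a port that has no carrier."""
    if state is ModuleState.READY:
        return "module is ready; look elsewhere for the link issue"
    if state is ModuleState.FAULT:
        return "CMIS not ready: read the fault reason from the EEPROM, consider a reset"
    return f"CMIS not ready: module is in {state.name}"


print(describe_no_carrier(ModuleState(0x01)))
```

Only the ModuleReady state is treated as "ready" here, which mirrors the wording of the sub-state; a real tool would read the state from the module's EEPROM rather than take it as an argument.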
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: Add ability to control transceiver modules' power mode · 0455dc50
      Ido Schimmel authored
      Implement support for ethtool_ops::get_module_power_mode and
      ethtool_ops::set_module_power_mode.
      
      The get operation is implemented using the Management Cable IO and
      Notifications (MCION) register that reports the operational power mode
      of the module and its presence. In case a module is not present, its
      operational power mode is not reported to ethtool and user space. If not
      set before, the power mode policy is reported as "high", which is the
      default on Mellanox systems.
      
      The set operation is implemented using the Port Module Memory Map
      Properties (PMMP) register. The register instructs the device's firmware
      to transition a plugged-in module to / out of low power mode by writing
      to its memory map.
      
      When the power mode policy is set to 'auto', a module will not
      transition to low power mode as long as any ports using it are
      administratively up. Example:
      
       # devlink port split swp11 count 4
      
       # ethtool --set-module swp11s0 power-mode-policy auto
      
       $ ethtool --show-module swp11s0
       Module parameters for swp11s0:
       power-mode-policy auto
       power-mode low
      
       # ip link set dev swp11s0 up
      
       # ip link set dev swp11s1 up
      
       $ ethtool --show-module swp11s0
       Module parameters for swp11s0:
       power-mode-policy auto
       power-mode high
      
       # ip link set dev swp11s1 down
      
       $ ethtool --show-module swp11s0
       Module parameters for swp11s0:
       power-mode-policy auto
       power-mode high
      
       # ip link set dev swp11s0 down
      
       $ ethtool --show-module swp11s0
       Module parameters for swp11s0:
       power-mode-policy auto
       power-mode low
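The '--show-module' output shown in the transcript is regular enough to be consumed by scripts. The following is a minimal sketch of a parser for the text format, assuming the layout is exactly as printed above (a header line followed by "name value" pairs); the function name `parse_show_module` is hypothetical.

```python
def parse_show_module(text):
    """Parse `ethtool --show-module` text output into a dict of strings."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    prefix = "Module parameters for "
    if not lines or not lines[0].startswith(prefix) or not lines[0].endswith(":"):
        raise ValueError("unexpected or missing header line")
    result = {"dev": lines[0][len(prefix):-1]}
    for ln in lines[1:]:
        # Each parameter line is "<name> <value>", e.g. "power-mode low".
        key, _, value = ln.partition(" ")
        result[key] = value.strip()
    return result


sample = """\
Module parameters for swp11s0:
power-mode-policy auto
power-mode low
"""
print(parse_show_module(sample))
# {'dev': 'swp11s0', 'power-mode-policy': 'auto', 'power-mode': 'low'}
```

Because the operational power mode is not reported when no module is present, callers should look up 'power-mode' with `dict.get()` rather than assume the key exists.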
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: reg: Add Management Cable IO and Notifications register · fc53f5fb
      Ido Schimmel authored
      Add the Management Cable IO and Notifications register. Subsequent
      patches will use it to retrieve the power mode status of a module and
      to determine whether a module is present in a cage.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: reg: Add Port Module Memory Map Properties register · f10ba086
      Ido Schimmel authored
      Add the Port Module Memory Map Properties register. It will be used to
      set the power mode of a module in subsequent patches.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • ethtool: Add ability to control transceiver modules' power mode · 353407d9
      Ido Schimmel authored
      Add a pair of new ethtool messages, 'ETHTOOL_MSG_MODULE_SET' and
      'ETHTOOL_MSG_MODULE_GET', that can be used to control transceiver
      module parameters and retrieve their status.
      
      The first parameter to control is the power mode of the module. It is
      only relevant for paged memory modules, as flat memory modules always
      operate in low power mode.
      
      When a paged memory module is in low power mode, its power consumption
      is reduced to the minimum, the management interface towards the host is
      available and the data path is deactivated.
      
      User space can choose to put modules that are not currently in use in
      low power mode and transition them to high power mode before putting the
      associated ports administratively up. This is useful for user space that
      favors reduced power consumption and lower temperatures over reduced
      link up times. In QSFP-DD modules the transition from low power mode to
      high power mode can take a few seconds and this transition is only
      expected to get longer with future / more complex modules.
      
      User space can control the power mode of the module via the power mode
      policy attribute ('ETHTOOL_A_MODULE_POWER_MODE_POLICY'). Possible
      values:
      
      * high: Module is always in high power mode.
      
      * auto: Module is transitioned by the host to high power mode when the
        first port using it is put administratively up and to low power mode
        when the last port using it is put administratively down.
      
      The operational power mode of the module is available to user space via
      the 'ETHTOOL_A_MODULE_POWER_MODE' attribute. The attribute is not
      reported to user space when a module is not plugged-in.
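The policy semantics described above can be summarized as a toy model. This is only an illustrative user-space sketch, not the kernel or driver implementation: the class `Module` and its methods are hypothetical, and the initial policy of "high" is an assumption based on the default that the mlxsw patch in this series reports.

```python
class Module:
    """Toy model of one transceiver module shared by several ports."""

    def __init__(self, plugged_in=True):
        self.policy = "high"      # assumed default policy
        self.plugged_in = plugged_in
        self.ports_up = set()     # names of administratively up ports

    def set_policy(self, policy):
        if policy not in ("high", "auto"):
            raise ValueError(f"unknown power mode policy: {policy}")
        self.policy = policy

    def set_port(self, name, up):
        if up:
            self.ports_up.add(name)
        else:
            self.ports_up.discard(name)

    def power_mode(self):
        """Operational power mode, or None when no module is plugged in."""
        if not self.plugged_in:
            return None
        # 'high' pins the module in high power mode; 'auto' follows the
        # administrative state of the ports that use the module.
        if self.policy == "high" or self.ports_up:
            return "high"
        return "low"


module = Module()
module.set_policy("auto")
print(module.power_mode())        # low: no port is administratively up
module.set_port("swp11s0", True)
module.set_port("swp11s1", True)
module.set_port("swp11s1", False)
print(module.power_mode())        # high: swp11s0 is still up
```

Returning None for an unplugged module mirrors the statement that the operational power mode attribute is not reported in that case.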
      
      The user API is designed to be generic enough so that it could be used
      for modules with different memory maps (e.g., SFF-8636, CMIS).
      
      The only implementation of the device driver API in this series is for a
      MAC driver (mlxsw) where the module is controlled by the device's
      firmware, but it is designed to be generic enough so that it could also
      be used by implementations where the module is controlled by the CPU.
      
      CMIS testing
      ============
      
       # ethtool -m swp11
       Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
       ...
       Module State                              : 0x03 (ModuleReady)
       LowPwrAllowRequestHW                      : Off
       LowPwrRequestSW                           : Off
      
      The module is not in low power mode, as it is not forced by hardware
      (LowPwrAllowRequestHW is off) or by software (LowPwrRequestSW is off).
      
      The power mode can be queried from the kernel. If
      LowPwrAllowRequestHW were on, the kernel would need to take into account
      the state of the LowPwrRequestHW signal, which is not visible to user
      space.
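The reasoning in the two paragraphs above can be written as a one-line predicate. This is a simplified sketch that only covers the three inputs mentioned in the text; it ignores reset and fault handling, and the function and parameter names are hypothetical.

```python
def cmis_low_power(low_pwr_request_sw: bool,
                   low_pwr_allow_request_hw: bool,
                   low_pwr_request_hw_signal: bool) -> bool:
    """Return True if the module is being requested into low power mode."""
    # A software request always forces low power. The hardware signal only
    # counts when the module is configured to honor it.
    return low_pwr_request_sw or (
        low_pwr_allow_request_hw and low_pwr_request_hw_signal
    )


# Values from the first EEPROM dump: both bits are off, so the state of
# the hardware signal does not matter.
print(cmis_low_power(False, False, True))   # False
```

The third argument is the piece of state that user space cannot read from the EEPROM, which is why the operational power mode is queried from the kernel instead.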
      
       $ ethtool --show-module swp11
       Module parameters for swp11:
       power-mode-policy high
       power-mode high
      
      Change the power mode policy to 'auto':
      
       # ethtool --set-module swp11 power-mode-policy auto
      
      Query the power mode again:
      
       $ ethtool --show-module swp11
       Module parameters for swp11:
       power-mode-policy auto
       power-mode low
      
      Verify with the data read from the EEPROM:
      
       # ethtool -m swp11
       Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
       ...
       Module State                              : 0x01 (ModuleLowPwr)
       LowPwrAllowRequestHW                      : Off
       LowPwrRequestSW                           : On
      
      Put the associated port administratively up, which will instruct the
      host to transition the module to high power mode:
      
       # ip link set dev swp11 up
      
      Query the power mode again:
      
       $ ethtool --show-module swp11
       Module parameters for swp11:
       power-mode-policy auto
       power-mode high
      
      Verify with the data read from the EEPROM:
      
       # ethtool -m swp11
       Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
       ...
       Module State                              : 0x03 (ModuleReady)
       LowPwrAllowRequestHW                      : Off
       LowPwrRequestSW                           : Off
      
      Put the associated port administratively down, which will instruct the
      host to transition the module to low power mode:
      
       # ip link set dev swp11 down
      
      Query the power mode again:
      
       $ ethtool --show-module swp11
       Module parameters for swp11:
       power-mode-policy auto
       power-mode low
      
      Verify with the data read from the EEPROM:
      
       # ethtool -m swp11
       Identifier                                : 0x18 (QSFP-DD Double Density 8X Pluggable Transceiver (INF-8628))
       ...
       Module State                              : 0x01 (ModuleLowPwr)
       LowPwrAllowRequestHW                      : Off
       LowPwrRequestSW                           : On
      
      SFF-8636 testing
      ================
      
       # ethtool -m swp13
       Identifier                                : 0x11 (QSFP28)
       ...
       Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
       Power set                                 : Off
       Power override                            : On
       ...
       Transmit avg optical power (Channel 1)    : 0.7733 mW / -1.12 dBm
       Transmit avg optical power (Channel 2)    : 0.7649 mW / -1.16 dBm
       Transmit avg optical power (Channel 3)    : 0.7790 mW / -1.08 dBm
       Transmit avg optical power (Channel 4)    : 0.7837 mW / -1.06 dBm
       Rcvr signal avg optical power(Channel 1)  : 0.9302 mW / -0.31 dBm
       Rcvr signal avg optical power(Channel 2)  : 0.9079 mW / -0.42 dBm
       Rcvr signal avg optical power(Channel 3)  : 0.8993 mW / -0.46 dBm
       Rcvr signal avg optical power(Channel 4)  : 0.8778 mW / -0.57 dBm
      
      The module is not in low power mode, as it is not forced by hardware
      (Power override is on) or by software (Power set is off).
      
      The power mode can be queried from the kernel. If Power override
      were off, the kernel would need to take into account the state of the
      LPMode signal, which is not visible to user space.
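The SFF-8636 case described above follows a similar pattern: with Power override on, the Power set bit decides; with it off, the LPMode signal decides. A minimal sketch under that reading, ignoring the high power class enable bits and using hypothetical names:

```python
def sff8636_low_power(power_override: bool,
                      power_set: bool,
                      lpmode_signal: bool) -> bool:
    """Return True if the module is being held in low power mode."""
    if power_override:
        # Software control: the Power set bit selects low power mode.
        return power_set
    # Hardware control: the LPMode signal decides.
    return lpmode_signal


# Values from the first EEPROM dump: Power override on, Power set off.
print(sff8636_low_power(True, False, False))   # False
```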
      
       $ ethtool --show-module swp13
       Module parameters for swp13:
       power-mode-policy high
       power-mode high
      
      Change the power mode policy to 'auto':
      
       # ethtool --set-module swp13 power-mode-policy auto
      
      Query the power mode again:
      
       $ ethtool --show-module swp13
       Module parameters for swp13:
       power-mode-policy auto
       power-mode low
      
      Verify with the data read from the EEPROM:
      
       # ethtool -m swp13
       Identifier                                : 0x11 (QSFP28)
       Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
       Power set                                 : On
       Power override                            : On
       ...
       Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
       Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
       Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
       Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm
      
      Put the associated port administratively up, which will instruct the
      host to transition the module to high power mode:
      
       # ip link set dev swp13 up
      
      Query the power mode again:
      
       $ ethtool --show-module swp13
       Module parameters for swp13:
       power-mode-policy auto
       power-mode high
      
      Verify with the data read from the EEPROM:
      
       # ethtool -m swp13
       Identifier                                : 0x11 (QSFP28)
       ...
       Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) enabled
       Power set                                 : Off
       Power override                            : On
       ...
       Transmit avg optical power (Channel 1)    : 0.7934 mW / -1.01 dBm
       Transmit avg optical power (Channel 2)    : 0.7859 mW / -1.05 dBm
       Transmit avg optical power (Channel 3)    : 0.7885 mW / -1.03 dBm
       Transmit avg optical power (Channel 4)    : 0.7985 mW / -0.98 dBm
       Rcvr signal avg optical power(Channel 1)  : 0.9325 mW / -0.30 dBm
       Rcvr signal avg optical power(Channel 2)  : 0.9034 mW / -0.44 dBm
       Rcvr signal avg optical power(Channel 3)  : 0.9086 mW / -0.42 dBm
       Rcvr signal avg optical power(Channel 4)  : 0.8885 mW / -0.51 dBm
      
      Put the associated port administratively down, which will instruct the
      host to transition the module to low power mode:
      
       # ip link set dev swp13 down
      
      Query the power mode again:
      
       $ ethtool --show-module swp13
       Module parameters for swp13:
       power-mode-policy auto
       power-mode low
      
      Verify with the data read from the EEPROM:
      
       # ethtool -m swp13
       Identifier                                : 0x11 (QSFP28)
       ...
       Extended identifier description           : 5.0W max. Power consumption,  High Power Class (> 3.5 W) not enabled
       Power set                                 : On
       Power override                            : On
       ...
       Transmit avg optical power (Channel 1)    : 0.0000 mW / -inf dBm
       Transmit avg optical power (Channel 2)    : 0.0000 mW / -inf dBm
       Transmit avg optical power (Channel 3)    : 0.0000 mW / -inf dBm
       Transmit avg optical power (Channel 4)    : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 1)  : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 2)  : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 3)  : 0.0000 mW / -inf dBm
       Rcvr signal avg optical power(Channel 4)  : 0.0000 mW / -inf dBm
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 06 Oct, 2021 10 commits
  3. 05 Oct, 2021 23 commits