1. 31 May, 2023 32 commits
    • Merge branch 'net-led-hw-control-api' · f209c8ec
      David S. Miller authored
      Christian Marangi says:
      
      ====================
      leds: introduce new LED hw control APIs
      
      Since this series is cross subsystem between LED and netdev,
      a stable branch was created to facilitate the merging process.
      
      This is based on top of branch ib-leds-netdev-v6.5 present here [1]
      and rebased on top of net-next since the LED stable branch got merged.
      
      This is a continuation of [2]. It was decided to take a more gradual
      approach to implementing LED support for switches and PHYs, starting with
      basic support and then implementing the hw control part once all the
      prerequisites are in place.
      
      This is the main part of the series, the one that actually implements the
      hw control API.
      
      Some history about this feature and why
      =======================================
      
      This proposal is highly requested by the entire net community, but the API
      is not strictly designed for net usage; it is meant for more generic usage.
      
      Initial versions were very flexible and designed to try to support every
      aspect of the LED driver, with many complex functions that served multiple
      purposes. There was an idea to have sw only and hw only LEDs.
      
      Following input from Andrew on the net mailing list, it was suggested
      to implement a more basic system that is easy to implement.
      
      These APIs strictly work with a designated trigger to offload their
      function.
      This may be confused with hw blink offload, but a LED may have an even more
      advanced configuration where the entire behaviour of the trigger is
      offloaded and completely handled by the hardware.
      
      An example of this usage is PHY or switch port LEDs. Almost every
      device of this kind has multiple LEDs attached that provide info about the
      current port state.
      
      Currently we lack any support for them, but these devices always provide a
      way to configure them, from basic features like turning the LED on or off
      (implemented in the previous series related to this feature) to being entirely
      driven by the hw, powering on/off/blinking based on some events, like tx/rx
      traffic, ethernet cable attached, or link speed of 10mbps, 100mbps, 1000mbps
      or more. They can also support combined logic, like blinking with traffic only
      if a particular link speed is in use (an example of this is when a LED
      is designated to be turned on only with 100mbps link speed and configured
      to blink on traffic, and a secondary LED of a different color is present to
      serve the same function but only when the link speed is 1000mbps).
      
      These cases are very common for a PHY or a switch, but they were never
      standardized, so OEMs support all kinds of variants and configurations.
      
      Again with Andrew, we compared some features and reached a common set
      of modes that are sure to be present in every kind of device.
      
      And this concludes history and why.
      
      What is present in this series
      ==============================
      
      This patch series contains the required API to support this feature. I
      decided on the name hw control to quickly describe this feature.
      
      I documented each required API in the related Documentation for leds-class,
      so I think it might be redundant to repeat them here. Feel free to tell me
      how to improve it if anything is not clear.
      
      In the abstract, this feature requires the following:
      
          - The trigger needs to make use of it. This is currently implemented
            for the netdev trigger, but other triggers can be extended if the
            device exposes these functions. An example might be anything that
            handles a storage disk and has a LED configurable to blink when
            there is any activity on the disk.
      
          - The LED driver needs to expose and implement these new APIs.
      
      Currently a LED driver supports only one trigger. The trigger should use
      the related helper to check if the LED can be driven by hardware.
      
      The different modes a trigger supports are exposed in the kernel include
      leds.h header and are used by the LED driver to understand what to do.
      
      From a user standpoint, modes are enabled as usual from sysfs, and the
      user is warned if anything is not supported.
      
      Final words and missing piece from this series
      ==============================================
      
      I honestly hope this feature can finally be implemented.
      
      This series originally also had additional modes and logic to add to the
      netdev trigger, but I decided to strip them and implement only the API
      and support for basic tx and rx. After this is merged, I will quickly propose
      these additional modes.
      
      Currently this is limited to tx and rx, and this is what the current user,
      qca8k, uses. Marvell PHYs support link and a generic blink with any kind of
      traffic (both rx and tx). The qca8k switch supports keeping the LED on based on
      link speed.
      
      The next series will add the concept of hw control only modes to the netdev
      trigger and support for these additional modes:
      - link_10
      - link_100
      - link_1000
      - activity
      
      The current implementation is voluntarily basic and limited, to lay the
      groundwork and have something easy to implement and usable. 99% of the logic
      is done on the trigger side, leaving to the LED driver only the validating
      and applying parts.
      
      As shown by the PHY LED binding, people are really interested in this
      feature: quickly after it was merged, people were already working on
      adding support for it.
      
      [1] https://git.kernel.org/pub/scm/linux/kernel/git/lee/leds.git/?h=ib-leds-netdev-6.5
      [2] https://lore.kernel.org/lkml/20230216013230.22978-1-ansuelsmth@gmail.com/
      
      Changes in v4:
      - Added review tag from Andrew.
      - Move default interval to a define to keep them synced.
      - Apply suggested reword to improve Documentation rst.
      
      Changes in v3:
      - Rebased on top of net-next
      
      Changes in v2:
      - Drop helper as currently used only by one trigger
      - Improve Documentation and document return error of some functions
      - Squash some patch to reduce series size
      - Drop trigger mode mask as currently not used
      - Rework hw control validating function to a simple implementation
      
      Changes from previous v8 series:
      - Rewrite Documentation from scratch and move to separate commit
      - Strip additional trigger modes (to propose in a different series)
      - Strip from qca8k driver additional modes (to implement in the different
        series)
      - Split the netdev changes into smaller pieces to permit easier review
      
      Changelog in the previous v8 series: (stripped of unrelated changes)
      v8:
      - Improve the documentation of the new feature
      - Rename to a more symbolic name
      - Fix some bug in netdev trigger (not using BIT())
      - Add more define for qca8k-leds driver
      - Drop interval support
      - Fix many bugs in the validate option in the netdev trigger
      v7:
      - Fix qca8k leds documentation warning
      - Remove RFC tag
      v6:
      - Back to RFC.
      - Drop additional trigger
      - Rework netdev trigger to support common modes used by switch and
        hardware only triggers
      - Refresh qca8k leds logic and driver
      v5:
      - Move out of RFC. (no comments from Andrew this is the right path?)
      - Fix more spelling mistake (thx Randy)
      - Fix error reported by kernel test bot
      - Drop the additional HW_CONTROL flag. It does simplify CONFIG
        handling and hw control should be available anyway to support
        triggers as module.
      v4:
      - Rework implementation and drop hw_configure logic.
        We now expand blink_set.
      - Address even more spelling mistake. (thx a lot Randy)
      - Drop blink option and use blink_set delay.
      v3:
      - Rework start/stop as Andrew asked.
      - Use test_bit API to check flag passed to hw_control_configure.
      - Added a new cmd to hw_control_configure to reset any active blink_mode.
      - Refactor all the patches to follow this new implementation.
      v2:
      - Fix spelling mistake (sorry)
      - Drop patch 02 "permit to declare supported offload triggers".
        Change the logic, now the LED driver declare support for them
        using the configure_offload with the cmd TRIGGER_SUPPORTED.
      - Rework code to follow this new implementation.
      - Update Documentation to better describe how this offload
        implementation work.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: qca8k: add op to get ports netdev · 4f53c27f
      Andrew Lunn authored
      In order that the LED trigger can blink the switch MAC port LEDs, it
      needs to know the netdev associated with the port. Add the callback to
      return the struct device of the netdev.
      
      Add a helper function qca8k_phy_to_port() to convert the PHY index back to
      the dsa_port index, as we reference the LED port based on the internal PHY
      index and it needs to be converted back.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: qca8k: implement hw_control ops · e0256648
      Christian Marangi authored
      Implement hw_control ops to drive Switch LEDs based on hardware events.
      
      Netdev trigger is the declared supported trigger for hw control
      operation and supports the following modes:
      - tx
      - rx
      
      When hw_control_set is called, LEDs are set to follow the requested
      mode.
      Each LED will blink at 4Hz by default.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: trigger: netdev: expose netdev trigger modes in linux include · 947acaca
      Christian Marangi authored
      Expose netdev trigger modes to make them accessible to LED drivers that
      will support the netdev trigger for hw control.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: trigger: netdev: init mode if hw control already active · 0316cc56
      Christian Marangi authored
      On netdev trigger activation, hw control may be already active by
      default. If this is the case and a device is actually provided by
      hw_control_get_device(), init the already active mode and set the
      hw_control bool to true to reflect the already set mode in the
      trigger_data.
      Co-developed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: trigger: netdev: validate configured netdev · 33ec0b53
      Andrew Lunn authored
      The netdev which the LED should blink for is configurable in
      /sys/class/leds/foo/device_name. When offloading, ensure that the
      configured netdev is the same as the netdev the LED is associated
      with. If it is not, only perform software blinking.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: trigger: netdev: add support for LED hw control · 7c145a34
      Christian Marangi authored
      Add support for LED hw control for the netdev trigger.
      
      When set_baseline_state is called to configure a new mode, the trigger
      performs various checks in the can_hw_control() function to verify whether
      hw control can be used for the requested mode.
      
      It will first check if the LED driver supports hw control for the netdev
      trigger, then will use hw_control_is_supported() and finally will call
      hw_control_set() to apply the requested mode.
      
      To use such a mode, interval MUST be set to the default value and net_dev
      MUST be set. If either of these 2 values is not valid, hw control will
      never be used and the normal software fallback is used.
      
      The default interval value is moved to a define to make sure they are
      always synced.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
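
The order of checks described in this commit can be modelled in user space. What follows is a minimal sketch under stated assumptions, not the kernel implementation: the struct and function names, the mode bits and the 50 ms default interval are all hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the kernel definitions. */
#define MODEL_DEFAULT_INTERVAL_MS 50u   /* assumed default interval */
#define MODEL_MODE_TX (1ul << 0)        /* assumed bit for "tx" */
#define MODEL_MODE_RX (1ul << 1)        /* assumed bit for "rx" */

struct led_model {
	const char *hw_control_trigger;            /* trigger the driver can offload, or NULL */
	bool (*is_supported)(unsigned long modes); /* driver-side mode validation, or NULL */
};

struct trigger_model {
	const char *name;         /* trigger name, e.g. "netdev" */
	const char *net_dev;      /* configured netdev name, or NULL if unset */
	unsigned int interval_ms; /* configured blink interval */
	unsigned long modes;      /* requested mode bits */
};

/* Example driver callback: only tx and rx can be offloaded. */
bool model_tx_rx_only(unsigned long modes)
{
	return (modes & ~(MODEL_MODE_TX | MODEL_MODE_RX)) == 0;
}

/*
 * Returns true when the requested configuration may be offloaded;
 * false means the trigger keeps blinking the LED in software.
 */
bool model_can_hw_control(const struct led_model *led,
			  const struct trigger_model *trig)
{
	/* The LED driver must expose the ops and name a supported trigger. */
	if (!led->hw_control_trigger || !led->is_supported)
		return false;
	if (strcmp(led->hw_control_trigger, trig->name) != 0)
		return false;
	/* Interval must be the default and a netdev must be configured. */
	if (trig->interval_ms != MODEL_DEFAULT_INTERVAL_MS || !trig->net_dev)
		return false;
	/* Finally, ask the driver whether it can follow the requested modes. */
	return led->is_supported(trig->modes);
}
```

In this model every failed check simply returns false, which mirrors the software fallback described above: the caller never sees an error, it just keeps driving the LED itself.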
    • leds: trigger: netdev: reject interval store for hw_control · c84c80c7
      Christian Marangi authored
      Reject interval store with hw_control enabled. Changing the interval is
      currently not supported and it MUST be set to the default value with hw
      control enabled.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: trigger: netdev: add basic check for hw control support · 6352f25f
      Christian Marangi authored
      Add basic check for hw control support. Check if the required APIs are
      defined and check if the trigger declared as supported for hw control by
      the LED driver matches netdev.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: trigger: netdev: introduce check for possible hw control · 4fd1b6d4
      Christian Marangi authored
      Introduce a function to check if the requested mode can use hw control, in
      preparation for hw control support. Currently everything is handled in
      software, so can_hw_control will always return false.
      
      Add a knob with the new value hw_control in the trigger_data struct to
      mark hw control as possible. This is useful for future implementation: in
      set_baseline_state(), to set the requested mode using the LED's hw control
      ops, and in other functions, to reject a set if hw control is currently
      active.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: trigger: netdev: refactor code setting device name · 28a6a2ef
      Andrew Lunn authored
      Move the code into a helper, ready for it to be called at
      other times. No intended behaviour change.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Documentation: leds: leds-class: Document new Hardware driven LEDs APIs · 8aa2fd7b
      Christian Marangi authored
      Document new Hardware driven LEDs APIs.
      
      Some LEDs can be programmed to be driven by hardware. This is not
      limited to blinking but also covers turning off or on autonomously.
      To support this feature, a LED needs to implement various additional
      ops and needs to declare specific support for the supported triggers.
      
      Add documentation for each required value and API to make hw control
      possible and implementable by both LEDs and triggers.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: add API to get attached device for LED hw control · 052c38eb
      Andrew Lunn authored
      Some specific LED triggers blink the LED based on events from a device
      or subsystem.
      For example, an LED could be blinked to indicate a network device is
      receiving packets, or a disk is reading blocks. To correctly enable and
      request the hw control of the LED, the trigger has to check if the
      network interface or block device configured via a /sys/class/leds file
      matches the one the LED driver provides hw control for.
      
      Provide an API call to get the device which the LED blinks for.
      Signed-off-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • leds: add APIs for LEDs hw control · ed554d3f
      Christian Marangi authored
      Add an option to permit a LED driver to declare support for a specific
      trigger to use hw control and set up the LED to blink based on specific
      provided modes.
      
      Add APIs for LEDs hw control. These functions will be used to activate
      hardware control where a LED will use the provided flags, from a
      unique defined supported trigger, to set up the LED to be driven by
      hardware.
      
      Add hw_control_is_supported() to ask the LED driver if the modes
      requested by the trigger are supported and the LED can be set up to follow
      the requested modes.
      
      Deactivate hardware blink control by setting brightness to LED_OFF via
      the brightness_set() callback.
      Signed-off-by: Christian Marangi <ansuelsmth@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: delete tipc_mtu_bad from tipc_udp_enable · 6cd8ec58
      Xin Long authored
      Since commit a4dfa72d ("tipc: set default MTU for UDP media"), it's
      been no longer using dev->mtu for b->mtu, and the issue described in
      commit 3de81b75 ("tipc: check minimum bearer MTU") doesn't exist
      in UDP bearer any more.
      
      Besides, dev->mtu can still be changed to a too small mtu after the UDP
      bearer is created even with tipc_mtu_bad() check in tipc_udp_enable().
      Note that NETDEV_CHANGEMTU event processing in tipc_l2_device_event()
      doesn't really work for UDP bearer.
      
      So this patch deletes the unnecessary tipc_mtu_bad from tipc_udp_enable.
      Signed-off-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
      Link: https://lore.kernel.org/r/282f1f5cc40e6cad385aa1c60569e6c5b70e2fb3.1685371933.git.lucien.xin@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-dsa-mv88e6xxx-add-88e6361-support' · c23515ad
      Jakub Kicinski authored
      Alexis Lothoré says:
      
      ====================
      net: dsa: mv88e6xxx: add 88E6361 support
      
      This series brings initial support for Marvell 88E6361 switch.
      
      MV88E6361 is an 8-port switch with 5 integrated Gigabit PHYs and 3
      2.5Gigabit SerDes interfaces. It is in fact a new variant in the
      88E6393X/88E6193X/88E6191X family with a subset of existing features:
      - port 0: MII, RMII, RGMII, 1000BaseX, 2500BaseX
      - port 3 to 7: triple speed internal phys
      - port 9 and 10: 1000BaseX, 2500BaseX
      
      Since said family is already well supported in mv88e6xxx driver, adding
      initial support for this new switch mostly consists in finding the ID
      exposed in its identification register, adding a proper description
      in switch description tables in mv88e6xxx driver, and enforcing 88E6361
      specificities in mv88e6393x_XXX methods.
      
      - first 4 commits introduce an internal phy offset field for switches which
        have internal phys but not starting from port 0
      - 5th commit is a fix on existing switches based on first commits
      - 6th commit is a slight modification to prepare 88E6361 support
      - last commit introduces 88E6361 support in 88E6393X family
      
      This initial support has been tested with two samples of a custom board
      with the following hardware configuration:
      - a main CPU connected to MV88E6361 using port 0 as CPU port
      - port 9 wired to a SFP cage
      - port 10 wired to a G.Hn transceiver
      
      The following setup was used:
      PC <-ethernet-> (copper SFP) - Board 1 - (G.hn) <-phone line(RJ11)-> (G.hn) Board 2
      
      Board 1 has been configured to bridge the SFP port and the G.hn port
      together, which made it possible to successfully ping Board 2 from the PC.
      ====================
      
      Link: https://lore.kernel.org/r/20230529080246.82953-1-alexis.lothore@bootlin.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mv88e6xxx: enable support for 88E6361 switch · 12899f29
      Alexis Lothoré authored
      Marvell 88E6361 is an 8-port switch derived from the
      88E6393X/88E6193X/88E6191X switch family. It can benefit from the
      existing mv88e6xxx driver by simply adding the proper switch description in
      the driver. Main differences with other switches from this
      family are:
      - 8 ports exposed (instead of 11): ports 1, 2 and 8 not available
      - No 5GBase-x nor SFI/USXGMII support
      Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mv88e6xxx: pass mv88e6xxx_chip structure to port_max_speed_mode · 18e1b742
      Alexis Lothoré authored
      Some switch families have minor differences in supported link speeds for
      ports. Instead of defining a new port_max_speed_mode for each different
      configuration, pass the mv88e6xxx_chip structure so that those chips can
      be differentiated by known chip id.
      Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mv88e6xxx: fix 88E6393X family internal phys layout · 2f934939
      Alexis Lothoré authored
      88E6393X/88E6193X/88E6191X switches have in fact 8 internal PHYs, but those
      do not start at port 0: supported ports go from 1 to 8.
      Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mv88e6xxx: add field to specify internal phys layout · 3ba89b28
      Alexis Lothoré authored
      mv88e6xxx currently assumes that switches equipped with internal PHYs have
      those PHYs mapped contiguously starting from port 0 (see
      mv88e6xxx_phy_is_internal). However, some switches have internal PHYs but
      NOT starting from port 0. For example, 88E6393X, 88E6193X and 88E6191X have
      integrated PHYs available on ports 1 to 8.
      To properly support this offset, add a new field to allow specifying an
      internal PHYs layout. If the field is not set, the default layout is assumed
      (start at port 0).
      Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
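
The offset-based layout described in this commit amounts to a range check. The sketch below is a user-space illustration only; the struct and field names are hypothetical and are not taken from the mv88e6xxx driver.

```c
#include <stdbool.h>

/* Hypothetical chip description; field names are assumptions. */
struct chip_info_model {
	unsigned int num_internal_phys;    /* how many internal PHYs exist */
	unsigned int internal_phys_offset; /* first port with a PHY; 0 if unset */
};

/* A port has an internal PHY if it falls in [offset, offset + count). */
bool model_phy_is_internal(const struct chip_info_model *info,
			   unsigned int port)
{
	return port >= info->internal_phys_offset &&
	       port < info->internal_phys_offset + info->num_internal_phys;
}
```

With 8 PHYs and an offset of 1, ports 1 to 8 are reported as internal and port 0 is not. A zero-initialised offset reproduces the old "start at port 0" behaviour, so existing chip descriptions would not need to change.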
    • net: dsa: mv88e6xxx: use mv88e6xxx_phy_is_internal in mv88e6xxx_port_ppu_updates · 7a2dd00b
      Alexis Lothoré authored
      Make sure to use the existing helper to get the internal PHYs count instead
      of redoing it manually.
      Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mv88e6xxx: pass directly chip structure to mv88e6xxx_phy_is_internal · ca345931
      Alexis Lothoré authored
      Since this function is a simple helper, we do not need to pass a full
      dsa_switch structure; we can directly pass the mv88e6xxx_chip structure.
      Doing so allows sharing this function with any other function
      not manipulating a dsa_switch structure but needing info about the number
      of internal phys.
      Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
      Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dt-bindings: net: dsa: marvell: add MV88E6361 switch to compatibility list · 9229a948
      Alexis Lothoré authored
      Marvell MV88E6361 is an 8-port switch derived from the
      88E6393X/88E6193X/88E6191X switch family. Since its functional behavior
      is very close to switches from this family, it can benefit from existing
      drivers for this family, so add it to the list of compatible switches.
      Signed-off-by: Alexis Lothoré <alexis.lothore@bootlin.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Acked-by: Conor Dooley <conor.dooley@microchip.com>
      Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'add-layer-2-miss-indication-and-filtering' · e180a33c
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      Add layer 2 miss indication and filtering
      
      tl;dr
      =====
      
      This patchset adds a single bit to the tc skb extension to indicate that
      a packet encountered a layer 2 miss in the bridge and extends flower to
      match on this metadata. This is required for non-DF (Designated
      Forwarder) filtering in EVPN multi-homing which prevents decapsulated
      BUM packets from being forwarded multiple times to the same multi-homed
      host.
      
      Background
      ==========
      
      In a typical EVPN multi-homing setup each host is multi-homed using a
      set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf
      switches in a rack. These switches act as VTEPs and are not directly
      connected (as opposed to MLAG), but can communicate with each other (as
      well as with VTEPs in remote racks) via spine switches over L3.
      
      When a host sends a BUM packet over ES1 to VTEP1, the VTEP will flood it
      to other VTEPs in the network, including those connected to the host
      over ES1. The receiving VTEPs must drop the packet and not forward it
      back to the host. This is called "split-horizon filtering" (SPH) [1].
      
      FRR configures SPH filtering using two tc filters. The first, an ingress
      filter that matches on packets received from VTEP1 and marks them using
      a fwmark (firewall mark). The second, an egress filter configured on the
      LAG interface connected to the host that matches on the fwmark and drops
      the packets. Example:
      
       # tc filter add dev vxlan0 ingress pref 1 proto all flower enc_src_ip $VTEP1_IP action skbedit mark 101
       # tc filter add dev bond0 egress pref 1 handle 101 fw action drop
      
      Motivation
      ==========
      
      For each ES, only one VTEP is elected by the control plane as the DF.
      The DF is responsible for forwarding decapsulated BUM traffic to the
      host over the ES. The non-DF VTEPs must drop such traffic as otherwise
      the host will receive multiple copies of BUM traffic. This is called
      "non-DF filtering" [2].
      
      Filtering of multicast and broadcast traffic can be achieved using the
      following flower filter:
      
       # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
      
      Unlike broadcast and multicast traffic, it is not currently possible to
      filter unknown unicast traffic. The classification into unknown unicast
      is performed by the bridge driver, but is not visible to other layers.
      
      Implementation
      ==============
      
      The proposed solution is to add a single bit to the tc skb extension
      that is set by the bridge for packets that encountered an FDB or MDB
      miss. The flower classifier is extended to be able to match on this new
      metadata bit in a similar fashion to existing metadata options such as
      'indev'.
      
      A bit that is set for every flooded packet would also work, but it does
      not allow us to differentiate between registered and unregistered
      multicast traffic which might be useful in the future.
      
      A relatively generic name is chosen for this bit - 'l2_miss' - to allow
      its use to be extended to other layer 2 devices such as VXLAN, should a
      use case arise.
      
      With the above, the control plane can implement a non-DF filter using
      the following tc filters:
      
       # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
       # tc filter add dev bond0 egress pref 2 proto all flower indev vxlan0 l2_miss true action drop
      
      The first drops broadcast and multicast traffic and the second drops
      unknown unicast traffic.
      
      Testing
      =======
      
      A test exercising the different permutations of the 'l2_miss' bit is
      added in patch #8.
      
      Patchset overview
      =================
      
      Patch #1 adds the new bit to the tc skb extension and sets it in the
      bridge driver for packets that encountered a miss. The marking of the
      packets and the use of this extension is protected by the
      'tc_skb_ext_tc' static key in order to keep performance impact to a
      minimum when the feature is not in use.
      
      Patch #2 extends the flow dissector to dissect this information from the
      tc skb extension into the 'FLOW_DISSECTOR_KEY_META' key.
      
      Patch #3 extends the flower classifier to be able to match on the new
      layer 2 miss metadata. The classifier enables the 'tc_skb_ext_tc' static
      key upon the installation of the first filter that matches on 'l2_miss'
      and disables the key upon the removal of the last filter that matches on
      it.
      
      Patch #4 rejects matching on the new metadata in drivers that already
      support the 'FLOW_DISSECTOR_KEY_META' key.
      
      Patches #5-#6 are small preparations in mlxsw.
      
      Patch #7 extends mlxsw to be able to match on layer 2 miss.
      
      Patch #8 adds a selftest.
      
      iproute2 patches can be found here [3].
      
      [1] https://datatracker.ietf.org/doc/html/rfc7432#section-8.3
      [2] https://datatracker.ietf.org/doc/html/rfc7432#section-8.5
      [3] https://github.com/idosch/iproute2/tree/submit/non_df_filter_v1
      [4] https://lore.kernel.org/netdev/20230518113328.1952135-1-idosch@nvidia.com/
      [5] https://lore.kernel.org/netdev/20230509070446.246088-1-idosch@nvidia.com/
      ====================
      
      Link: https://lore.kernel.org/r/20230529114835.372140-1-idosch@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      e180a33c
    • selftests: forwarding: Add layer 2 miss test cases · 8c33266a
      Ido Schimmel authored
      Add test cases to verify that the bridge driver correctly marks layer 2
      misses only when it should and that the flower classifier can match on
      this metadata.
      
      Example output:
      
       # ./tc_flower_l2_miss.sh
       TEST: L2 miss - Unicast                                             [ OK ]
       TEST: L2 miss - Multicast (IPv4)                                    [ OK ]
       TEST: L2 miss - Multicast (IPv6)                                    [ OK ]
       TEST: L2 miss - Link-local multicast (IPv4)                         [ OK ]
       TEST: L2 miss - Link-local multicast (IPv6)                         [ OK ]
       TEST: L2 miss - Broadcast                                           [ OK ]
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_flower: Add ability to match on layer 2 miss · caa4c58a
      Ido Schimmel authored
      Add the 'fdb_miss' key element to supported key blocks and make use of
      it to match on layer 2 miss.
      
      The key is only supported on Spectrum-{2,3,4}. An error is returned for
      Spectrum-1 since the key element is not present in any of its key
      blocks.
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_flower: Do not force matching on iif · 0b9cd74b
      Ido Schimmel authored
      Currently, mlxsw only supports the 'ingress_ifindex' field in the
      'FLOW_DISSECTOR_KEY_META' key, but subsequent patches are going to add
      support for the 'l2_miss' field as well. It is valid to only match on
      'l2_miss' without 'ingress_ifindex', so do not force matching on it.
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: spectrum_flower: Split iif parsing to a separate function · d04e2650
      Ido Schimmel authored
      Currently, mlxsw only supports the 'ingress_ifindex' field in the
      'FLOW_DISSECTOR_KEY_META' key, but subsequent patches are going to add
      support for the 'l2_miss' field as well. Split the parsing of the
      'ingress_ifindex' field to a separate function to avoid nesting. No
      functional changes intended.
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • flow_offload: Reject matching on layer 2 miss · f4356947
      Ido Schimmel authored
      Adjust drivers that support the 'FLOW_DISSECTOR_KEY_META' key to reject
      filters that try to match on the newly added layer 2 miss field. Add an
      extack message to clearly communicate the failure reason to user space.
      
      The following users were not patched:
      
      1. mtk_flow_offload_replace(): Only checks that the key is present, but
         does not do anything with it.
      2. mlx5_tc_ct_set_tuple_match(): Used as part of netfilter offload,
         which does not make use of the new field, unlike tc.
      3. get_netdev_from_rule() in nfp: Likewise.
      
      Example:
      
       # tc filter add dev swp1 egress pref 1 proto all flower skip_sw l2_miss true action drop
       Error: mlxsw_spectrum: Can't match on "l2_miss".
       We have an error talking to the kernel
      
      Acked-by: Elad Nachman <enachman@marvell.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/sched: flower: Allow matching on layer 2 miss · 1a432018
      Ido Schimmel authored
      Add the 'TCA_FLOWER_L2_MISS' netlink attribute that allows user space to
      match on packets that encountered a layer 2 miss. The miss indication is
      set as metadata in the tc skb extension by the bridge driver upon FDB or
      MDB lookup miss and dissected by the flow dissector to the
      'FLOW_DISSECTOR_KEY_META' key.
      
      The use of this skb extension is guarded by the 'tc_skb_ext_tc' static
      key. As such, enable / disable this key when filters that match on layer
      2 miss are added / deleted.
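      
      The reference counting described above can be sketched as a small
      Python model. It is not the flower implementation: FlowerTable and its
      methods are hypothetical names, and the choice to take the new
      filter's reference before dropping the old one's on replace is an
      assumption. The model only captures the rule that every filter
      matching on 'l2_miss' (whether 'true' or 'false') holds one reference
      on the static key.
      
```python
from typing import Dict, Optional


class FlowerTable:
    """Hypothetical model of filters holding references on a static key."""

    def __init__(self) -> None:
        # handle -> True if the filter matches on l2_miss (either value)
        self.filters: Dict[int, bool] = {}
        self.static_key_refcount = 0

    def add(self, handle: int, l2_miss: Optional[bool] = None) -> None:
        if handle in self.filters:
            raise KeyError(f"filter {handle} already exists")
        uses_key = l2_miss is not None
        if uses_key:
            self.static_key_refcount += 1
        self.filters[handle] = uses_key

    def replace(self, handle: int, l2_miss: Optional[bool] = None) -> None:
        uses_key = l2_miss is not None
        # Assumption: reference for the new filter is taken first ...
        if uses_key:
            self.static_key_refcount += 1
        old_uses_key = self.filters.get(handle, False)
        self.filters[handle] = uses_key
        # ... and the old filter's reference is dropped afterwards.
        if old_uses_key:
            self.static_key_refcount -= 1

    def delete(self, handle: int) -> None:
        if self.filters.pop(handle):
            self.static_key_refcount -= 1


if __name__ == "__main__":
    table = FlowerTable()
    table.add(101)                  # no l2_miss match, no reference
    table.add(102, l2_miss=True)
    table.add(103, l2_miss=False)   # matching on 'false' still counts
    print(table.static_key_refcount)
    table.replace(102, l2_miss=False)
    print(table.static_key_refcount)
    for handle in (103, 102, 101):
        table.delete(handle)
    print(table.static_key_refcount)
```
      
      The sequence in the main block follows the same add, replace and
      delete steps as the test transcript in this commit message.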
      
      Tested:
      
       # cat tc_skb_ext_tc.py
       #!/usr/bin/env -S drgn -s vmlinux
      
       refcount = prog["tc_skb_ext_tc"].key.enabled.counter.value_()
       print(f"tc_skb_ext_tc reference count is {refcount}")
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 0
      
       # tc filter add dev swp1 egress proto all handle 101 pref 1 flower src_mac 00:11:22:33:44:55 action drop
       # tc filter add dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:11:22:33:44:55 l2_miss true action drop
       # tc filter add dev swp1 egress proto all handle 103 pref 3 flower src_mac 00:11:22:33:44:55 l2_miss false action drop
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 2
      
       # tc filter replace dev swp1 egress proto all handle 102 pref 2 flower src_mac 00:01:02:03:04:05 l2_miss false action drop
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 2
      
       # tc filter del dev swp1 egress proto all handle 103 pref 3 flower
       # tc filter del dev swp1 egress proto all handle 102 pref 2 flower
       # tc filter del dev swp1 egress proto all handle 101 pref 1 flower
      
       # ./tc_skb_ext_tc.py
       tc_skb_ext_tc reference count is 0
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • flow_dissector: Dissect layer 2 miss from tc skb extension · d5ccfd90
      Ido Schimmel authored
      Extend the 'FLOW_DISSECTOR_KEY_META' key with a new 'l2_miss' field and
      populate it from a field with the same name in the tc skb extension.
      This field is set by the bridge driver for packets that incur an FDB or
      MDB miss.
      
      The next patch will extend the flower classifier to be able to match on
      layer 2 misses.
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • skbuff: bridge: Add layer 2 miss indication · 7b4858df
      Ido Schimmel authored
      For EVPN non-DF (Designated Forwarder) filtering we need to be able to
      prevent decapsulated traffic from being flooded to a multi-homed host.
      Filtering of multicast and broadcast traffic can be achieved using the
      following flower filter:
      
       # tc filter add dev bond0 egress pref 1 proto all flower indev vxlan0 dst_mac 01:00:00:00:00:00/01:00:00:00:00:00 action drop
      
      Unlike broadcast and multicast traffic, it is not currently possible to
      filter unknown unicast traffic. The classification into unknown unicast
      is performed by the bridge driver, but is not visible to other layers
      such as tc.
      
      Solve this by adding a new 'l2_miss' bit to the tc skb extension. Clear
      the bit whenever a packet enters the bridge (received from a bridge port
      or transmitted via the bridge) and set it if the packet did not match an
      FDB or MDB entry. If there is no skb extension and the bit needs to be
      cleared, then do not allocate one as no extension is equivalent to the
      bit being cleared. The bit is not set for broadcast packets as they
      never perform a lookup and therefore never incur a miss.
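      
      The marking rules in the paragraph above can be summarised with a
      small Python model. It is a sketch under stated assumptions, not the
      bridge code: Skb, TcSkbExt, miss_set() and bridge_forward() are
      hypothetical stand-ins, and the 'key_enabled' flag stands in for the
      'tc_skb_ext_tc' static key.
      
```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TcSkbExt:
    """Stand-in for the tc skb extension; only the new bit is modelled."""
    l2_miss: bool = False


@dataclass
class Skb:
    dst_is_broadcast: bool
    ext: Optional[TcSkbExt] = None


def miss_set(skb: Skb, miss: bool, key_enabled: bool) -> None:
    if not key_enabled:
        return                      # static key off: skb is not touched
    if skb.ext is not None:
        skb.ext.l2_miss = miss      # extension exists: set or clear the bit
        return
    if not miss:
        return                      # no extension already means "cleared"
    skb.ext = TcSkbExt(l2_miss=True)  # allocate only when the bit is set


def bridge_forward(skb: Skb, entry_found: bool, key_enabled: bool = True) -> None:
    # Clear the bit whenever the packet enters the bridge.
    miss_set(skb, False, key_enabled)
    if skb.dst_is_broadcast:
        return                      # broadcast never performs a lookup
    if not entry_found:
        miss_set(skb, True, key_enabled)  # FDB or MDB miss


if __name__ == "__main__":
    unknown = Skb(dst_is_broadcast=False)
    bridge_forward(unknown, entry_found=False)
    print(unknown.ext)

    bcast = Skb(dst_is_broadcast=True)
    bridge_forward(bcast, entry_found=False)
    print(bcast.ext)
```
      
      In this model a packet that carries a stale 'l2_miss' bit from an
      earlier bridge has it cleared on entry, while a packet without an
      extension stays without one unless it actually incurs a miss.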
      
      A bit that is set for every flooded packet would also work for the
      current use case, but it does not allow us to differentiate between
      registered and unregistered multicast traffic, which might be useful in
      the future.
      
      To keep the performance impact to a minimum, the marking of packets is
      guarded by the 'tc_skb_ext_tc' static key. When the key is disabled,
      the skb is not touched and an skb extension is not allocated. Instead,
      only a 5-byte nop is executed, as demonstrated below for the call site
      in br_handle_frame().
      
      Before the patch:
      
      ```
              memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
        c37b09:       49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
        c37b10:       00 00
      
              p = br_port_get_rcu(skb->dev);
        c37b12:       49 8b 44 24 10          mov    0x10(%r12),%rax
              memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
        c37b17:       49 c7 44 24 30 00 00    movq   $0x0,0x30(%r12)
        c37b1e:       00 00
        c37b20:       49 c7 44 24 38 00 00    movq   $0x0,0x38(%r12)
        c37b27:       00 00
      ```
      
      After the patch (when static key is disabled):
      
      ```
              memset(skb->cb, 0, sizeof(struct br_input_skb_cb));
        c37c29:       49 c7 44 24 28 00 00    movq   $0x0,0x28(%r12)
        c37c30:       00 00
        c37c32:       49 8d 44 24 28          lea    0x28(%r12),%rax
        c37c37:       48 c7 40 08 00 00 00    movq   $0x0,0x8(%rax)
        c37c3e:       00
        c37c3f:       48 c7 40 10 00 00 00    movq   $0x0,0x10(%rax)
        c37c46:       00
      
      #ifdef CONFIG_HAVE_JUMP_LABEL_HACK
      
      static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
      {
              asm_volatile_goto("1:"
        c37c47:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
              br_tc_skb_miss_set(skb, false);
      
              p = br_port_get_rcu(skb->dev);
        c37c4c:       49 8b 44 24 10          mov    0x10(%r12),%rax
      ```
      
      Subsequent patches will extend the flower classifier to be able to match
      on the new 'l2_miss' bit and enable / disable the static key when
      filters that match on it are added / deleted.
      
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 30 May, 2023 8 commits