Commits · 89bddde389a8a02b678dcb49bd8a10e341b018e5 · Kirill Smelkov / linux

24 Jun, 2021 33 commits

David S. Miller authored Jun 24, 2021

Bailey Forrest says:

====================
gve: Introduce DQO descriptor format

DQO is the descriptor format for our next generation virtual NIC. The existing
descriptor format will be referred to as "GQI" in the patch set.

One major change with DQO is it uses dual descriptor rings for both TX and RX
queues.

The TX path uses a TX queue to send descriptors to HW, and receives packet
completion events on a TX completion queue.

The RX path posts buffers to HW using an RX buffer queue and receives incoming
packets on an RX queue.

One important note is that DQO descriptors and doorbells are little endian. We
continue to use the existing big endian control plane infrastructure.

The general format of the patch series is:
- Refactor existing code/data structures to be shared by DQO
- Expand admin queues to support DQO device setup
- Expand data structures and device setup to support DQO
- Add logic to setup DQO queues
- Implement datapath
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

89bddde3

gve: DQO: Add RX path · 9b8dd5e5

Bailey Forrest authored Jun 24, 2021

The RX queue has an array of `gve_rx_buf_state_dqo` objects. All
allocated pages have an associated buf_state object. When a buffer is
posted on the RX buffer queue, the buffer ID will be the buf_state's
index into the RX queue's array.

On packet reception, the RX queue will have one descriptor for each
buffer associated with a received packet. Each RX descriptor will have
a buffer_id that was posted on the buffer queue.

Notable mentions:

- We use a default buffer size of 2048 bytes. Based on page size, we
  may post separate sections of a single page as separate buffers.

- The driver holds an extra reference on pages passed up the receive
  path with an skb and keeps these pages on a list. When posting new
  buffers to the NIC, we check if any of these pages has only our
  reference, or another buffer sized segment of the page has no
  references. If so, it is free to reuse. This page recycling approach
  is a common netdev optimization that reduces page alloc/free calls.

- Pages in the free list have a page_count bias in order to avoid an
  atomic increment of pagecount every time we attempt to reuse a page.
  # references = page_count() - bias

- In order to track when a page is safe to reuse, we keep track of the
  last offset which had a single SKB reference. When this occurs, it
  implies that every single other offset is reusable. Otherwise, we
  don't know if offsets can be safely reused.

- We maintain two free lists of pages. List #1 (recycled_buf_states)
  contains pages we know can be reused right away. List #2
  (used_buf_states) contains pages which cannot be used right away. We
  only attempt to get pages from list #2 when list #1 is empty. We only
  attempt to use a small fixed number pages from list #2 before giving
  up and allocating a new page. Both lists are FIFOs in hope that by the
  time we attempt to reuse a page, the references were dropped.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9b8dd5e5

gve: DQO: Add TX path · a57e5de4

Bailey Forrest authored Jun 24, 2021

TX SKBs will have their buffers DMA mapped with the device. Each buffer
will have at least one TX descriptor associated. Each SKB will also have
a metadata descriptor.

Each TX queue maintains an array of `gve_tx_pending_packet_dqo` objects.
Every TX SKB will have an associated pending_packet object. A TX SKB's
descriptors will use its pending_packet's index as the completion tag,
which will be returned on the TX completion queue.

The device implements a "flow-miss model". Most packets will simply
receive a packet completion. The flow-miss system may choose to process
a packet based on its contents. A TX packet which experiences a flow
miss would receive a miss completion followed by a later reinjection
completion. The miss-completion is received when the packet starts to be
processed by the flow-miss system and the reinjection completion is
received when the flow-miss system completes processing the packet and
sends it on the wire.

Notable mentions:

- Buffers may be freed after receiving the miss-completion, but in order
  to avoid packet reordering, we do not complete the SKB until receiving
  the reinjection completion.

- The driver must robustly handle the unlikely scenario where a miss
  completion does not have an associated reinjection completion. This is
  accomplished by maintaining a list of packets which have a pending
  reinjection completion. After a short timeout (5 seconds), the
  SKB and buffers are released and the pending_packet is moved to a
  second list which has a longer timeout (60 seconds), where the
  pending_packet will not be reused. When the longer timeout elapses,
  the driver may assume the reinjection completion would never be
  received and the pending_packet may be reused.

- Completion handling is triggered by an interrupt and is done in the
  NAPI poll function. Because the TX path and completion exist in
  different threading contexts they maintain their own lists for free
  pending_packet objects. The TX path uses a lock-free approach to steal
  the list from the completion path.

- Both the TSO context and general context descriptors have metadata
  bytes. The device requires that if multiple descriptors contain the
  same field, each descriptor must have the same value set for that
  field.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a57e5de4

gve: DQO: Configure interrupts on device up · 0dcc144a

Bailey Forrest authored Jun 24, 2021

When interrupts are first enabled, we also set the ratelimits, which
will be static for the entire usage of the device.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0dcc144a

gve: DQO: Add ring allocation and initialization · 9c1a59a2

Bailey Forrest authored Jun 24, 2021

Allocate the buffer and completion ring structures. Do not populate the
rings yet. That will happen in the respective rx and tx datapath
follow-on patches
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c1a59a2

gve: DQO: Add core netdev features · 5e8c5adf

Bailey Forrest authored Jun 24, 2021

Add napi netdev device registration, interrupt handling and initial tx
and rx polling stubs. The stubs will be filled in follow-on patches.

Also:
- LRO feature advertisement and handling
- Also update ethtool logic
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e8c5adf

gve: Update adminq commands to support DQO queues · 1f6228e4

Bailey Forrest authored Jun 24, 2021

DQO queue creation requires additional parameters:
- TX completion/RX buffer queue size
- TX completion/RX buffer queue address
- TX/RX queue size
- RX buffer size
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1f6228e4

gve: Add DQO fields for core data structures · a4aa1f1e

Bailey Forrest authored Jun 24, 2021

- Add new DQO datapath structures:
  - `gve_rx_buf_queue_dqo`
  - `gve_rx_compl_queue_dqo`
  - `gve_rx_buf_state_dqo`
  - `gve_tx_desc_dqo`
  - `gve_tx_pending_packet_dqo`

- Incorporate these into the existing ring data structures:
  - `gve_rx_ring`
  - `gve_tx_ring`

Noteworthy mentions:

- `gve_rx_buf_state` represents an RX buffer which was posted to HW.
  Each RX queue has an array of these objects and the index into the
  array is used as the buffer_id when posted to HW.

- `gve_tx_pending_packet_dqo` is treated similarly for TX queues. The
  completion_tag is the index into the array.

- These two structures have links for linked lists which are represented
  by 16b indexes into a contiguous array of these structures.
  This reduces memory footprint compared to 64b pointers.

- We use unions for the writeable datapath structures to reduce cache
  footprint. GQI specific members will renamed like DQO members in a
  future patch.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a4aa1f1e

gve: Add dqo descriptors · 22319818

Bailey Forrest authored Jun 24, 2021

General description of rings and descriptors:

TX ring is used for sending TX packet buffers to the NIC. It has the
following descriptors:
- `gve_tx_pkt_desc_dqo` - Data buffer descriptor
- `gve_tx_tso_context_desc_dqo` - TSO context descriptor
- `gve_tx_general_context_desc_dqo` - Generic metadata descriptor

Metadata is a collection of 12 bytes. We define `gve_tx_metadata_dqo`
which represents the logical interpetation of the metadata bytes. It's
helpful to define this structure because the metadata bytes exist in
multiple descriptor types (including `gve_tx_tso_context_desc_dqo`),
and the device requires same field has the same value in all
descriptors.

The TX completion ring is used to receive completions from the NIC.
Having a separate ring allows for completions to be out of order. The
completion descriptor `gve_tx_compl_desc` has several different types,
most important are packet and descriptor completions. Descriptor
completions are used to notify the driver when descriptors sent on the
TX ring are done being consumed. The descriptor completion is only used
to signal that space is cleared in the TX ring. A packet completion will
be received when a packet transmitted on the TX queue is done being
transmitted.

In addition there are "miss" and "reinjection" completions. The device
implements a "flow-miss model". Most packets will simply receive a
packet completion. The flow-miss system may choose to process a packet
based on its contents. A TX packet which experiences a flow miss would
receive a miss completion followed by a later reinjection completion.
The miss-completion is received when the packet starts to be processed
by the flow-miss system and the reinjection completion is received when
the flow-miss system completes processing the packet and sends it on the
wire.

The RX buffer ring is used to send buffers to HW via the
`gve_rx_desc_dqo` descriptor.

Received packets are put into the RX queue by the device, which
populates the `gve_rx_compl_desc_dqo` descriptor. The RX descriptors
refer to buffers posted by the buffer queue. Received buffers may be
returned out of order, such as when HW LRO is enabled.

Important concepts:
- "TX" and "RX buffer" queues, which send descriptors to the device, use
  MMIO doorbells to notify the device of new descriptors.

- "RX" and "TX completion" queues, which receive descriptors from the
  device, use a "generation bit" to know when a descriptor was populated
  by the device. The driver initializes all bits with the "current
  generation". The device will populate received descriptors with the
  "next generation" which is inverted from the current generation. When
  the ring wraps, the current/next generation are swapped.

- It's the driver's responsibility to ensure that the RX and TX
  completion queues are not overrun. This can be accomplished by
  limiting the number of descriptors posted to HW.

- TX packets have a 16 bit completion_tag and RX buffers have a 16 bit
  buffer_id. These will be returned on the TX completion and RX queues
  respectively to let the driver know which packet/buffer was completed.

Bitfields are used to describe descriptor fields. This notation is more
concise and readable than shift-and-mask. It is possible because the
driver is restricted to little endian platforms.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22319818

gve: Add support for DQO RX PTYPE map · c4b87ac8

Bailey Forrest authored Jun 24, 2021

Unlike GQI, DQO RX descriptors do not contain the L3 and L4 type of the
packet. L3 and L4 types are necessary in order to set the hash and csum
on RX SKBs correctly.

DQO RX descriptors instead contain a 10 bit PTYPE index. The PTYPE map
enables the device to tell the driver how to map from PTYPE index to
L3/L4 type.

The device doesn't provide any guarantees about the range of possible
PTYPEs, so we just use a 1024 entry array to implement a fast mapping
structure.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c4b87ac8

gve: adminq: DQO specific device descriptor logic · 5ca2265e

Bailey Forrest authored Jun 24, 2021

- In addition to TX and RX queues, DQO has TX completion and RX buffer
  queues.
  - TX completions are received when the device has completed sending a
    packet on the wire.
  - RX buffers are posted on a separate queue form the RX completions.
- DQO descriptor rings are allowed to be smaller than PAGE_SIZE.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5ca2265e

gve: Introduce per netdev `enum gve_queue_format` · a5886ef4

Bailey Forrest authored Jun 24, 2021

The currently supported queue formats are:
- GQI_RDA - GQI with raw DMA addressing
- GQI_QPL - GQI with queue page list
- DQO_RDA - DQO with raw DMA addressing

The old `gve_priv.raw_addressing` value is only used for GQI_RDA, so we
remove it in favor of just checking against GQI_RDA
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a5886ef4

gve: Introduce a new model for device options · 8a39d3e0

Bailey Forrest authored Jun 24, 2021

The current model uses an integer ID and a fixed size struct for the
parameters of each device option.

The new model allows the device option structs to grow in size over
time. A driver may assume that changes to device option structs will
always be appended.

New device options will also generally have a
`supported_features_mask` so that the driver knows which fields within a
particular device option are enabled.

`gve_device_option.feat_mask` is changed to `required_features_mask`,
and it is a bitmask which must match the value expected by the driver.
This gives the device the ability to break backwards compatibility with
old drivers for certain features by blocking the old drivers from trying
to use the feature.

We maintain ABI compatibility with the old model for
GVE_DEV_OPT_ID_RAW_ADDRESSING in case a driver is using a device which
does not support the new model.

This patch introduces some new terminology:

RDA - Raw DMA Addressing - Buffers associated with SKBs are directly DMA
      mapped and read/updated by the device.

QPL - Queue Page Lists - Driver uses bounce buffers which are DMA mapped
      with the device for read/write and data is copied from/to SKBs.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8a39d3e0

gve: Make gve_rx_slot_page_info.page_offset an absolute offset · 920fb451

Bailey Forrest authored Jun 24, 2021

Using `page_offset` like a boolean means a page may only be split into
two sections. With page sizes larger than 4k, this can be very wasteful.
Future commits in this patchset use `struct gve_rx_slot_page_info` in a
way which supports a fixed buffer size and a variable page size.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

920fb451

gve: gve_rx_copy: Move padding to an argument · 35f9b2f4

Bailey Forrest authored Jun 24, 2021

Future use cases will have a different padding value.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

35f9b2f4

gve: Move some static functions to a common file · dbdaa675

Bailey Forrest authored Jun 24, 2021

These functions will be shared by the GQI and DQO variants of the GVNIC
driver as of follow-up patches in this series.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dbdaa675

gve: Update GVE documentation to describe DQO · c6a7ed77

Bailey Forrest authored Jun 24, 2021

DQO is a new descriptor format for our next generation virtual NIC.
Signed-off-by: Bailey Forrest <bcf@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c6a7ed77

usbnet: add usbnet_event_names[] for kevent · 47889068

Yajun Deng authored Jun 24, 2021

Modify the netdev_dbg content from int to char * in usbnet_defer_kevent(),
this looks more readable.
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>

47889068

Merge branch 'add-sparx5i-driver' · 67faf76d

David S. Miller authored Jun 24, 2021

Steen Hegelund says:

====================
Adding the Sparx5i Switch Driver

This series provides the Microchip Sparx5i Switch Driver

The SparX-5 Enterprise Ethernet switch family provides a rich set of
Enterprise switching features such as advanced TCAM-based VLAN and QoS
processing enabling delivery of differentiated services, and security
through TCAMbased frame processing using versatile content aware processor
(VCAP). IPv4/IPv6 Layer 3 (L3) unicast and multicast routing is supported
with up to 18K IPv4/9K IPv6 unicast LPM entries and up to 9K IPv4/3K IPv6
(S,G) multicast groups. L3 security features include source guard and
reverse path forwarding (uRPF) tasks. Additional L3 features include
VRF-Lite and IP tunnels (IP over GRE/IP).

The SparX-5 switch family features a highly flexible set of Ethernet ports
with support for 10G and 25G aggregation links, QSGMII, USGMII, and
USXGMII. The device integrates a powerful 1 GHz dual-core ARM® Cortex®-A53
CPU enabling full management of the switch and advanced Enterprise
applications.

The SparX-5 switch family targets managed Layer 2 and Layer 3 equipment in
SMB, SME, and Enterprise where high port count 1G/2.5G/5G/10G switching
with 10G/25G aggregation links is required.

The SparX-5 switch family consists of following SKUs:

VSC7546 SparX-5-64 supports up to 64 Gbps of bandwidth with the following
primary port configurations.
- 6 ×10G
- 16 × 2.5G + 2 × 10G
- 24 × 1G + 4 × 10G

VSC7549 SparX-5-90 supports up to 90 Gbps of bandwidth with the following
primary port configurations.
- 9 × 10G
- 16 × 2.5G + 4 × 10G
- 48 × 1G + 4 × 10G

VSC7552 SparX-5-128 supports up to 128 Gbps of bandwidth with the
following primary port configurations.
- 12 × 10G
- 6 x 10G + 2 x 25G
- 16 × 2.5G + 8 × 10G
- 48 × 1G + 8 × 10G

VSC7556 SparX-5-160 supports up to 160 Gbps of bandwidth with the
following primary port configurations.
- 16 × 10G
- 10 × 10G + 2 × 25G
- 16 × 2.5G + 10 × 10G
- 48 × 1G + 10 × 10G

VSC7558 SparX-5-200 supports up to 200 Gbps of bandwidth with the
following primary port configurations.
- 20 × 10G
- 8 × 25G

In addition, the device supports one 10/100/1000/2500/5000 Mbps
SGMII/SerDes node processor interface (NPI) Ethernet port.

Time sensitive networking (TSN) is supported through a comprehensive set of
features including frame preemption, cut-through, frame replication and
elimination for reliability, enhanced scheduling: credit-based shaping,
time-aware shaping, cyclic queuing, and forwarding, and per-stream policing
and filtering.

Together with IEEE 1588 and IEEE 802.1AS support, this guarantees
low-latency deterministic networking for Industrial Ethernet.

The Sparx5i support is developed on the PCB134 and PCB135 evaluation boards.

- PCB134 main networking features:
- 12x SFP+ front 10G module slots (connected to Sparx5i through SFI).
- 8x SFP28 front 25G module slots (connected to Sparx5i through SFI high
speed).
- Optional, one additional 10/100/1000BASE-T (RJ45) Ethernet port
(on-board VSC8211 PHY connected to Sparx5i through SGMII).

- PCB135 main networking features:
- 48x1G (10/100/1000M) RJ45 front ports using 12xVSC8514 QuadPHY’s each
connected to VSC7558 through QSGMII.
- 4x10G (1G/2.5G/5G/10G) RJ45 front ports using the AQR407 10G QuadPHY
each port connects to VSC7558 through SFI.
- 4x SFP28 25G module slots on back connected to VSC7558 through SFI high
speed.
- Optional, one additional 1G (10/100/1000M) RJ45 port using an on-board
VSC8211 PHY, which can be connected to VSC7558 NPI port through SGMII
using a loopback add-on PCB)

This series provides support for:
- SFPs and DAC cables via PHYLINK with a number of 5G, 10G and 25G
devices and media types.
- Port module configuration for 10M to 25G speeds with SGMII, QSGMII,
1000BASEX, 2500BASEX and 10GBASER as appropriate for these modes.
- SerDes configuration via the Sparx5i SerDes driver (see below).
- Host mode providing register based injection and extraction.
- Switch mode providing MAC/VLAN table learning and Layer2 switching
offloaded to the Sparx5i switch.
- STP state, VLAN support, host/bridge port mode, Forwarding DB, and
configuration and statistics via ethtool.

More support will be added at a later stage.

The Sparx5i Chip Register Model can be browsed at this location:
https://github.com/microchip-ung/sparx-5_reginfo
and the datasheet is available here:
https://ww1.microchip.com/downloads/en/DeviceDoc/SparX-5_Family_L2L3_Enterprise_10G_Ethernet_Switches_Datasheet_00003822B.pdf

The series depends on the following series currently on their way
into the kernel:

- 25G Base-R phy mode
Link: https://lore.kernel.org/r/20210611125453.313308-1-steen.hegelund@microchip.com/
- Sparx5 Reset Driver
Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/

ChangeLog:
v5:
- cover letter
- updated the description to match the latest data sheets
- basic driver
- added error message in case of reset controller error
- port struct: replacing has_sfp with inband, adding pause_adv
- host mode
- port cleanup: unregisters netdevs and then removes phylink etc
- checking for pause_adv when comparing port config changes
- getting duplex and pause state in the link_up callback.
- getting inband, autoneg and pause_adv config in the pcs_config
callback.
- port
- use only the pause_adv bits when getting aneg status
- use the inband state when updating the PCS and port config
v4:
- basic driver:
Using devm_reset_control_get_optional_shared to get the reset
control, and let the reset framework check if it is valid.
- host mode (phylink):
Use the PCS operations to get state and update configuration.
Removed the setting of interface modes. Let phylink control this.
Using the new 5gbase-r and 25gbase-r modes.
Using a helper function to check if one of the 3 base-r modes has
been selected.
Currently it will not be possible to change the interface mode by
changing the speed (e.g via ethtool). This will be added later.
v3:
- basic driver:
- removed unneeded braces
- release reference to ports node after use
- use dev_err_probe to handle DEFER
- update error value when bailing out (a few cases)
- updated formatting of port struct and grouping of bool values
- simplified the spx5_rmw and spx5_inst_rmw inline functions
- host mode (netdev):
- removed lockless flag
- added port timer init
- host mode (packet - manual injection):
- updated error counters in error situations
- implemented timer handling of watermark threshold: stop and
restart netif queues.
- fixed error message handling (rate limited)
- fixed comment style error
- used DIV_ROUND_UP macro
- removed a debug message for open ports

v2:
- Updated bindings:
- drop minItems for the reg property
- Statistics implementation:
- Reorganized statistics into ethtool groups:
eth-phy, eth-mac, eth-ctrl, rmon
as defined by the IEEE 802.3 categories and RFC 2819.
- The remaining statistics are provided by the classic ethtool
statistics command.
- Hostmode support:
- Removed netdev renaming
- Validate ethernet address in sparx5_set_mac_address()
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

67faf76d

arm64: dts: sparx5: Add the Sparx5 switch node · d0f482bb

Steen Hegelund authored Jun 24, 2021

This provides the configuration for the currently available evaluation
boards PCB134 and PCB135.

The series depends on the following series currently on its way
into the kernel:

- Sparx5 Reset Driver
  Link: https://lore.kernel.org/r/20210416084054.2922327-1-steen.hegelund@microchip.com/Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d0f482bb

net: sparx5: add ethtool configuration and statistics support · af4b1102

Steen Hegelund authored Jun 24, 2021

This adds statistic counters for the network interfaces provided
by the driver.  It also adds CPU port counters (which are not
exposed by ethtool).
This also adds support for configuring the network interface
parameters via ethtool: speed, duplex, aneg etc.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af4b1102

net: sparx5: add calendar bandwidth allocation support · 0a9d48ad

Steen Hegelund authored Jun 24, 2021

This configures the Sparx5 calendars according to the bandwidth
requested in the Device Tree nodes.
It also checks if the total requested bandwidth is within the
specs of the detected Sparx5 models limits.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0a9d48ad

net: sparx5: add switching support · d6fce514

Steen Hegelund authored Jun 24, 2021

This adds SwitchDev support by hardware offloading the
software bridge.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6fce514

net: sparx5: add vlan support · 78eab33b

Steen Hegelund authored Jun 24, 2021

This adds Sparx5 VLAN support.

Sparx5 has more VLAN features than provided here, but these will be added
in later series. For now we only add the basic L2 features.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

78eab33b

net: sparx5: add mactable support · b37a1bae

Steen Hegelund authored Jun 24, 2021

This adds the Sparx5 MAC tables: listening for MAC table updates and
updating on request.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b37a1bae

net: sparx5: add port module support · 946e7fd5

Steen Hegelund authored Jun 24, 2021

This add configuration of the Sparx5 port module instances.

Sparx5 has in total 65 logical ports (denoted D0 to D64) and 33
physical SerDes connections (S0 to S32).  The 65th port (D64) is fixed
allocated to SerDes0 (S0). The remaining 64 ports can in various
multiplexing scenarios be connected to the remaining 32 SerDes using
QSGMII, or USGMII or USXGMII extenders. 32 of the ports can have a 1:1
mapping to the 32 SerDes.

Some additional ports (D65 to D69) are internal to the device and do not
connect to port modules or SerDes macros. For example, internal ports are
used for frame injection and extraction to the CPU queues.

The 65 logical ports are split up into the following blocks.

- 13 x 5G ports (D0-D11, D64)
- 32 x 2G5 ports (D16-D47)
- 12 x 10G ports (D12-D15, D48-D55)
- 8 x 25G ports (D56-D63)

Each logical port supports different line speeds, and depending on the
speeds supported, different port modules (MAC+PCS) are needed. A port
supporting 5 Gbps, 10 Gbps, or 25 Gbps as maximum line speed, will have a
DEV5G, DEV10G, or DEV25G module to support the 5 Gbps, 10 Gbps (incl 5
Gbps), or 25 Gbps (including 10 Gbps and 5 Gbps) speeds. As well as, it
will have a shadow DEV2G5 port module to support the lower speeds
(10/100/1000/2500Mbps). When a port needs to operate at lower speed and the
shadow DEV2G5 needs to be connected to its corresponding SerDes

Not all interface modes are supported in this series, but will be added at
a later stage.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

946e7fd5

net: sparx5: add hostmode with phylink support · f3cad261

Steen Hegelund authored Jun 24, 2021

This patch adds netdevs and phylink support for the ports in the switch.
It also adds register based injection and extraction for these ports.

Frame DMA support for injection and extraction will be added in a later
series.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f3cad261

net: sparx5: add the basic sparx5 driver · 3cfa11ba

Steen Hegelund authored Jun 24, 2021

This adds the Sparx5 basic SwitchDev driver framework with IO range
mapping, switch device detection and core clock configuration.

Support for ports, phylink, netdev, mactable etc. are in the following
patches.
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Reviewed-by: Philipp Zabel <p.zabel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

3cfa11ba

dt-bindings: net: sparx5: Add sparx5-switch bindings · f8c63088

Steen Hegelund authored Jun 24, 2021

Document the Sparx5 switch device driver bindings
Signed-off-by: Steen Hegelund <steen.hegelund@microchip.com>
Signed-off-by: Lars Povlsen <lars.povlsen@microchip.com>
Signed-off-by: Bjarni Jonasson <bjarni.jonasson@microchip.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f8c63088

net: mdiobus: fix fwnode_mdbiobus_register() fallback case · c88c192d

Marcin Wojtas authored Jun 24, 2021

The fallback case of fwnode_mdbiobus_register()
(relevant for !CONFIG_FWNODE_MDIO) was defined with wrong
argument name, causing a compilation error. Fix that.
Signed-off-by: Marcin Wojtas <mw@semihalf.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

c88c192d

net: ip: avoid OOM kills with large UDP sends over loopback · 6d123b81

Jakub Kicinski authored Jun 23, 2021

Dave observed number of machines hitting OOM on the UDP send
path. The workload seems to be sending large UDP packets over
loopback. Since loopback has MTU of 64k kernel will try to
allocate an skb with up to 64k of head space. This has a good
chance of failing under memory pressure. What's worse if
the message length is <32k the allocation may trigger an
OOM killer.

This is entirely avoidable, we can use an skb with page frags.

af_unix solves a similar problem by limiting the head
length to SKB_MAX_ALLOC. This seems like a good and simple
approach. It means that UDP messages > 16kB will now
use fragments if underlying device supports SG, if extra
allocator pressure causes regressions in real workloads
we can switch to trying the large allocation first and
falling back.

v4: pre-calculate all the additions to alloclen so
    we can be sure it won't go over order-2
Reported-by: Dave Jones <dsj@fb.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

6d123b81

tools/testing: add a selftest for SO_NETNS_COOKIE · ae24bab2

Lorenz Bauer authored Jun 23, 2021

Make sure that SO_NETNS_COOKIE returns a non-zero value, and
that sockets from different namespaces have a distinct cookie
value.
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ae24bab2

net: retrieve netns cookie via getsocketopt · e8b9eab9

Martynas Pumputis authored Jun 23, 2021

It's getting more common to run nested container environments for
testing cloud software. One of such examples is Kind [1] which runs a
Kubernetes cluster in Docker containers on a single host. Each container
acts as a Kubernetes node, and thus can run any Pod (aka container)
inside the former. This approach simplifies testing a lot, as it
eliminates complicated VM setups.

Unfortunately, such a setup breaks some functionality when cgroupv2 BPF
programs are used for load-balancing. The load-balancer BPF program
needs to detect whether a request originates from the host netns or a
container netns in order to allow some access, e.g. to a service via a
loopback IP address. Typically, the programs detect this by comparing
netns cookies with the one of the init ns via a call to
bpf_get_netns_cookie(NULL). However, in nested environments the latter
cannot be used given the Kubernetes node's netns is outside the init ns.
To fix this, we need to pass the Kubernetes node netns cookie to the
program in a different way: by extending getsockopt() with a
SO_NETNS_COOKIE option, the orchestrator which runs in the Kubernetes
node netns can retrieve the cookie and pass it to the program instead.

Thus, this is following up on Eric's commit 3d368ab8 ("net:
initialize net->net_cookie at netns setup") to allow retrieval via
SO_NETNS_COOKIE. This is also in line in how we retrieve socket cookie
via SO_COOKIE.

[1] https://kind.sigs.k8s.io/Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Martynas Pumputis <m@lambda.lt>
Cc: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e8b9eab9

23 Jun, 2021 7 commits

Merge branch 'devlink-rate-limit-fixes' · 35713d9b

David S. Miller authored Jun 23, 2021

Dmytro Linkin says:

====================
Fixes for devlink rate objects API

Patch #1 fixes not decreased refcount of parent node for destroyed leaf
object.

Patch #2 fixes incorect eswitch mode check.

Patch #3 protects list traversing with a lock.

====================
Signed-off-by: David S. Miller <davem@davemloft.net>

35713d9b

devlink: Protect rate list with lock while switching modes · a3e5e579

Dmytro Linkin authored Jun 23, 2021

Devlink eswitch set command doesn't hold devlink->lock, which makes
possible race condition between rate list traversing and others devlink
rate KAPI calls, like devlink_rate_nodes_destroy().
Hold devlink lock while traversing the list.

Fixes: a8ecb93e ("devlink: Introduce rate nodes")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3e5e579

devlink: Remove eswitch mode check for mode set call · ff99324d

Dmytro Linkin authored Jun 23, 2021

When eswitch is disabled, querying its current mode results in error.
Due to this when trying to set the eswitch mode for mlx5 devices, it
fails to set the eswitch switchdev mode.
Hence remove such check.

Fixes: a8ecb93e ("devlink: Introduce rate nodes")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ff99324d

devlink: Decrease refcnt of parent rate object on leaf destroy · 1321ed5e

Dmytro Linkin authored Jun 23, 2021

Port functions, like SFs, can be deleted by the user when its leaf rate
object has parent node. In such case node refcnt won't be decreased
which blocks the node from deletion later.
Do simple refcnt decrease, since driver in cleanup stage. This:
1) assumes that driver took proper internal parent unset action;
2) allows to avoid nested callbacks call and deadlock.

Fixes: d7555984 ("devlink: Allow setting parent node of rate objects")
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1321ed5e

virtio_net: Use virtio_find_vqs_ctx() helper · a2f7dc00

Xianting Tian authored Jun 23, 2021

virtio_find_vqs_ctx() is defined but never be called currently,
it is the right place to use it.
Signed-off-by: Xianting Tian <xianting.tian@linux.alibaba.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2f7dc00

net/tls: Remove the __TLS_DEC_STATS() macro. · 10ed7ce4

Kuniyuki Iwashima authored Jun 23, 2021

The commit d26b698d ("net/tls: add skeleton of MIB statistics")
introduced __TLS_DEC_STATS(), but it is not used and __SNMP_DEC_STATS() is
not defined also. Let's remove it.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

10ed7ce4

tcp: Add stats for socket migration. · 55d444b3

Kuniyuki Iwashima authored Jun 23, 2021

This commit adds two stats for the socket migration feature to evaluate the
effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE).

If the migration fails because of the own_req race in receiving ACK and
sending SYN+ACK paths, we do not increment the failure stat. Then another
CPU is responsible for the req.

Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

55d444b3