- 15 Feb, 2021 8 commits
-
-
David S. Miller authored
Michael Chan says: ==================== bnxt_en: Error recovery optimizations. This series implements some optimizations to error recovery. One patch adds an echo/reply mechanism with firmware to enhance error detection. The other patches speed up the recovery process by polling config space earlier and by selectively initializing context memory during re-initialization. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael Chan authored
We currently log the error recovery settings only if the feature is enabled. In some cases, firmware disables error recovery after it was initially enabled. Without logging anything, the user will not be aware of this change in settings. Log it when error recovery is disabled. Also, change the reset count value from hexadecimal to decimal. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Michael Chan authored
This is a new async message that the firmware can send to check if it can communicate with the driver. It is an additional error detection scheme that firmware can use if it suspects errors on the PCIe interface. When the driver receives this async message, it replies, echoing some data from the async message. If the firmware does not get the reply with the proper data after some retries, error recovery kicks in. Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
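The reply path can be pictured with a short sketch. This is a hedged illustration only: the helper names, the HWRM request type and the field names below are illustrative stand-ins, not the exact bnxt_en symbols.

	/* Hedged sketch: echo the firmware's data back so it can confirm
	 * the PCIe path is healthy. All names here are illustrative. */
	static void bnxt_echo_reply(struct bnxt *bp, u32 data1, u32 data2)
	{
		struct hwrm_func_echo_response_input req = {0};

		bnxt_hwrm_cmd_hdr_init(bp, &req, HWRM_FUNC_ECHO_RESPONSE, -1, -1);
		req.event_data1 = cpu_to_le32(data1);	/* echo back verbatim */
		req.event_data2 = cpu_to_le32(data2);
		hwrm_send_message(bp, &req, sizeof(req), HWRM_CMD_TIMEOUT);
	}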
-
Michael Chan authored
If firmware provides the offset to the "context kind" field of the relevant context memory blocks, we'll initialize just that field for each block instead of initializing all of context memory. Populate the bnxt_mem_init structure with the proper offset returned by firmware. If it is older firmware and the information is not available, we set the offset to an invalid value and fall back to the old behavior of initializing every byte. Otherwise, we initialize only the "context kind" byte at the offset. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
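In code, the selection described above reduces to a conditional along these lines (a hedged sketch; the sentinel constant and function name are illustrative, not the exact driver definitions):

	#define BNXT_MEM_INVALID_OFFSET	0xffff	/* illustrative sentinel */

	static void bnxt_init_ctx_block(u8 *block, size_t len,
					u16 offset, u8 init_val)
	{
		if (offset == BNXT_MEM_INVALID_OFFSET)
			memset(block, init_val, len);	/* old firmware: wipe all */
		else
			block[offset] = init_val;	/* just the "context kind" */
	}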
-
Michael Chan authored
Currently, the driver calls memset() to set all relevant context memory used by the chip to the initial value. This can take many milliseconds with the potentially large number of context pages allocated for the chip. To make this faster, we only need to initialize the "context kind" field of each block of context memory. This patch sets up the infrastructure to do that with the bnxt_mem_init structure. In the next patch, we'll add the logic to obtain the offset of the "context kind" from the firmware. This patch is not changing the current behavior of calling memset() to initialize all relevant context memory. Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
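The bookkeeping could look roughly like the sketch below; the real bnxt_mem_init layout may differ in fields and naming:

	/* Hedged sketch of per-context-type init bookkeeping. */
	struct bnxt_mem_init {
		u8	init_val;	/* value the chip expects initially */
		u16	offset;		/* "context kind" byte offset, or invalid */
		u16	size;		/* stride between entries within a page */
	};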
-
Michael Chan authored
During some fatal firmware error conditions, the PCI config space register 0x2e, which normally contains the subsystem ID, will become 0xffff. This register will revert to the normal value after the chip has completed core reset. If we detect this condition, we can poll this config register immediately, waiting for the value to revert. Because we use config read cycles to poll this register, there is no possibility of a Master Abort if we happen to read it during core reset. This speeds up recovery significantly, as we don't have to wait for the conservative min_time before polling MMIO to see if the firmware has come out of reset. As soon as this register changes value, we can proceed to re-initialize the device. Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> Reviewed-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
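The polling loop is simple; here is a hedged sketch using the standard PCI accessors (PCI_SUBSYSTEM_ID is indeed config offset 0x2e, while the timeout and poll interval below are illustrative):

	#include <linux/pci.h>
	#include <linux/delay.h>

	/* Poll the subsystem ID register until it no longer reads 0xffff,
	 * i.e. until the chip has come out of core reset. */
	static int poll_subsys_id_after_reset(struct pci_dev *pdev, int timeout_ms)
	{
		u16 val;

		while (timeout_ms > 0) {
			pci_read_config_word(pdev, PCI_SUBSYSTEM_ID, &val);
			if (val != 0xffff)
				return 0;	/* core reset complete */
			msleep(10);
			timeout_ms -= 10;
		}
		return -ETIMEDOUT;
	}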
-
Edwin Peer authored
Newer devices may have local context memory instead of relying on the host for backing store. In these cases, HWRM_FUNC_BACKING_STORE_QCAPS will return a zero entry size to indicate contexts for which the host should not allocate backing store. Selectively allocate context memory based on device capabilities and only enable backing store for the appropriate contexts. Signed-off-by: Edwin Peer <edwin.peer@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
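The allocation loop then just skips device-backed contexts; a hedged sketch (the structure, loop bound and helper names are illustrative):

	/* A zero entry size from HWRM_FUNC_BACKING_STORE_QCAPS means the
	 * device keeps this context in local memory. */
	for (type = 0; type < BNXT_CTX_MAX; type++) {
		struct bnxt_ctx_mem *ctxm = &ctx->types[type];

		if (!ctxm->entry_size)
			continue;	/* device-local: no host backing store */
		rc = bnxt_alloc_backing_store(bp, ctxm);
		if (rc)
			return rc;
	}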
-
Michael Chan authored
The main changes are the echo request/response from firmware for error detection and the NO_FCS feature to transmit frames without FCS. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 13 Feb, 2021 32 commits
-
-
David S. Miller authored
Alexander Lobakin says: ==================== skbuff: introduce skbuff_heads bulking and reusing

Currently, all sorts of skb allocation always allocate skbuff_heads one by one via kmem_cache_alloc(). On the other hand, we have a percpu napi_alloc_cache to store skbuff_heads queued up for freeing and flush them in bulk. We can use this cache not only for bulk-wiping, but also to obtain heads for new skbs and avoid unconditional allocations, as well as for bulk-allocating (as XDP's cpumap code and the veth driver already do).

As this might affect latencies, cache pressure and lots of hardware- and driver-dependent behavior, this new feature is mostly optional and can be used via:
 - a new napi_build_skb() function (as a replacement for build_skb());
 - the existing {,__}napi_alloc_skb() and napi_get_frags() functions;
 - __alloc_skb() with SKB_ALLOC_NAPI passed in flags.

iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing VLAN NAT on a 1.2 GHz MIPS board. The boost is likely to be bigger on more powerful hosts and NICs with tens of Mpps.

Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs:
 - kmalloc()/kmem_cache_alloc() itself allows by default allocating memory from remote nodes to defragment their slabs. This is controlled by a sysctl, so by that logic a skbuff_head from a remote node is an OK case;
 - the easiest way to check whether the slab of a skbuff_head is remote or pfmemalloc'ed is:

	if (!dev_page_is_reusable(virt_to_head_page(skb)))
		/* drop it */;

   ...*but*, given that most slabs are built of compound pages, virt_to_head_page() will hit the unlikely branch on every single call. This check cost at least 20 Mbps in test scenarios, so it seems better _not_ to do it.

Since v5 [4]:
 - revert the flags-to-bool conversion and simplify flags testing in __alloc_skb() (Alexander Duyck).

Since v4 [3]:
 - rebase on top of net-next and address a kernel build robot issue;
 - reorder checks a bit in __alloc_skb() to make the new condition even more harmless.

Since v3 [2]:
 - make the feature mostly optional, so driver developers can decide whether to use it or not (Paolo Abeni). This reuses the old flag for __alloc_skb() and introduces a new napi_build_skb();
 - reduce the bulk-allocation size from 32 to 16 elements (also Paolo). This equals the value used by XDP's devmap and veth batch processing (which were tested a lot) and should be sane enough;
 - don't waste cycles on an explicit in_serving_softirq() check.

Since v2 [1]:
 - also cover the {,__}alloc_skb() and {,__}build_skb() cases (became handy after the changes that pass tiny skb requests to the kmalloc layer);
 - cover the cache with KASAN instrumentation (suggested by Eric Dumazet, help from Dmitry Vyukov);
 - completely drop the redundant __kfree_skb_flush() (also Eric);
 - lots of code cleanups;
 - expand the commit message with NUMA and pfmemalloc points (Jakub).

Since v1 [0]:
 - use one unified cache instead of two separate ones to greatly simplify the logic and reduce hotpath overhead (Edward Cree);
 - new: also recycle GRO_MERGED_FREE skbs instead of freeing them immediately;
 - correct performance numbers after optimizations and performing lots of tests for different use cases.
[0] https://lore.kernel.org/netdev/20210111182655.12159-1-alobakin@pm.me [1] https://lore.kernel.org/netdev/20210113133523.39205-1-alobakin@pm.me [2] https://lore.kernel.org/netdev/20210209204533.327360-1-alobakin@pm.me [3] https://lore.kernel.org/netdev/20210210162732.80467-1-alobakin@pm.me [4] https://lore.kernel.org/netdev/20210211185220.9753-1-alobakin@pm.me ==================== Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
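For driver authors, the headline opt-in is napi_build_skb(), a drop-in replacement for build_skb() on the NAPI Rx path. A hedged usage sketch (the ring and buffer details are illustrative):

	/* Inside a NAPI poll handler, where build_skb() used to be called. */
	skb = napi_build_skb(rx_buf->data, truesize);
	if (unlikely(!skb)) {
		/* recycle the buffer and count the allocation failure */
		break;
	}
	skb_reserve(skb, headroom);
	skb_put(skb, len);
	napi_gro_receive(napi, skb);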
-
Alexander Lobakin authored
napi_frags_finish() and napi_skb_finish() can only be called inside the NAPI Rx context, so we can feed the NAPI cache with skbuff_heads that got the GRO_MERGED_FREE verdict instead of freeing them immediately. Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish() and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs into the NAPI cache. As many drivers call napi_alloc_skb()/napi_get_frags() on their receive path, this becomes especially useful. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
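The resulting GRO completion path looks roughly like this (a hedged sketch of the napi_skb_finish() case in question; the surrounding switch is abridged):

	case GRO_MERGED_FREE:
		if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
			napi_skb_free_stolen_head(skb);	/* head to NAPI cache */
		else
			__kfree_skb_defer(skb);		/* defer to NAPI cache */
		break;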
-
Alexander Lobakin authored
{,__}napi_alloc_skb() is mostly used for optional non-linear receive methods (usually controlled via Ethtool private flags and off by default) and/or for Rx copybreaks. Use __napi_build_skb() here to obtain skbuff_heads from the NAPI cache instead of allocating them in place. This covers both the kmalloc and the page frag paths. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Lobakin authored
Reuse the old and forgotten SKB_ALLOC_NAPI flag to add an option to get a skbuff_head from the NAPI cache instead of an in-place allocation inside __alloc_skb(). This implies that the function is called from softirq or BH-off context, and not for allocating a clone or from a distant node. Cc: Alexander Duyck <alexander.duyck@gmail.com> # Simplified flags check Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
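The gist of the test is sketched below; napi_skb_cache_get() is the cache getter introduced earlier in this series, and the exact condition in __alloc_skb() may differ slightly:

	/* Take a head from the NAPI percpu cache only when the caller asked
	 * for it, isn't allocating a clone, and stays on the local node. */
	if ((flags & (SKB_ALLOC_FCLONE | SKB_ALLOC_NAPI)) == SKB_ALLOC_NAPI &&
	    likely(node == NUMA_NO_NODE || node == numa_mem_id()))
		skb = napi_skb_cache_get();
	else
		skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);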
-
Alexander Lobakin authored
Instead of just bulk-flushing skbuff_heads queued up through napi_consume_skb() or __kfree_skb_defer(), try to reuse them on the allocation path. If the cache is empty on allocation, bulk-allocate the first 16 elements, which is more efficient than per-skb allocation. If the cache is full on freeing, bulk-wipe the second half of the cache (32 elements). This also includes custom KASAN poisoning/unpoisoning to be doubly sure there are no use-after-free cases. To not change current behaviour, introduce a new function, napi_build_skb(), to optionally use the new approach later in drivers.

Note on the selected bulk size, 16:
 - it equals XDP_BULK_QUEUE_SIZE, DEV_MAP_BULK_SIZE and especially VETH_XDP_BATCH, which is also used to bulk-allocate skbuff_heads and was tested on powerful setups;
 - it also showed the best performance in the actual test series (from the array of {8, 16, 32}).

Suggested-by: Edward Cree <ecree.xilinx@gmail.com> # Divide on two halves Suggested-by: Eric Dumazet <edumazet@google.com> # KASAN poisoning Cc: Dmitry Vyukov <dvyukov@google.com> # Help with KASAN Cc: Paolo Abeni <pabeni@redhat.com> # Reduced batch size Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
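A hedged sketch of the allocation side of the cache; the in-tree version differs in details (per-element KASAN handling, exact constants):

	#define NAPI_SKB_CACHE_BULK	16	/* refill batch, as discussed */

	static struct sk_buff *napi_skb_cache_get(void)
	{
		struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
		struct sk_buff *skb;

		/* Empty cache: bulk-allocate 16 heads in one slab call. */
		if (unlikely(!nc->skb_count))
			nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
							      GFP_ATOMIC,
							      NAPI_SKB_CACHE_BULK,
							      nc->skb_cache);
		if (unlikely(!nc->skb_count))
			return NULL;

		skb = nc->skb_cache[--nc->skb_count];
		kasan_unpoison_object_data(skbuff_head_cache, skb);

		return skb;
	}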
-
Alexander Lobakin authored
NAPI cache structures will be used for allocating skbuff_heads, so move their declarations a bit upper. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Lobakin authored
This function isn't really needed, as the NAPI skb queue gets bulk-freed anyway when there's no more room, and it may even reduce the efficiency of bulk operations. It will be needed even less after reusing the skb cache on the allocation path, so remove it and thereby lighten network softirqs a bit. Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Lobakin authored
Just call __build_skb_around() instead of open-coding it. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Lobakin authored
Use unlikely() annotations for skbuff_head and data similarly to the two other allocation functions, and remove the totally redundant goto. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Lobakin authored
__build_skb_around() can never fail and always returns passed skb. Make it return void to simplify and optimize the code. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Lobakin authored
Ever since the introduction of __kmalloc_reserve(), the "ip" argument hasn't been used. _RET_IP_ is embedded inside kmalloc_node_track_caller(). Remove the redundant macro and rename the function to take its place. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexander Lobakin authored
In preparation before reusing several functions in all three skb allocation variants, move __alloc_skb() next to the __netdev_alloc_skb() and __napi_alloc_skb(). No functional changes. Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Robert Hancock says: ==================== Xilinx axienet updates Updates to the Xilinx AXI Ethernet driver to add support for an additional ethtool operation, and to support dynamic switching between 1000BaseX and SGMII interface modes. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Robert Hancock authored
Newer versions of the Xilinx AXI Ethernet core (specifically version 7.2 or later) allow the core to be configured with a PHY interface mode of "Both", allowing either 1000BaseX or SGMII modes to be selected at runtime. Add support for this in the driver to allow better support for applications which can use both fiber and copper SFP modules. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Robert Hancock authored
Document the new xlnx,switch-x-sgmii attribute which is used to indicate that the Ethernet core supports dynamic switching between 1000BaseX and SGMII. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Robert Hancock authored
Hook up the nway_reset ethtool operation to the corresponding phylink function so that "ethtool -r" can be supported. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
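The hookup itself is a one-liner through phylink; a hedged sketch (the private-struct and ops-table names are illustrative):

	static int axienet_ethtools_nway_reset(struct net_device *dev)
	{
		struct axienet_local *lp = netdev_priv(dev);

		/* Restart autonegotiation via phylink ("ethtool -r"). */
		return phylink_ethtool_nway_reset(lp->phylink);
	}

	static const struct ethtool_ops axienet_ethtool_ops = {
		/* ... */
		.nway_reset = axienet_ethtools_nway_reset,
	};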
-
David S. Miller authored
Eric Dumazet says: ==================== tcp: mem pressure vs SO_RCVLOWAT First patch fixes an issue for applications using SO_RCVLOWAT to reduce context switches. Second patch is a cleanup. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Both tcp_data_ready() and tcp_stream_is_readable() share the same logic. Add tcp_epollin_ready() helper to avoid duplication. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Arjun Roy <arjunroy@google.com> Cc: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
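A hedged sketch of the factored-out helper, close to the shape described above (details may differ from the in-tree version):

	static inline bool tcp_epollin_ready(const struct sock *sk, int target)
	{
		const struct tcp_sock *tp = tcp_sk(sk);
		int avail = (int)(READ_ONCE(tp->rcv_nxt) -
				  READ_ONCE(tp->copied_seq));

		if (avail <= 0)
			return false;

		/* Readable if enough bytes are queued, or under rmem
		 * pressure, or when the receive window has collapsed. */
		return (avail >= target) || tcp_rmem_pressure(sk) ||
		       (tcp_receive_window(tp) <= inet_csk(sk)->icsk_ack.rcv_mss);
	}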
-
Eric Dumazet authored
While commit 24adbc16 ("tcp: fix SO_RCVLOWAT hangs with fat skbs") fixed an issue with too small a sk_rcvbuf for a given sk_rcvlowat constraint, it did not address the issue caused by memory pressure:

1) If we are under memory pressure and the socket receive queue is empty, the first incoming packet is allowed to be queued, after commit 76dfa608 ("tcp: allow one skb to be received per socket under memory pressure"). But we do not send EPOLLIN yet, in case tcp_data_ready() sees that sk_rcvlowat is bigger than the skb length.

2) Then, when the next packet comes, it is dropped, and we directly call sk->sk_data_ready().

3) If the application is using poll(), tcp_poll() will then use tcp_stream_is_readable() and decide the socket receive queue is not yet filled, so nothing will happen.

Even when the sender retransmits packets, phases 2) and 3) repeat and the flow is effectively frozen until memory pressure is off. The fix is to consider tcp_under_memory_pressure() to take care of global memory pressure or memcg pressure.

Fixes: 24adbc16 ("tcp: fix SO_RCVLOWAT hangs with fat skbs") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Arjun Roy <arjunroy@google.com> Suggested-by: Wei Wang <weiwan@google.com> Reviewed-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
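The essence of the fix can be sketched as a short-circuit in the rmem-pressure test (a hedged sketch, not the exact diff):

	static inline bool tcp_rmem_pressure(const struct sock *sk)
	{
		int rcvbuf, threshold;

		/* Under global or memcg memory pressure, report pressure so a
		 * single queued skb can satisfy a large SO_RCVLOWAT target. */
		if (tcp_under_memory_pressure(sk))
			return true;

		rcvbuf = READ_ONCE(sk->sk_rcvbuf);
		threshold = rcvbuf - (rcvbuf >> 3);

		return atomic_read(&sk->sk_rmem_alloc) > threshold;
	}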
-
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
David S. Miller authored
Tony Nguyen says: ==================== 40GbE Intel Wired LAN Driver Updates 2021-02-12 This series contains updates to i40e, ice, and ixgbe drivers. Maciej does cleanups on the following drivers. For i40e, removes redundant check for XDP prog, cleans up no longer relevant information, and removes an unused function argument. For ice, removes local variable use, instead returning values directly. Moves skb pointer from buffer to ring and removes an unneeded check for xdp_prog in zero copy path. Also removes a redundant MTU check when changing it. For i40e, ice, and ixgbe, stores the rx_offset in the Rx ring as the value is constant so there's no need for continual calls. Bjorn folds a decrement into a while statement. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Guillaume Nault says: ==================== selftests: tc: Test tc-flower's MPLS features A couple of patches for exercising the MPLS filters of tc-flower. Patch 1 tests basic MPLS matching features: those that only work on the first label stack entry (that is, the mpls_label, mpls_tc, mpls_bos and mpls_ttl options). Patch 2 tests the more generic "mpls" and "lse" options, which allow matching MPLS fields beyond the first stack entry. In both patches, special care is taken to skip these new tests for incompatible versions of tc. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Add tests in tc_flower.sh for generic matching on MPLS Label Stack Entries. The label, tc, bos and ttl fields are tested for the first and second labels. For each field, the minimal and maximal values are tested (the former at depth 1 and the latter at depth 2). There are also tests for matching the presence of a label stack entry at a given depth. In order to reduce the amount of code, all "lse" subcommands are tested in match_mpls_lse_test(). Action "continue" is used, so that test packets are evaluated by all filters. Then, we can verify whether each filter matched the expected number of packets. Some versions of tc-flower produced invalid json output when dumping MPLS filters with depth > 1. Skip the test if tc isn't recent enough. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Guillaume Nault authored
Add tests in tc_flower.sh for mpls_label, mpls_tc, mpls_bos and mpls_ttl. For each keyword, test the minimal and maximal values. Selectively skip these new mpls tests for tc versions that don't support them. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Vladimir Oltean says: ==================== Cleanup in brport flags switchdev offload for DSA The initial goal of this series was to have better support for standalone ports mode on the DSA drivers like ocelot/felix and sja1105. This turned out to require some API adjustments in both directions: to the information presented to and by the switchdev notifier, and to the API presented to the switch drivers by the DSA layer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
The chip can configure unicast flooding, broadcast flooding and learning. Learning is per port, while flooding is per {ingress, egress} port pair, and we need to configure the same value for all possible ingress ports towards the requested one.

While multicast flooding is not officially supported, we can hack it by using a feature of the second generation (P/Q/R/S) devices: FDB entries are maskable, and multicast addresses always have an odd first octet. So by putting a match-all entry for the 00:01:00:00:00:00 address and 00:01:00:00:00:00 mask at the end of the FDB, we make sure that it is always checked last and never takes precedence over any other MDB entry. So it behaves effectively as an unknown multicast entry.

For the first generation switches, this feature is not available, so unknown multicast will always be treated the same as unknown unicast. So the only thing we can do is request the user to offload the settings for these 2 flags in tandem, i.e.:

ip link set swp2 type bridge_slave flood off
Error: sja1105: This chip cannot configure multicast flooding independently of unicast.

ip link set swp2 type bridge_slave flood off mcast_flood off

ip link set swp2 type bridge_slave mcast_flood on
Error: sja1105: This chip cannot configure multicast flooding independently of unicast.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
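The tandem requirement translates to an early sanity check in the pre-commit stage, roughly like this hedged sketch (using the switchdev_brport_flags form this series introduces; field names such as can_limit_mcast_flood are illustrative):

	if (flags.mask & (BR_FLOOD | BR_MCAST_FLOOD) &&
	    !priv->info->can_limit_mcast_flood) {	/* first-gen switch */
		bool unicast = !!(flags.val & BR_FLOOD);
		bool multicast = !!(flags.val & BR_MCAST_FLOOD);

		if (unicast != multicast) {
			NL_SET_ERR_MSG_MOD(extack,
					   "This chip cannot configure multicast flooding independently of unicast");
			return -EINVAL;
		}
	}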
-
Vladimir Oltean authored
We should not be unconditionally enabling address learning, since doing so is actively detrimental when a port is standalone and not offloading a bridge. Namely, if one port in the switch is standalone and others are offloading the bridge, then we could enter a situation where we learn an address towards the standalone port, but the bridged ports could not forward the packet there, because the CPU is the only path between the standalone and the bridged ports. The solution, of course, is to not enable address learning unless the bridge asks for it. We need to set up the initial port flags for no learning and flooding everything, and also to update them when the port joins and leaves the bridge. The flood configuration was already set up correctly for standalone mode in ocelot_init; we just need to disable learning in ocelot_init_port. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
In preparation of offloading the bridge port flags which have independent settings for unknown multicast and for broadcast, we should also start reserving one destination Port Group ID for the flooding of broadcast packets, to allow configuring it individually. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
ocelot_init sets up PGID_MC to include the CPU port module, and that is fine, but the ocelot-8021q tagger removes the CPU port module from the unknown multicast replicator. So after a transition from the default ocelot tagger towards ocelot-8021q and then again towards ocelot, multicast flooding towards the CPU port module will be disabled. Fixes: e21268ef ("net: dsa: felix: perform switch setup for tag_8021q") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Vladimir Oltean authored
There are multiple ways in which a PORT_BRIDGE_FLAGS attribute can be expressed by the bridge through switchdev, and not all of them can be emulated by the DSA mid-layer API at the same time.

One possible configuration is when the bridge offloads the port flags using a mask that has a single bit set - therefore only one feature should change. However, DSA currently groups together unicast and multicast flooding in the .port_egress_floods method, which limits our options when we try to add support for turning off broadcast flooding: do we extend .port_egress_floods with a third parameter which b53 and mv88e6xxx will ignore? But that means that the DSA layer, which currently implements the PRE_BRIDGE_FLAGS attribute all by itself, will see that .port_egress_floods is implemented, and will report that all 3 types of flooding are supported - which is not necessarily true.

Another configuration is when the user specifies more than one flag at the same time, in the same netlink message. If we were to create one individual function per offloadable bridge port flag, we would limit the switch driver's ability to refuse certain combinations of flag values. For example, a switch may not have an explicit knob for flooding of unknown multicast, just for flooding in general. In that case, the only correct thing to do is to allow changes to BR_FLOOD and BR_MCAST_FLOOD in tandem, and never allow mismatched values. But having a separate .port_set_unicast_flood and .port_set_multicast_flood would not allow the driver to reject that.

Also, DSA doesn't consider it necessary to inform the driver that a SWITCHDEV_ATTR_ID_BRIDGE_MROUTER attribute was offloaded, because it just calls .port_egress_floods for the CPU port. When we add support for the plain SWITCHDEV_ATTR_ID_PORT_MROUTER, that will become a real problem, because the flood settings will need to be held statefully in the DSA middle layer; otherwise changing the mrouter port attribute will impact the flooding attribute. And that's _assuming_ the underlying hardware doesn't have anything else to do when a multicast router attaches to a port than flood unknown traffic to it. If it does, there will need to be a dedicated .port_set_mrouter anyway.

So we need to let the DSA drivers see the exact form in which the bridge passes this switchdev attribute, otherwise we are standing in the way. Therefore we also need to use this form of language when communicating to the driver that it needs to configure its initial (before bridge join) and final (after bridge leave) port flags.

The b53 and mv88e6xxx drivers are converted to the passthrough API and their implementation of .port_egress_floods is split in two: a function that configures unicast flooding and another for multicast. The mv88e6xxx implementation is quite hairy, and it turns out that the implementations of unknown unicast flooding are actually the same for 6185 and 6352: behind the confusing names actually lie two individual bits:

NO_UNKNOWN_MC -> FLOOD_UC = 0x4 = BIT(2)
NO_UNKNOWN_UC -> FLOOD_MC = 0x8 = BIT(3)

so there was no reason to entangle them in the first place. Whereas the 6185 writes to MV88E6185_PORT_CTL0_FORWARD_UNKNOWN of PORT_CTL0, which has the exact same bit index. I have left the implementations separate though, for the only reason that the names are different enough to confuse me, since I am not able to double-check with a user manual. The multicast flooding setting for 6185 is in a different register than for 6352 though.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
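The resulting DSA driver API is a passthrough pair that mirrors the switchdev attribute, roughly of this shape:

	struct dsa_switch_ops {
		/* ... */
		int (*port_pre_bridge_flags)(struct dsa_switch *ds, int port,
					     struct switchdev_brport_flags flags,
					     struct netlink_ext_ack *extack);
		int (*port_bridge_flags)(struct dsa_switch *ds, int port,
					 struct switchdev_brport_flags flags,
					 struct netlink_ext_ack *extack);
	};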
-
Vladimir Oltean authored
This switchdev attribute offers a counterproductive API for a driver writer, because although br_switchdev_set_port_flag gets passed a "flags" and a "mask", those are passed piecemeal to the driver, so while the PRE_BRIDGE_FLAGS listener knows what changed because it has the "mask", the BRIDGE_FLAGS listener doesn't, because it only has the final value. But certain drivers can offload only certain combinations of settings, like for example they cannot change unicast flooding independently of multicast flooding - they must be both on or both off. The way the information is passed to switchdev makes drivers not expressive enough, and unable to reject this request ahead of time, in the PRE_BRIDGE_FLAGS notifier, so they are forced to reject it during the deferred BRIDGE_FLAGS attribute, where the rejection is currently ignored. This patch also changes drivers to make use of the "mask" field for edge detection when possible. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
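Concretely, the attribute now carries both the requested values and which bits changed, and drivers can then do edge detection. A hedged sketch (the struct matches the form added to net/switchdev.h by this change; the setter helpers are hypothetical):

	struct switchdev_brport_flags {
		unsigned long val;	/* requested flag values */
		unsigned long mask;	/* which flags actually changed */
	};

	/* Driver side: act only on bits that changed. */
	if (flags.mask & BR_LEARNING)
		port_set_learning(priv, port, !!(flags.val & BR_LEARNING));
	if (flags.mask & BR_FLOOD)
		port_set_ucast_flood(priv, port, !!(flags.val & BR_FLOOD));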
-
Vladimir Oltean authored
For a DSA switch port operating in standalone mode, address learning doesn't make much sense since that is a bridge function. In fact, address learning even breaks setups such as this one:

+---------------------------------------------+
|                                             |
| +-------------------+                       |
| |        br0        |   send      receive   |
| +--------+-+--------+ +--------+ +--------+ |
| |        | |        | |        | |        | |
| |  swp0  | |  swp1  | |  swp2  | |  swp3  | |
| |        | |        | |        | |        | |
+-+--------+-+--------+-+--------+-+--------+-+
     |          ^           |          ^
     |          |           |          |
     |          +-----------+          |
     |                                 |
     +---------------------------------+

because if the switch has a single FDB (can offload a single bridge), then source address learning on swp3 can "steal" the source MAC address of swp2 from br0's FDB, because learning frames coming from swp2 will be done twice: first on the swp1 ingress port, second on the swp3 ingress port. So the hardware FDB will become out of sync with the software bridge, and when swp2 tries to send one more packet towards swp1, the ASIC will attempt to short-circuit the forwarding path and send it directly to swp3 (since that's the last port it learned that address on), which it obviously can't, because swp3 operates in standalone mode.

So DSA drivers operating in standalone mode should still configure a list of bridge port flags even while standalone. Currently DSA attempts to call dsa_port_bridge_flags with 0, which disables egress flooding of unknown unicast and multicast, something which doesn't make much sense. For the switches that implement .port_egress_floods - b53 and mv88e6xxx - it probably doesn't matter too much, since they can possibly inject traffic from the CPU into a standalone port regardless of MAC DA, even if egress flooding is turned off for that port, but certainly not all DSA switches can do that - sja1105, for example, can't. So it makes sense to use a better common default there, such as "flood everything".

It should also be noted that what DSA calls "dsa_port_bridge_flags()" is a degenerate name for just calling .port_egress_floods(), since nothing else is implemented - not learning, in particular. But disabling address learning, something that this patch is also setting the stage for, will be supported by individual drivers once .port_egress_floods is replaced with a more generic .port_bridge_flags.

Previous attempts to code up this logic have been in the common bridge layer, but as pointed out by Ido Schimmel, there are corner cases that are missed when doing that: https://patchwork.kernel.org/project/netdevbpf/patch/20210209151936.97382-5-olteanv@gmail.com/ So, at least for now, let's leave DSA in charge of setting port flags before and after the bridge join and leave.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
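A hedged sketch of the defaults argued for above, using the flag form this series converges on (standalone: learning off, flood everything; the exact call site and signature are illustrative):

	struct switchdev_brport_flags flags = {
		.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD,
		/* standalone defaults: no learning, flood all unknown traffic */
		.val  = BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD,
	};

	dsa_port_bridge_flags(dp, flags, NULL);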
-
Vladimir Oltean authored
For the netlink interface, propagate errors through extack rather than simply printing them to the console. For the sysfs interface, we still print to the console, but at least that's one layer higher than in switchdev, which also allows us to silently ignore the offloading of flags if that is ever needed in the future. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-