Commits · c2273219baa5097a4d7c1c162b992623534f34c1 · Kirill Smelkov / linux

23 Apr, 2019 6 commits

net/mlx5e: XDP, Inline small packets into the TX MPWQE in XDP xmit flow · c2273219

Shay Agroskin authored Mar 14, 2019

Upon high packet rate with multiple CPUs TX workloads, much of the HCA's
resources are spent on prefetching TX descriptors, thus affecting
transmission rates.
This patch comes to mitigate this problem by moving some workload to the
CPU and reducing the HW data prefetch overhead for small packets (<= 256B).

When forwarding packets with XDP, a packet that is smaller
than a certain size (set to ~256 bytes) would be sent inline within
its WQE TX descrptor (mem-copied), when the hardware tx queue is congested
beyond a pre-defined water-mark.

This is added to better utilize the HW resources (which now makes
one less packet data prefetch) and allow better scalability, on the
account of CPU usage (which now 'memcpy's the packet into the WQE).

To load balance between HW and CPU and get max packet rate, we use
watermarks to detect how much the HW is congested and move the work
loads back and forth between HW and CPU.

Performance:
Tested packet rate for UDP 64Byte multi-stream
over two dual port ConnectX-5 100Gbps NICs.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

* Tested with hyper-threading disabled

XDP_TX:

|          | before | after   |       |
| 24 rings | 51Mpps | 116Mpps | +126% |
| 1 ring   | 12Mpps | 12Mpps  | same  |

XDP_REDIRECT:

** Below is the transmit rate, not the redirection rate
which might be larger, and is not affected by this patch.

|          | before  | after   |      |
| 32 rings | 64Mpps  | 92Mpps  | +43% |
| 1 ring   | 6.4Mpps | 6.4Mpps | same |

As we can see, feature significantly improves scaling, without
hurting single ring performance.
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

c2273219

net/mlx5e: XDP, Add TX MPWQE session counter · 73cab880

Shay Agroskin authored Feb 25, 2019

This counter tracks how many TX MPWQE sessions are started in XDP SQ
in XDP TX/REDIRECT flow. It counts per-channel and global stats.
Signed-off-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

73cab880

net/mlx5e: XDP, Enhance RQ indication for XDP redirect flush · 15143bf5

Tariq Toukan authored Mar 10, 2019

The XDP redirect flush indication belongs to the receive queue,
not to its XDP send queue.

For this, use a new bit on rq->flags.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

15143bf5

net/mlx5e: XDP, Fix shifted flag index in RQ bitmap · f03590f7

Tariq Toukan authored Mar 10, 2019

Values in enum mlx5e_rq_flag are used as bit indixes.
Intention was to use them with no BIT(i) wrapping.

No functional bug fix here, as the same (shifted)flag bit
is used for all set, test, and clear operations.

Fixes: 121e8927 ("net/mlx5e: Refactor RQ XDP_TX indication")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Shay Agroskin <shayag@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

f03590f7

net/mlx5e: RX, Support multiple outstanding UMR posts · fd9b4be8

Tariq Toukan authored Feb 27, 2019

The buffers mapping of the Multi-Packet WQEs (of Striding RQ)
is done via UMR posts, one UMR WQE per an RX MPWQE.

A single MPWQE is capable of serving many incoming packets,
usually larger than the budget of a single napi cycle.
Hence, posting a single UMR WQE per napi cycle (and handling its
completion in the next cycle) works fine in many common cases,
but not always.

When an XDP program is loaded, every MPWQE is capable of serving less
packets, to satisfy the packet-per-page requirement.
Thus, for the same number of packets more MPWQEs (and UMR posts)
are needed (twice as much for the default MTU), giving less latency
room for the UMR completions.

In this patch, we add support for multiple outstanding UMR posts,
to allow faster gap closure between consuming MPWQEs and reposting
them back into the WQ.

For better SW and HW locality, we combine the UMR posts in bulks of
(at least) two.

This is expected to improve packet rate in high CPU scale.

Performance test:
As expected, huge improvement in large-scale (48 cores).

xdp_redirect_map, 64B UDP multi-stream.
Redirect from ConnectX-5 100Gbps to ConnectX-6 100Gbps.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.

Before: Unstable, 7 to 30 Mpps
After:  Stable,   at 70.5 Mpps

No degradation in other tested scenarios.
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

fd9b4be8

Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux · 3839f99d
Saeed Mahameed authored Apr 23, 2019

3839f99d

22 Apr, 2019 2 commits

Merge tag 'v5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux into mlx5-next · c3bdd5e6

Saeed Mahameed authored Apr 22, 2019

Linux 5.1-rc1

We forgot to reset the branch last merge window thus mlx5-next is outdated
and still based on 5.0-rc2. This merge commit is needed to sync mlx5-next
branch with 5.1-rc1.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

c3bdd5e6

ipv6: Restore RTF_ADDRCONF check in rt6_qualify_for_ecmp · be659b8d

David Ahern authored Apr 21, 2019

The RTF_ADDRCONF flag filters out routes added by RA's in determining
which routes can be appended to an existing one to create a multipath
route. Restore the flag check and add a comment to document the RA piece.

Fixes: 4e54507a ("ipv6: Simplify rt6_qualify_for_ecmp")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

be659b8d

21 Apr, 2019 7 commits

ipv6: Simplify rt6_qualify_for_ecmp · 4e54507a

David Ahern authored Apr 21, 2019

After commit c7a1ce39 ("ipv6: Change addrconf_f6i_alloc to use
ip6_route_info_create"), the gateway is no longer filled in for fib6_nh
structs in a prefix route. Accordingly, the RTF_ADDRCONF flag check can
be dropped from the 'rt6_qualify_for_ecmp'.

Further, RTF_DYNAMIC is only set in rt6_info instances, so it can be
removed from the check as well.

This reduces rt6_qualify_for_ecmp and the mlxsw version to just checking
if the nexthop has a gateway which is the real indication of whether
entries can be coalesced into a multipath route.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4e54507a

net: hippi:Fix misuse of %x in rrunner.c · fa8b9e8b

Fuqian Huang authored Apr 21, 2019

The pointer should be printed with %p or %px rather than
cast to unsigned long type and printed with %08lx.
Change %08lx to %p to print the pointer.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa8b9e8b

net: ax25: fix misuse of %x · 966cddef

Fuqian Huang authored Apr 21, 2019

Pointers should be printed with %p or %px rather than
cast to long type and printed with %8.8lx.
Change %8.8lx to %p to print the pointer.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

966cddef

atm: iphase: fix misuse of %x · 05453ead

Fuqian Huang authored Apr 21, 2019

Pointers should be printed with %p or %px rather than
cast to long type and printed with %x.
Change %x to %p to print the pointers.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05453ead

Merge branch 'mlxsw-Small-routing-improvements' · f9e0d65b

David S. Miller authored Apr 21, 2019

Ido Schimmel says:

====================
mlxsw: Small routing improvements

Patch #1 switches the driver to use a unique and stable ECMP/LAG seed.
This allows for consistent behavior across reboots and avoids hash
polarization at the same time.

Patch #2 relaxes the FIB rule validation in the driver to allow the
installation of rules that direct locally generated traffic (iif=lo).
This does not result in a discrepancy between both data paths because
packets received by the device would never match such rules.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f9e0d65b

mlxsw: spectrum_router: Relax FIB rule validation · 05414dd1

Ido Schimmel authored Apr 21, 2019

Currently, mlxsw does not support policy-based routing (PBR) and
therefore forbids the installation of non-default FIB rules except for
the l3mdev rule which is used for VRFs.

Relax the check to allow the installation of FIB rules that would never
match packets received by the device. Specifically, if the iif is that
of the loopback netdev. This is useful for users that need to redirect
locally generated packets based on FIB rules.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Tested-by: Alexander Petrovskiy <alexpe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

05414dd1

mlxsw: spectrum: Use a stable ECMP/LAG seed · fa73989f

Ido Schimmel authored Apr 21, 2019

In order to get a consistent behavior of traffic flows across reboots /
module unload, we need to use the same ECMP/LAG seed.

Calculate the seed by hashing the base MAC of the device. This results
in a seed that is both unique (to avoid polarization) and consistent.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa73989f

20 Apr, 2019 14 commits

nfp: add SR-IOV trusted VF support · 4ef6cbe8

Pablo Cascón authored Apr 19, 2019

By default VFs are not trusted. Add ndo_set_vf_trust support to toggle
a new per-VF bit. Coupled with FW with this capability allows a
trusted VF to change its MAC even after being administratively set by
the PF. Also populate the trusted field on ndo_get_vf_config. Add the
same ndo to the representors.
Signed-off-by: Pablo Cascón <pablo.cascon@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4ef6cbe8

Merge branch 'hns3-next' · 5313794b

David S. Miller authored Apr 19, 2019

Huazhong Tan says:

====================
net: hns3: add some DFX info for HNS3 driver

This patch-set adds some DFX information to HNS3 driver, for easily
debug some problems, and fixes some related bugs.

[patch 1/12 - 4/12] adds debug info about reset & interrupt events

[patch 5/12 - 7/12] adds debug info about TX time out & fixes
related bugs

[patch 8/12] adds support for setting netif message level

[patch 9/12 - 10/12] adds debugfs command to dump NCL & MAC info

[patch 11/12] adds VF's queue statistics info updating

[patch 12/12] adds a check for debugfs help function to decide which
commands are supportable
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5313794b

net: hns3: add function type check for debugfs help information · 97afd47b

Yufeng Mo authored Apr 19, 2019

PF supports all debugfs command, but VF only supports part of
debugfs command. So VF should not show unsupported help information.

This patch adds a check for PF and PF to show the supportable help
information.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

97afd47b

net: hns3: add queue's statistics update to service task · db01afeb

liuzhongzhu authored Apr 19, 2019

This patch updates VF's TQP statistic info in the service task,
and adds a limitation to prevent update too frequently.
Signed-off-by: liuzhongzhu <liuzhongzhu@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

db01afeb

net: hns3: Add handling of MAC tunnel interruption · a6345787

Weihang Li authored Apr 19, 2019

MAC tnl interruptions are different from other type of RAS and MSI-X
errors, because some bits, such as OVF/LR/RF will occur during link up
and down.

The drivers should clear status of all MAC tnl interruption bits but
shouldn't print any message that would mislead the users.

In case that link down and re-up in a short time because of some reasons,
we record when they occurred, and users can query them by debugfs.
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a6345787

net: hns3: add support for dump ncl config by debugfs · ffd140e2

Weihang Li authored Apr 19, 2019

This patch allow users to dump content of NCL_CONFIG by using debugfs
command.
Command format:
	echo dump ncl_config <offset> <length> > cmd
It will print as follows:
	hns3 0000:7d:00.0: offset |    data
	hns3 0000:7d:00.0: 0x0000 | 0x00000020
	hns3 0000:7d:00.0: 0x0004 | 0x00000400
	hns3 0000:7d:00.0: 0x0008 | 0x08020401
Signed-off-by: Weihang Li <liweihang@hisilicon.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ffd140e2

net: hns3: Add support for netif message level settings · bb87be87

Yonglong Liu authored Apr 19, 2019

This patch adds support for network interface message level
settings. The message level can be changed by module parameter
or ethtool.
Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bb87be87

net: hns3: dump more information when tx timeout happens · e511c97d

Jian Shen authored Apr 19, 2019

Currently we just print few information when tx timeout happens.
In order to find out the cause of timeout, this patch prints more
information about the packet statistics, tqp registers and
napi state.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e511c97d

net: hns3: fix loop condition of hns3_get_tx_timeo_queue_info() · fa6c4084

Jian Shen authored Apr 19, 2019

In function hns3_get_tx_timeo_queue_info(), it should use
netdev->num_tx_queues, instead of netdve->real_num_tx_queues
as the loop limitation.

Fixes: 424eb834 ("net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa6c4084

net: hns3: refine tx timeout count handle · beab694a

Jian Shen authored Apr 19, 2019

In current codes, tx_timeout_cnt is used before increased,
then we can see the tx_timeout_count is still 0 from the
print when tx timeout happens, e.g.
"hns3 0000:7d:00.3 eth3: tx_timeout count: 0, queue id: 0, SW_NTU:
 0xa6, SW_NTC: 0xa4, HW_HEAD: 0xa4, HW_TAIL: 0xa6, INT: 0x1"

The tx_timeout_cnt should be updated before used.
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

beab694a

net: hns3: add some debug info for hclgevf_get_mbx_resp() · fbf3cd3f

Huazhong Tan authored Apr 19, 2019

When wait for response timeout or response code not match, there
should be more information for debugging.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fbf3cd3f

net: hns3: add some debug information for hclge_check_event_cause · 147175c9

Huazhong Tan authored Apr 19, 2019

When received vector0 msix and other event interrupt, it should
print the value of the register for debugging.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

147175c9

net: hns3: add reset statistics for VF · c88a6e7d

Huazhong Tan authored Apr 19, 2019

This patch adds some statistics for VF reset.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c88a6e7d

net: hns3: add reset statistics info for PF · f02eb82d

Huazhong Tan authored Apr 19, 2019

This patch adds statistics for PF's reset information,
also, provides a debugfs command to dump these statistics.
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f02eb82d

19 Apr, 2019 11 commits

tcp: properly reset skb->truesize for tx recycling · d7cc399e

Eric Dumazet authored Apr 19, 2019

tcp sendmsg() and sendpage() normally advance skb->data_len
and skb->truesize by the payload added to an skb.

But sendmsg(fd, ..., MSG_ZEROCOPY) has to account for whole pages,
even if a single byte of payload is used in the page.

This means that we can not assume skb->truesize can be adjusted
by skb->data_len. We must instead overwrite its value.

Otherwise skb->truesize is too big and can hit socket sndbuf limit,
especially if the skb is recycled multiple times :/

Fixes: 472c2e07 ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7cc399e

mlxsw: spectrum: Assume CONFIG_NET_DEVLINK is always enabled · 0a9798c1

Jiri Pirko authored Apr 19, 2019

Since commit f6b19b35 ("net: devlink: select NET_DEVLINK
from drivers") adds implicit select of NET_DEVLINK for
mlxsw, the code does not have to deal with the case
when CONFIG_NET_DEVLINK is not enabled. So remove the ifcase
and adjust Makefile.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0a9798c1

tipc: introduce new socket option TIPC_SOCK_RECVQ_USED · 42e5425a

Tung Nguyen authored Apr 18, 2019

When using TIPC_SOCK_RECVQ_DEPTH for getsockopt(), it returns the
number of buffers in receive socket buffer which is not so helpful
for user space applications.

This commit introduces the new option TIPC_SOCK_RECVQ_USED which
returns the current allocated bytes of the receive socket buffer.
This helps user space applications dimension its buffer usage to
avoid buffer overload issue.
Signed-off-by: Tung Nguyen <tung.q.nguyen@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

42e5425a

net: dsa: mv88e6xxx: Only reconfigure MAC when something changes · a26deec6

Andrew Lunn authored Apr 18, 2019

phylink will call the mac_config() callback once per second when
polling a PHY or a fixed link. The MAC driver is not supposed to
reconfigure the MAC if nothing has changed.

Make the mv88e6xxx driver look at the current configuration of the
port, and return early if nothing has changed.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

a26deec6

net: socket: implement 64-bit timestamps · 0768e170

Arnd Bergmann authored Apr 17, 2019

The 'timeval' and 'timespec' data structures used for socket timestamps
are going to be redefined in user space based on 64-bit time_t in future
versions of the C library to deal with the y2038 overflow problem,
which breaks the ABI definition.

Unlike many modern ioctl commands, SIOCGSTAMP and SIOCGSTAMPNS do not
use the _IOR() macro to encode the size of the transferred data, so it
remains ambiguous whether the application uses the old or new layout.

The best workaround I could find is rather ugly: we redefine the command
code based on the size of the respective data structure with a ternary
operator. This lets it get evaluated as late as possible, hopefully after
that structure is visible to the caller. We cannot use an #ifdef here,
because inux/sockios.h might have been included before any libc header
that could determine the size of time_t.

The ioctl implementation now interprets the new command codes as always
referring to the 64-bit structure on all architectures, while the old
architecture specific command code still refers to the old architecture
specific layout. The new command number is only used when they are
actually different.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

0768e170

asm-generic: generalize asm/sockios.h · 5ce5d8a5

Arnd Bergmann authored Apr 17, 2019

ia64, parisc and sparc just use a copy of the generic version
of asm/sockios.h, and x86 is a redirect to the same file, so we
can just let the header file be generated.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

5ce5d8a5

net: rework SIOCGSTAMP ioctl handling · c7cbdbf2

Arnd Bergmann authored Apr 17, 2019

The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
socket protocol handlers, and all of those end up calling the same
sock_get_timestamp()/sock_get_timestampns() helper functions, which
results in a lot of duplicate code.

With the introduction of 64-bit time_t on 32-bit architectures, this
gets worse, as we then need four different ioctl commands in each
socket protocol implementation.

To simplify that, let's add a new .gettstamp() operation in
struct proto_ops, and move ioctl implementation into the common
sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
through.

We can reuse the sock_get_timestamp() implementation, but generalize
it so it can deal with both native and compat mode, as well as
timeval and timespec structures.
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7cbdbf2

Merge branch 'net-support-binding-vlan-dev-link-state-to-vlan-member-bridge-ports' · 1ab83928

David S. Miller authored Apr 19, 2019

Mike Manning says:

====================
net: support binding vlan dev link state to vlan member bridge ports

For vlan filtering on bridges, the bridge may also have vlan devices
as upper devices. For switches, these are used to provide L3 packet
processing for ports that are members of a given vlan.

While it is correct that the admin state for these vlan devices is
either set directly for the device or inherited from the lower device,
the link state is also transferred from the lower device. So this is
always up if the bridge is in admin up state and there is at least one
bridge port that is up, regardless of the vlan that the port is in.

The link state of the vlan device may need to track only the state of
the subset of ports that are also members of the corresponding vlan,
rather than that of all ports.

This series provides an optional vlan flag so that the link state of
the vlan device is only up if there is at least one bridge port that is
up AND is a member of the corresponding vlan.

v2:
   - Address review comments from Nikolay Aleksandrov
     in patches 3 & 4 and add patch 5 to address bridge link down due to STP
v3:
   - Address review comment from Nikolay Aleksandrov
     in patch 4 so as to remove unnecessary inline #ifdef
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

1ab83928

bridge: update vlan dev link state for bridge netdev changes · 8e1acd4f

Mike Manning authored Apr 18, 2019

If vlan bridge binding is enabled, then the link state of a vlan device
that is an upper device of the bridge tracks the state of bridge ports
that are members of that vlan. But this can only be done when the link
state of the bridge is up. If it is down, then the link state of the
vlan devices must also be down. This is to maintain existing behavior
for when STP is enabled and there are no live ports, in which case the
link state for the bridge and any vlan devices is down.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8e1acd4f

bridge: update vlan dev state when port added to or deleted from vlan · 80900acd

Mike Manning authored Apr 18, 2019

If vlan bridge binding is enabled, then the link state of a vlan device
that is an upper device of the bridge should track the state of bridge
ports that are members of that vlan. So if a bridge port becomes or
stops being a member of a vlan, then update the link state of the
vlan device if necessary.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

80900acd

bridge: support binding vlan dev link state to vlan member bridge ports · 9c0ec2e7

Mike Manning authored Apr 18, 2019

In the case of vlan filtering on bridges, the bridge may also have the
corresponding vlan devices as upper devices. A vlan bridge binding mode
is added to allow the link state of the vlan device to track only the
state of the subset of bridge ports that are also members of the vlan,
rather than that of all bridge ports. This mode is set with a vlan flag
rather than a bridge sysfs so that the 8021q module is aware that it
should not set the link state for the vlan device.

If bridge vlan is configured, the bridge device event handling results
in the link state for an upper device being set, if it is a vlan device
with the vlan bridge binding mode enabled. This also sets a
vlan_bridge_binding flag so that subsequent UP/DOWN/CHANGE events for
the ports in that bridge result in a link state update of the vlan
device if required.

The link state of the vlan device is up if there is at least one bridge
port that is a vlan member that is admin & oper up, otherwise its oper
state is IF_OPER_LOWERLAYERDOWN.
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c0ec2e7