Commit abf36703 authored by David S. Miller

Merge branch 'vxlan-MDB-support'

Ido Schimmel says:

====================
vxlan: Add MDB support

tl;dr
=====

This patchset implements MDB support in the VXLAN driver, allowing it to
selectively forward IP multicast traffic to VTEPs with interested
receivers instead of flooding it to all the VTEPs as BUM. The motivating
use case is intra- and inter-subnet multicast forwarding using EVPN
[1][2], which means that MDB entries are only installed by the user
space control plane and no snooping is implemented, thereby avoiding a
lot of unnecessary complexity in the kernel.

Background
==========

Both the bridge and VXLAN drivers have an FDB that allows them to
forward Ethernet frames based on their destination MAC addresses and
VLAN/VNI. These FDBs are managed using the same PF_BRIDGE/RTM_*NEIGH
netlink messages and bridge(8) utility.

However, only the bridge driver has an MDB that allows it to selectively
forward IP multicast packets to bridge ports with interested receivers
behind them, based on (S, G) and (*, G) MDB entries. When these packets
reach the VXLAN driver they are flooded using the "all-zeros" FDB entry
(00:00:00:00:00:00). The entry either includes the list of all the VTEPs
in the tenant domain (when ingress replication is used) or the multicast
address of the BUM tunnel (when P2MP tunnels are used), which all the
VTEPs join.
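
For example, with ingress replication the "all-zeros" entry is
typically appended once per remote VTEP (addresses below are
illustrative):

 # bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 198.51.100.1
 # bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 192.0.2.1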

Networks that make heavy use of multicast in the overlay can benefit
from a solution that allows them to selectively forward IP multicast
traffic only to VTEPs with interested receivers. Such a solution is
described in the next section.

Motivation
==========

RFC 7432 [3] defines a "MAC/IP Advertisement route" (type 2) [4] that
allows VTEPs in the EVPN network to advertise and learn reachability
information for unicast MAC addresses. Traffic destined to a unicast MAC
address can therefore be selectively forwarded to a single VTEP behind
which the MAC is located.

The same is not true for IP multicast traffic. Such traffic is simply
flooded as BUM to all VTEPs in the broadcast domain (BD) / subnet,
regardless of whether a VTEP has interested receivers for the multicast stream
or not. This is especially problematic for overlay networks that make
heavy use of multicast.

The issue is addressed by RFC 9251 [1] that defines a "Selective
Multicast Ethernet Tag Route" (type 6) [5], which allows VTEPs in the
EVPN network to advertise multicast streams that they are interested in.
This is done by having each VTEP suppress IGMP/MLD packets from being
transmitted to the NVE network and instead communicate the information
over BGP to other VTEPs.

The draft in [2] further extends RFC 9251 with procedures to allow
efficient forwarding of IP multicast traffic not only in a given subnet,
but also between different subnets in a tenant domain.

The required changes in the bridge driver to support the above were
already merged in merge commit 8150f0cf ("Merge branch
'bridge-mcast-extensions-for-evpn'"). However, full support entails MDB
support in the VXLAN driver so that it will be able to selectively
forward IP multicast traffic only to VTEPs with interested receivers.
The implementation of this MDB is described in the next section.

Implementation
==============

The user interface is extended to allow user space to specify the
destination VTEP(s) and related parameters. Example usage:

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1
 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 192.0.2.1

 $ bridge -d -s mdb show
 dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 192.0.2.1    0.00
 dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1    0.00
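
Entries are removed with the matching del command; with the iproute2
patches in [6], a specific remote can be deleted as follows
(illustrative):

 # bridge mdb del dev vxlan0 port vxlan0 grp 239.1.1.1 dst 198.51.100.1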

Since the MDB is fully managed by user space and since snooping is not
implemented, only permanent entries can be installed; temporary entries
are rejected by the kernel.
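
For example, the following is expected to be rejected by the kernel
("temp" is the bridge(8) keyword for a temporary entry; the command is
shown for illustration):

 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 temp dst 198.51.100.1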

The netlink interface is extended with a few new attributes in the
RTM_NEWMDB / RTM_DELMDB request messages:

[ struct nlmsghdr ]
[ struct br_port_msg ]
[ MDBA_SET_ENTRY ]
	struct br_mdb_entry
[ MDBA_SET_ENTRY_ATTRS ]
	[ MDBE_ATTR_SOURCE ]
		struct in_addr / struct in6_addr
	[ MDBE_ATTR_SRC_LIST ]
		[ MDBE_SRC_LIST_ENTRY ]
			[ MDBE_SRCATTR_ADDRESS ]
				struct in_addr / struct in6_addr
		[ ... ]
	[ MDBE_ATTR_GROUP_MODE ]
		u8
	[ MDBE_ATTR_RTPORT ]
		u8
	[ MDBE_ATTR_DST ]	// new
		struct in_addr / struct in6_addr
	[ MDBE_ATTR_DST_PORT ]	// new
		u16
	[ MDBE_ATTR_VNI ]	// new
		u32
	[ MDBE_ATTR_IFINDEX ]	// new
		s32
	[ MDBE_ATTR_SRC_VNI ]	// new
		u32

RTM_NEWMDB / RTM_DELMDB responses and notifications are extended with
corresponding attributes.
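
For illustration, a user space control plane could construct such a
request as in the following minimal sketch. It is not part of the
patchset; it assumes libmnl and the updated UAPI headers from this
series, uses illustrative addresses, and omits error handling:

 #include <arpa/inet.h>
 #include <libmnl/libmnl.h>
 #include <linux/if_bridge.h>
 #include <linux/if_ether.h>
 #include <linux/rtnetlink.h>
 #include <string.h>

 /* Build RTM_NEWMDB: add remote VTEP 198.51.100.1 to (*, 239.1.1.1). */
 static struct nlmsghdr *build_mdb_add(char *buf, int vxlan_ifindex)
 {
 	struct br_mdb_entry entry;
 	struct br_port_msg *bpm;
 	struct nlmsghdr *nlh;
 	struct nlattr *nest;
 	struct in_addr dst;

 	nlh = mnl_nlmsg_put_header(buf);
 	nlh->nlmsg_type = RTM_NEWMDB;
 	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_ACK;

 	bpm = mnl_nlmsg_put_extra_header(nlh, sizeof(*bpm));
 	bpm->family = PF_BRIDGE;
 	bpm->ifindex = vxlan_ifindex;	/* the VXLAN netdev */

 	/* Only permanent entries are accepted; there is no snooping. */
 	memset(&entry, 0, sizeof(entry));
 	entry.ifindex = vxlan_ifindex;
 	entry.state = MDB_PERMANENT;
 	entry.addr.proto = htons(ETH_P_IP);
 	inet_pton(AF_INET, "239.1.1.1", &entry.addr.u.ip4);
 	mnl_attr_put(nlh, MDBA_SET_ENTRY, sizeof(entry), &entry);

 	/* New attribute: the destination VTEP for this group. */
 	nest = mnl_attr_nest_start(nlh, MDBA_SET_ENTRY_ATTRS);
 	inet_pton(AF_INET, "198.51.100.1", &dst);
 	mnl_attr_put(nlh, MDBE_ATTR_DST, sizeof(dst), &dst);
 	mnl_attr_nest_end(nlh, nest);

 	return nlh;
 }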

One MDB entry that can be installed in the VXLAN MDB, but not in the
bridge MDB, is the catchall entry (0.0.0.0 / ::). It is used to transmit
unregistered multicast traffic that is not link-local and is especially
useful when inter-subnet multicast forwarding is required. See patch #12
for a detailed explanation and motivation. It is similar to the
"all-zeros" FDB entry that can be installed in the VXLAN FDB, but not in
the bridge FDB.
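
For illustration, such an entry could be installed as follows (the
address is an example):

 # bridge mdb add dev vxlan0 port vxlan0 grp 0.0.0.0 permanent dst 203.0.113.1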

"added_by_star_ex" entries
--------------------------

The bridge driver automatically installs (S, G) MDB port group entries
marked as "added_by_star_ex" whenever it detects that an (S, G) entry
can prevent traffic from being forwarded via a port associated with an
EXCLUDE (*, G) entry. The bridge will add the port to the port group of
the (S, G) entry, thereby creating a new port group entry. The
complexity associated with these entries is not trivial, but it needs to
reside in the bridge driver because it automatically installs MDB
entries in response to snooped IGMP / MLD packets.

The same is not true for the VXLAN MDB: it is entirely managed by user
space, which is fully capable of forming the correct replication lists
on its own. In addition, the complexity associated with
"added_by_star_ex" entries would be higher in the VXLAN driver than in
the bridge: whenever a remote VTEP is added to the catchall entry, it
needs to be added to all the existing MDB entries, since such a remote
has requested that all multicast traffic be forwarded to it. Similarly,
whenever a (*, G) or (S, G) entry is added, all the remotes associated
with the catchall entry need to be added to it.

Given the above, this patchset does not implement support for such
entries. One argument against this decision is that in the future
someone might want to populate the VXLAN MDB in response to decapsulated
IGMP / MLD packets rather than according to EVPN routes. Regardless of
my doubts about this possibility, it can be implemented using a new
VXLAN device knob that will also enable the "added_by_star_ex"
functionality.

Testing
=======

Tested using existing VXLAN and MDB selftests under "net/" and
"net/forwarding/". Added a dedicated selftest in the last patch.

Patchset overview
=================

Patches #1-#3 are small preparations in the bridge driver. I plan to
submit them separately together with an MDB dump test case.

Patches #4-#6 are additional preparations centered around the extraction
of the MDB netlink handlers from the bridge driver to the common
rtnetlink code. This allows reusing the existing MDB netlink messages
for the configuration of the VXLAN MDB.

Patches #7-#9 include more small preparations in the common rtnetlink
code and the VXLAN driver.

Patch #10 implements the MDB control path in the VXLAN driver, which
will allow user space to create, delete, replace and dump MDB entries.

Patches #11-#12 implement the MDB data path in the VXLAN driver,
allowing it to selectively forward IP multicast traffic according to the
matched MDB entry.

Patch #13 finally enables MDB support in the VXLAN driver.

iproute2 patches can be found here [6].

Note that in order to fully support the specifications in [1] and [2],
additional functionality is required from the data path. However, it can
be achieved using existing kernel interfaces, which is why it is not
described here.

Changelog
=========

Since v1 [7]:

Patch #9: Use htons() in 'case' instead of ntohs() in 'switch'.

Since RFC [8]:

Patch #3: Use NL_ASSERT_DUMP_CTX_FITS().
Patch #3: memset the entire context when moving to the next device.
Patch #3: Reset sequence counters when moving to the next device.
Patch #3: Use NL_SET_ERR_MSG_ATTR() in rtnl_validate_mdb_entry().
Patch #7: Remove restrictions regarding mixing of multicast and unicast
remote destination IPs in an MDB entry. While such a configuration does
not make sense to me, it is not forbidden by the VXLAN FDB code and does
not crash the kernel.
Patch #7: Fix check regarding all-zeros MDB entry and source.
Patch #11: New patch.

[1] https://datatracker.ietf.org/doc/html/rfc9251
[2] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast
[3] https://datatracker.ietf.org/doc/html/rfc7432
[4] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2
[5] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1
[6] https://github.com/idosch/iproute2/commits/submit/mdb_vxlan_rfc_v1
[7] https://lore.kernel.org/netdev/20230313145349.3557231-1-idosch@nvidia.com/
[8] https://lore.kernel.org/netdev/20230204170801.3897900-1-idosch@nvidia.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
parents ec47dcb4 62199e3f
@@ -4,4 +4,4 @@
obj-$(CONFIG_VXLAN) += vxlan.o
vxlan-objs := vxlan_core.o vxlan_multicast.o vxlan_vnifilter.o
vxlan-objs := vxlan_core.o vxlan_multicast.o vxlan_vnifilter.o vxlan_mdb.o
@@ -71,53 +71,6 @@ static inline bool vxlan_collect_metadata(struct vxlan_sock *vs)
ip_tunnel_collect_metadata();
}
#if IS_ENABLED(CONFIG_IPV6)
static int vxlan_nla_get_addr(union vxlan_addr *ip, struct nlattr *nla)
{
if (nla_len(nla) >= sizeof(struct in6_addr)) {
ip->sin6.sin6_addr = nla_get_in6_addr(nla);
ip->sa.sa_family = AF_INET6;
return 0;
} else if (nla_len(nla) >= sizeof(__be32)) {
ip->sin.sin_addr.s_addr = nla_get_in_addr(nla);
ip->sa.sa_family = AF_INET;
return 0;
} else {
return -EAFNOSUPPORT;
}
}
static int vxlan_nla_put_addr(struct sk_buff *skb, int attr,
const union vxlan_addr *ip)
{
if (ip->sa.sa_family == AF_INET6)
return nla_put_in6_addr(skb, attr, &ip->sin6.sin6_addr);
else
return nla_put_in_addr(skb, attr, ip->sin.sin_addr.s_addr);
}
#else /* !CONFIG_IPV6 */
static int vxlan_nla_get_addr(union vxlan_addr *ip, struct nlattr *nla)
{
if (nla_len(nla) >= sizeof(struct in6_addr)) {
return -EAFNOSUPPORT;
} else if (nla_len(nla) >= sizeof(__be32)) {
ip->sin.sin_addr.s_addr = nla_get_in_addr(nla);
ip->sa.sa_family = AF_INET;
return 0;
} else {
return -EAFNOSUPPORT;
}
}
static int vxlan_nla_put_addr(struct sk_buff *skb, int attr,
const union vxlan_addr *ip)
{
return nla_put_in_addr(skb, attr, ip->sin.sin_addr.s_addr);
}
#endif
/* Find VXLAN socket based on network namespace, address family, UDP port,
* enabled unshareable flags and socket device binding (see l3mdev with
* non-default VRF).
@@ -2442,9 +2395,8 @@ static int encap_bypass_if_local(struct sk_buff *skb, struct net_device *dev,
return 0;
}
static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
__be32 default_vni, struct vxlan_rdst *rdst,
bool did_rsc)
void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
__be32 default_vni, struct vxlan_rdst *rdst, bool did_rsc)
{
struct dst_cache *dst_cache;
struct ip_tunnel_info *info;
@@ -2791,6 +2743,21 @@ static netdev_tx_t vxlan_xmit(struct sk_buff *skb, struct net_device *dev)
#endif
}
if (vxlan->cfg.flags & VXLAN_F_MDB) {
struct vxlan_mdb_entry *mdb_entry;
rcu_read_lock();
mdb_entry = vxlan_mdb_entry_skb_get(vxlan, skb, vni);
if (mdb_entry) {
netdev_tx_t ret;
ret = vxlan_mdb_xmit(vxlan, mdb_entry, skb);
rcu_read_unlock();
return ret;
}
rcu_read_unlock();
}
eth = eth_hdr(skb);
f = vxlan_find_mac(vxlan, eth->h_dest, vni);
did_rsc = false;
@@ -2926,8 +2893,14 @@ static int vxlan_init(struct net_device *dev)
if (err)
goto err_free_percpu;
err = vxlan_mdb_init(vxlan);
if (err)
goto err_gro_cells_destroy;
return 0;
err_gro_cells_destroy:
gro_cells_destroy(&vxlan->gro_cells);
err_free_percpu:
free_percpu(dev->tstats);
err_vnigroup_uninit:
@@ -2952,6 +2925,8 @@ static void vxlan_uninit(struct net_device *dev)
{
struct vxlan_dev *vxlan = netdev_priv(dev);
vxlan_mdb_fini(vxlan);
if (vxlan->cfg.flags & VXLAN_F_VNIFILTER)
vxlan_vnigroup_uninit(vxlan);
@@ -3108,6 +3083,9 @@ static const struct net_device_ops vxlan_netdev_ether_ops = {
.ndo_fdb_del = vxlan_fdb_delete,
.ndo_fdb_dump = vxlan_fdb_dump,
.ndo_fdb_get = vxlan_fdb_get,
.ndo_mdb_add = vxlan_mdb_add,
.ndo_mdb_del = vxlan_mdb_del,
.ndo_mdb_dump = vxlan_mdb_dump,
.ndo_fill_metadata_dst = vxlan_fill_metadata_dst,
};
@@ -85,6 +85,39 @@ bool vxlan_addr_equal(const union vxlan_addr *a, const union vxlan_addr *b)
return a->sin.sin_addr.s_addr == b->sin.sin_addr.s_addr;
}
static inline int vxlan_nla_get_addr(union vxlan_addr *ip,
const struct nlattr *nla)
{
if (nla_len(nla) >= sizeof(struct in6_addr)) {
ip->sin6.sin6_addr = nla_get_in6_addr(nla);
ip->sa.sa_family = AF_INET6;
return 0;
} else if (nla_len(nla) >= sizeof(__be32)) {
ip->sin.sin_addr.s_addr = nla_get_in_addr(nla);
ip->sa.sa_family = AF_INET;
return 0;
} else {
return -EAFNOSUPPORT;
}
}
static inline int vxlan_nla_put_addr(struct sk_buff *skb, int attr,
const union vxlan_addr *ip)
{
if (ip->sa.sa_family == AF_INET6)
return nla_put_in6_addr(skb, attr, &ip->sin6.sin6_addr);
else
return nla_put_in_addr(skb, attr, ip->sin.sin_addr.s_addr);
}
static inline bool vxlan_addr_is_multicast(const union vxlan_addr *ip)
{
if (ip->sa.sa_family == AF_INET6)
return ipv6_addr_is_multicast(&ip->sin6.sin6_addr);
else
return ipv4_is_multicast(ip->sin.sin_addr.s_addr);
}
#else /* !CONFIG_IPV6 */
static inline
@@ -93,8 +126,41 @@ bool vxlan_addr_equal(const union vxlan_addr *a, const union vxlan_addr *b)
return a->sin.sin_addr.s_addr == b->sin.sin_addr.s_addr;
}
static inline int vxlan_nla_get_addr(union vxlan_addr *ip,
const struct nlattr *nla)
{
if (nla_len(nla) >= sizeof(struct in6_addr)) {
return -EAFNOSUPPORT;
} else if (nla_len(nla) >= sizeof(__be32)) {
ip->sin.sin_addr.s_addr = nla_get_in_addr(nla);
ip->sa.sa_family = AF_INET;
return 0;
} else {
return -EAFNOSUPPORT;
}
}
static inline int vxlan_nla_put_addr(struct sk_buff *skb, int attr,
const union vxlan_addr *ip)
{
return nla_put_in_addr(skb, attr, ip->sin.sin_addr.s_addr);
}
static inline bool vxlan_addr_is_multicast(const union vxlan_addr *ip)
{
return ipv4_is_multicast(ip->sin.sin_addr.s_addr);
}
#endif
static inline size_t vxlan_addr_size(const union vxlan_addr *ip)
{
if (ip->sa.sa_family == AF_INET6)
return sizeof(struct in6_addr);
else
return sizeof(__be32);
}
static inline struct vxlan_vni_node *
vxlan_vnifilter_lookup(struct vxlan_dev *vxlan, __be32 vni)
{
@@ -127,6 +193,8 @@ int vxlan_fdb_update(struct vxlan_dev *vxlan,
__be16 port, __be32 src_vni, __be32 vni,
__u32 ifindex, __u16 ndm_flags, u32 nhid,
bool swdev_notify, struct netlink_ext_ack *extack);
void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
__be32 default_vni, struct vxlan_rdst *rdst, bool did_rsc);
int vxlan_vni_in_use(struct net *src_net, struct vxlan_dev *vxlan,
struct vxlan_config *conf, __be32 vni);
@@ -159,4 +227,20 @@ int vxlan_igmp_join(struct vxlan_dev *vxlan, union vxlan_addr *rip,
int rifindex);
int vxlan_igmp_leave(struct vxlan_dev *vxlan, union vxlan_addr *rip,
int rifindex);
/* vxlan_mdb.c */
int vxlan_mdb_dump(struct net_device *dev, struct sk_buff *skb,
struct netlink_callback *cb);
int vxlan_mdb_add(struct net_device *dev, struct nlattr *tb[], u16 nlmsg_flags,
struct netlink_ext_ack *extack);
int vxlan_mdb_del(struct net_device *dev, struct nlattr *tb[],
struct netlink_ext_ack *extack);
struct vxlan_mdb_entry *vxlan_mdb_entry_skb_get(struct vxlan_dev *vxlan,
struct sk_buff *skb,
__be32 src_vni);
netdev_tx_t vxlan_mdb_xmit(struct vxlan_dev *vxlan,
const struct vxlan_mdb_entry *mdb_entry,
struct sk_buff *skb);
int vxlan_mdb_init(struct vxlan_dev *vxlan);
void vxlan_mdb_fini(struct vxlan_dev *vxlan);
#endif
@@ -1307,6 +1307,17 @@ struct netdev_net_notifier {
* Used to add FDB entries to dump requests. Implementers should add
* entries to skb and update idx with the number of entries.
*
* int (*ndo_mdb_add)(struct net_device *dev, struct nlattr *tb[],
* u16 nlmsg_flags, struct netlink_ext_ack *extack);
* Adds an MDB entry to dev.
* int (*ndo_mdb_del)(struct net_device *dev, struct nlattr *tb[],
* struct netlink_ext_ack *extack);
* Deletes the MDB entry from dev.
* int (*ndo_mdb_dump)(struct net_device *dev, struct sk_buff *skb,
* struct netlink_callback *cb);
* Dumps MDB entries from dev. The first argument (marker) in the netlink
* callback is used by core rtnetlink code.
*
* int (*ndo_bridge_setlink)(struct net_device *dev, struct nlmsghdr *nlh,
* u16 flags, struct netlink_ext_ack *extack)
* int (*ndo_bridge_getlink)(struct sk_buff *skb, u32 pid, u32 seq,
@@ -1569,6 +1580,16 @@ struct net_device_ops {
const unsigned char *addr,
u16 vid, u32 portid, u32 seq,
struct netlink_ext_ack *extack);
int (*ndo_mdb_add)(struct net_device *dev,
struct nlattr *tb[],
u16 nlmsg_flags,
struct netlink_ext_ack *extack);
int (*ndo_mdb_del)(struct net_device *dev,
struct nlattr *tb[],
struct netlink_ext_ack *extack);
int (*ndo_mdb_dump)(struct net_device *dev,
struct sk_buff *skb,
struct netlink_callback *cb);
int (*ndo_bridge_setlink)(struct net_device *dev,
struct nlmsghdr *nlh,
u16 flags,
@@ -3,6 +3,7 @@
#define __NET_VXLAN_H 1
#include <linux/if_vlan.h>
#include <linux/rhashtable-types.h>
#include <net/udp_tunnel.h>
#include <net/dst_metadata.h>
#include <net/rtnetlink.h>
@@ -302,6 +303,10 @@ struct vxlan_dev {
struct vxlan_vni_group __rcu *vnigrp;
struct hlist_head fdb_head[FDB_HASH_SIZE];
struct rhashtable mdb_tbl;
struct hlist_head mdb_list;
unsigned int mdb_seq;
};
#define VXLAN_F_LEARN 0x01
@@ -322,6 +327,7 @@ struct vxlan_dev {
#define VXLAN_F_IPV6_LINKLOCAL 0x8000
#define VXLAN_F_TTL_INHERIT 0x10000
#define VXLAN_F_VNIFILTER 0x20000
#define VXLAN_F_MDB 0x40000
/* Flags that are used in the receive path. These flags must match in
* order for a socket to be shareable
......
@@ -633,6 +633,11 @@ enum {
MDBA_MDB_EATTR_GROUP_MODE,
MDBA_MDB_EATTR_SOURCE,
MDBA_MDB_EATTR_RTPROT,
MDBA_MDB_EATTR_DST,
MDBA_MDB_EATTR_DST_PORT,
MDBA_MDB_EATTR_VNI,
MDBA_MDB_EATTR_IFINDEX,
MDBA_MDB_EATTR_SRC_VNI,
__MDBA_MDB_EATTR_MAX
};
#define MDBA_MDB_EATTR_MAX (__MDBA_MDB_EATTR_MAX - 1)
@@ -728,6 +733,11 @@ enum {
MDBE_ATTR_SRC_LIST,
MDBE_ATTR_GROUP_MODE,
MDBE_ATTR_RTPROT,
MDBE_ATTR_DST,
MDBE_ATTR_DST_PORT,
MDBE_ATTR_VNI,
MDBE_ATTR_IFINDEX,
MDBE_ATTR_SRC_VNI,
__MDBE_ATTR_MAX,
};
#define MDBE_ATTR_MAX (__MDBE_ATTR_MAX - 1)
@@ -468,6 +468,9 @@ static const struct net_device_ops br_netdev_ops = {
.ndo_fdb_del_bulk = br_fdb_delete_bulk,
.ndo_fdb_dump = br_fdb_dump,
.ndo_fdb_get = br_fdb_get,
.ndo_mdb_add = br_mdb_add,
.ndo_mdb_del = br_mdb_del,
.ndo_mdb_dump = br_mdb_dump,
.ndo_bridge_getlink = br_getlink,
.ndo_bridge_setlink = br_setlink,
.ndo_bridge_dellink = br_dellink,
@@ -380,82 +380,37 @@ static int br_mdb_fill_info(struct sk_buff *skb, struct netlink_callback *cb,
return err;
}
static int br_mdb_valid_dump_req(const struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
int br_mdb_dump(struct net_device *dev, struct sk_buff *skb,
struct netlink_callback *cb)
{
struct net_bridge *br = netdev_priv(dev);
struct br_port_msg *bpm;
struct nlmsghdr *nlh;
int err;
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*bpm))) {
NL_SET_ERR_MSG_MOD(extack, "Invalid header for mdb dump request");
return -EINVAL;
}
nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_GETMDB, sizeof(*bpm),
NLM_F_MULTI);
if (!nlh)
return -EMSGSIZE;
bpm = nlmsg_data(nlh);
if (bpm->ifindex) {
NL_SET_ERR_MSG_MOD(extack, "Filtering by device index is not supported for mdb dump request");
return -EINVAL;
}
if (nlmsg_attrlen(nlh, sizeof(*bpm))) {
NL_SET_ERR_MSG(extack, "Invalid data after header in mdb dump request");
return -EINVAL;
}
return 0;
}
static int br_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
struct net_device *dev;
struct net *net = sock_net(skb->sk);
struct nlmsghdr *nlh = NULL;
int idx = 0, s_idx;
if (cb->strict_check) {
int err = br_mdb_valid_dump_req(cb->nlh, cb->extack);
if (err < 0)
return err;
}
s_idx = cb->args[0];
memset(bpm, 0, sizeof(*bpm));
bpm->ifindex = dev->ifindex;
rcu_read_lock();
for_each_netdev_rcu(net, dev) {
if (netif_is_bridge_master(dev)) {
struct net_bridge *br = netdev_priv(dev);
struct br_port_msg *bpm;
if (idx < s_idx)
goto skip;
nlh = nlmsg_put(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_GETMDB,
sizeof(*bpm), NLM_F_MULTI);
if (nlh == NULL)
break;
bpm = nlmsg_data(nlh);
memset(bpm, 0, sizeof(*bpm));
bpm->ifindex = dev->ifindex;
if (br_mdb_fill_info(skb, cb, dev) < 0)
goto out;
if (br_rports_fill_info(skb, &br->multicast_ctx) < 0)
goto out;
cb->args[1] = 0;
nlmsg_end(skb, nlh);
skip:
idx++;
}
}
err = br_mdb_fill_info(skb, cb, dev);
if (err)
goto out;
err = br_rports_fill_info(skb, &br->multicast_ctx);
if (err)
goto out;
out:
if (nlh)
nlmsg_end(skb, nlh);
rcu_read_unlock();
cb->args[0] = idx;
return skb->len;
nlmsg_end(skb, nlh);
return err;
}
static int nlmsg_populate_mdb_fill(struct sk_buff *skb,
@@ -683,60 +638,6 @@ static const struct nla_policy br_mdbe_attrs_pol[MDBE_ATTR_MAX + 1] = {
[MDBE_ATTR_RTPROT] = NLA_POLICY_MIN(NLA_U8, RTPROT_STATIC),
};
static int validate_mdb_entry(const struct nlattr *attr,
struct netlink_ext_ack *extack)
{
struct br_mdb_entry *entry = nla_data(attr);
if (nla_len(attr) != sizeof(struct br_mdb_entry)) {
NL_SET_ERR_MSG_MOD(extack, "Invalid MDBA_SET_ENTRY attribute length");
return -EINVAL;
}
if (entry->ifindex == 0) {
NL_SET_ERR_MSG_MOD(extack, "Zero entry ifindex is not allowed");
return -EINVAL;
}
if (entry->addr.proto == htons(ETH_P_IP)) {
if (!ipv4_is_multicast(entry->addr.u.ip4)) {
NL_SET_ERR_MSG_MOD(extack, "IPv4 entry group address is not multicast");
return -EINVAL;
}
if (ipv4_is_local_multicast(entry->addr.u.ip4)) {
NL_SET_ERR_MSG_MOD(extack, "IPv4 entry group address is local multicast");
return -EINVAL;
}
#if IS_ENABLED(CONFIG_IPV6)
} else if (entry->addr.proto == htons(ETH_P_IPV6)) {
if (ipv6_addr_is_ll_all_nodes(&entry->addr.u.ip6)) {
NL_SET_ERR_MSG_MOD(extack, "IPv6 entry group address is link-local all nodes");
return -EINVAL;
}
#endif
} else if (entry->addr.proto == 0) {
/* L2 mdb */
if (!is_multicast_ether_addr(entry->addr.u.mac_addr)) {
NL_SET_ERR_MSG_MOD(extack, "L2 entry group is not multicast");
return -EINVAL;
}
} else {
NL_SET_ERR_MSG_MOD(extack, "Unknown entry protocol");
return -EINVAL;
}
if (entry->state != MDB_PERMANENT && entry->state != MDB_TEMPORARY) {
NL_SET_ERR_MSG_MOD(extack, "Unknown entry state");
return -EINVAL;
}
if (entry->vid >= VLAN_VID_MASK) {
NL_SET_ERR_MSG_MOD(extack, "Invalid entry VLAN id");
return -EINVAL;
}
return 0;
}
static bool is_valid_mdb_source(struct nlattr *attr, __be16 proto,
struct netlink_ext_ack *extack)
{
@@ -1299,49 +1200,16 @@ static int br_mdb_config_attrs_init(struct nlattr *set_attrs,
return 0;
}
static const struct nla_policy mdba_policy[MDBA_SET_ENTRY_MAX + 1] = {
[MDBA_SET_ENTRY_UNSPEC] = { .strict_start_type = MDBA_SET_ENTRY_ATTRS + 1 },
[MDBA_SET_ENTRY] = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
validate_mdb_entry,
sizeof(struct br_mdb_entry)),
[MDBA_SET_ENTRY_ATTRS] = { .type = NLA_NESTED },
};
static int br_mdb_config_init(struct net *net, const struct nlmsghdr *nlh,
struct br_mdb_config *cfg,
static int br_mdb_config_init(struct br_mdb_config *cfg, struct net_device *dev,
struct nlattr *tb[], u16 nlmsg_flags,
struct netlink_ext_ack *extack)
{
struct nlattr *tb[MDBA_SET_ENTRY_MAX + 1];
struct br_port_msg *bpm;
struct net_device *dev;
int err;
err = nlmsg_parse_deprecated(nlh, sizeof(*bpm), tb,
MDBA_SET_ENTRY_MAX, mdba_policy, extack);
if (err)
return err;
struct net *net = dev_net(dev);
memset(cfg, 0, sizeof(*cfg));
cfg->filter_mode = MCAST_EXCLUDE;
cfg->rt_protocol = RTPROT_STATIC;
cfg->nlflags = nlh->nlmsg_flags;
bpm = nlmsg_data(nlh);
if (!bpm->ifindex) {
NL_SET_ERR_MSG_MOD(extack, "Invalid bridge ifindex");
return -EINVAL;
}
dev = __dev_get_by_index(net, bpm->ifindex);
if (!dev) {
NL_SET_ERR_MSG_MOD(extack, "Bridge device doesn't exist");
return -ENODEV;
}
if (!netif_is_bridge_master(dev)) {
NL_SET_ERR_MSG_MOD(extack, "Device is not a bridge");
return -EOPNOTSUPP;
}
cfg->nlflags = nlmsg_flags;
cfg->br = netdev_priv(dev);
@@ -1355,11 +1223,6 @@ static int br_mdb_config_init(struct net *net, const struct nlmsghdr *nlh,
return -EINVAL;
}
if (NL_REQ_ATTR_CHECK(extack, NULL, tb, MDBA_SET_ENTRY)) {
NL_SET_ERR_MSG_MOD(extack, "Missing MDBA_SET_ENTRY attribute");
return -EINVAL;
}
cfg->entry = nla_data(tb[MDBA_SET_ENTRY]);
if (cfg->entry->ifindex != cfg->br->dev->ifindex) {
@@ -1383,6 +1246,12 @@ static int br_mdb_config_init(struct net *net, const struct nlmsghdr *nlh,
}
}
if (cfg->entry->addr.proto == htons(ETH_P_IP) &&
ipv4_is_zeronet(cfg->entry->addr.u.ip4)) {
NL_SET_ERR_MSG_MOD(extack, "IPv4 entry group address 0.0.0.0 is not allowed");
return -EINVAL;
}
if (tb[MDBA_SET_ENTRY_ATTRS])
return br_mdb_config_attrs_init(tb[MDBA_SET_ENTRY_ATTRS], cfg,
extack);
@@ -1397,16 +1266,15 @@ static void br_mdb_config_fini(struct br_mdb_config *cfg)
br_mdb_config_src_list_fini(cfg);
}
static int br_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
int br_mdb_add(struct net_device *dev, struct nlattr *tb[], u16 nlmsg_flags,
struct netlink_ext_ack *extack)
{
struct net *net = sock_net(skb->sk);
struct net_bridge_vlan_group *vg;
struct net_bridge_vlan *v;
struct br_mdb_config cfg;
int err;
err = br_mdb_config_init(net, nlh, &cfg, extack);
err = br_mdb_config_init(&cfg, dev, tb, nlmsg_flags, extack);
if (err)
return err;
@@ -1500,16 +1368,15 @@ static int __br_mdb_del(const struct br_mdb_config *cfg)
return err;
}
static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
int br_mdb_del(struct net_device *dev, struct nlattr *tb[],
struct netlink_ext_ack *extack)
{
struct net *net = sock_net(skb->sk);
struct net_bridge_vlan_group *vg;
struct net_bridge_vlan *v;
struct br_mdb_config cfg;
int err;
err = br_mdb_config_init(net, nlh, &cfg, extack);
err = br_mdb_config_init(&cfg, dev, tb, 0, extack);
if (err)
return err;
@@ -1534,17 +1401,3 @@ static int br_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
br_mdb_config_fini(&cfg);
return err;
}
void br_mdb_init(void)
{
rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_GETMDB, NULL, br_mdb_dump, 0);
rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_NEWMDB, br_mdb_add, NULL, 0);
rtnl_register_module(THIS_MODULE, PF_BRIDGE, RTM_DELMDB, br_mdb_del, NULL, 0);
}
void br_mdb_uninit(void)
{
rtnl_unregister(PF_BRIDGE, RTM_GETMDB);
rtnl_unregister(PF_BRIDGE, RTM_NEWMDB);
rtnl_unregister(PF_BRIDGE, RTM_DELMDB);
}
@@ -1886,7 +1886,6 @@ int __init br_netlink_init(void)
{
int err;
br_mdb_init();
br_vlan_rtnl_init();
rtnl_af_register(&br_af_ops);
@@ -1898,13 +1897,11 @@ int __init br_netlink_init(void)
out_af:
rtnl_af_unregister(&br_af_ops);
br_mdb_uninit();
return err;
}
void br_netlink_fini(void)
{
br_mdb_uninit();
br_vlan_rtnl_uninit();
rtnl_af_unregister(&br_af_ops);
rtnl_link_unregister(&br_link_ops);
@@ -981,8 +981,12 @@ void br_multicast_get_stats(const struct net_bridge *br,
u32 br_multicast_ngroups_get(const struct net_bridge_mcast_port *pmctx);
void br_multicast_ngroups_set_max(struct net_bridge_mcast_port *pmctx, u32 max);
u32 br_multicast_ngroups_get_max(const struct net_bridge_mcast_port *pmctx);
void br_mdb_init(void);
void br_mdb_uninit(void);
int br_mdb_add(struct net_device *dev, struct nlattr *tb[], u16 nlmsg_flags,
struct netlink_ext_ack *extack);
int br_mdb_del(struct net_device *dev, struct nlattr *tb[],
struct netlink_ext_ack *extack);
int br_mdb_dump(struct net_device *dev, struct sk_buff *skb,
struct netlink_callback *cb);
void br_multicast_host_join(const struct net_bridge_mcast *brmctx,
struct net_bridge_mdb_entry *mp, bool notify);
void br_multicast_host_leave(struct net_bridge_mdb_entry *mp, bool notify);
@@ -1374,12 +1378,22 @@ static inline bool br_multicast_querier_exists(struct net_bridge_mcast *brmctx,
return false;
}
static inline void br_mdb_init(void)
static inline int br_mdb_add(struct net_device *dev, struct nlattr *tb[],
u16 nlmsg_flags, struct netlink_ext_ack *extack)
{
return -EOPNOTSUPP;
}
static inline int br_mdb_del(struct net_device *dev, struct nlattr *tb[],
struct netlink_ext_ack *extack)
{
return -EOPNOTSUPP;
}
static inline void br_mdb_uninit(void)
static inline int br_mdb_dump(struct net_device *dev, struct sk_buff *skb,
struct netlink_callback *cb)
{
return 0;
}
static inline int br_mdb_hash_init(struct net_bridge *br)
@@ -54,6 +54,9 @@
#include <net/rtnetlink.h>
#include <net/net_namespace.h>
#include <net/devlink.h>
#if IS_ENABLED(CONFIG_IPV6)
#include <net/addrconf.h>
#endif
#include "dev.h"
@@ -6063,6 +6066,217 @@ static int rtnl_stats_set(struct sk_buff *skb, struct nlmsghdr *nlh,
return 0;
}
static int rtnl_mdb_valid_dump_req(const struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
{
struct br_port_msg *bpm;
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*bpm))) {
NL_SET_ERR_MSG(extack, "Invalid header for mdb dump request");
return -EINVAL;
}
bpm = nlmsg_data(nlh);
if (bpm->ifindex) {
NL_SET_ERR_MSG(extack, "Filtering by device index is not supported for mdb dump request");
return -EINVAL;
}
if (nlmsg_attrlen(nlh, sizeof(*bpm))) {
NL_SET_ERR_MSG(extack, "Invalid data after header in mdb dump request");
return -EINVAL;
}
return 0;
}
struct rtnl_mdb_dump_ctx {
long idx;
};
static int rtnl_mdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
struct rtnl_mdb_dump_ctx *ctx = (void *)cb->ctx;
struct net *net = sock_net(skb->sk);
struct net_device *dev;
int idx, s_idx;
int err;
NL_ASSERT_DUMP_CTX_FITS(struct rtnl_mdb_dump_ctx);
if (cb->strict_check) {
err = rtnl_mdb_valid_dump_req(cb->nlh, cb->extack);
if (err)
return err;
}
s_idx = ctx->idx;
idx = 0;
for_each_netdev(net, dev) {
if (idx < s_idx)
goto skip;
if (!dev->netdev_ops->ndo_mdb_dump)
goto skip;
err = dev->netdev_ops->ndo_mdb_dump(dev, skb, cb);
if (err == -EMSGSIZE)
goto out;
/* Moving on to next device, reset markers and sequence
* counters since they are all maintained per-device.
*/
memset(cb->ctx, 0, sizeof(cb->ctx));
cb->prev_seq = 0;
cb->seq = 0;
skip:
idx++;
}
out:
ctx->idx = idx;
return skb->len;
}
static int rtnl_validate_mdb_entry(const struct nlattr *attr,
struct netlink_ext_ack *extack)
{
struct br_mdb_entry *entry = nla_data(attr);
if (nla_len(attr) != sizeof(struct br_mdb_entry)) {
NL_SET_ERR_MSG_ATTR(extack, attr, "Invalid attribute length");
return -EINVAL;
}
if (entry->ifindex == 0) {
NL_SET_ERR_MSG(extack, "Zero entry ifindex is not allowed");
return -EINVAL;
}
if (entry->addr.proto == htons(ETH_P_IP)) {
if (!ipv4_is_multicast(entry->addr.u.ip4) &&
!ipv4_is_zeronet(entry->addr.u.ip4)) {
NL_SET_ERR_MSG(extack, "IPv4 entry group address is not multicast or 0.0.0.0");
return -EINVAL;
}
if (ipv4_is_local_multicast(entry->addr.u.ip4)) {
NL_SET_ERR_MSG(extack, "IPv4 entry group address is local multicast");
return -EINVAL;
}
#if IS_ENABLED(CONFIG_IPV6)
} else if (entry->addr.proto == htons(ETH_P_IPV6)) {
if (ipv6_addr_is_ll_all_nodes(&entry->addr.u.ip6)) {
NL_SET_ERR_MSG(extack, "IPv6 entry group address is link-local all nodes");
return -EINVAL;
}
#endif
} else if (entry->addr.proto == 0) {
/* L2 mdb */
if (!is_multicast_ether_addr(entry->addr.u.mac_addr)) {
NL_SET_ERR_MSG(extack, "L2 entry group is not multicast");
return -EINVAL;
}
} else {
NL_SET_ERR_MSG(extack, "Unknown entry protocol");
return -EINVAL;
}
if (entry->state != MDB_PERMANENT && entry->state != MDB_TEMPORARY) {
NL_SET_ERR_MSG(extack, "Unknown entry state");
return -EINVAL;
}
if (entry->vid >= VLAN_VID_MASK) {
NL_SET_ERR_MSG(extack, "Invalid entry VLAN id");
return -EINVAL;
}
return 0;
}
static const struct nla_policy mdba_policy[MDBA_SET_ENTRY_MAX + 1] = {
[MDBA_SET_ENTRY_UNSPEC] = { .strict_start_type = MDBA_SET_ENTRY_ATTRS + 1 },
[MDBA_SET_ENTRY] = NLA_POLICY_VALIDATE_FN(NLA_BINARY,
rtnl_validate_mdb_entry,
sizeof(struct br_mdb_entry)),
[MDBA_SET_ENTRY_ATTRS] = { .type = NLA_NESTED },
};
static int rtnl_mdb_add(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
{
struct nlattr *tb[MDBA_SET_ENTRY_MAX + 1];
struct net *net = sock_net(skb->sk);
struct br_port_msg *bpm;
struct net_device *dev;
int err;
err = nlmsg_parse_deprecated(nlh, sizeof(*bpm), tb,
MDBA_SET_ENTRY_MAX, mdba_policy, extack);
if (err)
return err;
bpm = nlmsg_data(nlh);
if (!bpm->ifindex) {
NL_SET_ERR_MSG(extack, "Invalid ifindex");
return -EINVAL;
}
dev = __dev_get_by_index(net, bpm->ifindex);
if (!dev) {
NL_SET_ERR_MSG(extack, "Device doesn't exist");
return -ENODEV;
}
if (NL_REQ_ATTR_CHECK(extack, NULL, tb, MDBA_SET_ENTRY)) {
NL_SET_ERR_MSG(extack, "Missing MDBA_SET_ENTRY attribute");
return -EINVAL;
}
if (!dev->netdev_ops->ndo_mdb_add) {
NL_SET_ERR_MSG(extack, "Device does not support MDB operations");
return -EOPNOTSUPP;
}
return dev->netdev_ops->ndo_mdb_add(dev, tb, nlh->nlmsg_flags, extack);
}
static int rtnl_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
{
struct nlattr *tb[MDBA_SET_ENTRY_MAX + 1];
struct net *net = sock_net(skb->sk);
struct br_port_msg *bpm;
struct net_device *dev;
int err;
err = nlmsg_parse_deprecated(nlh, sizeof(*bpm), tb,
MDBA_SET_ENTRY_MAX, mdba_policy, extack);
if (err)
return err;
bpm = nlmsg_data(nlh);
if (!bpm->ifindex) {
NL_SET_ERR_MSG(extack, "Invalid ifindex");
return -EINVAL;
}
dev = __dev_get_by_index(net, bpm->ifindex);
if (!dev) {
NL_SET_ERR_MSG(extack, "Device doesn't exist");
return -ENODEV;
}
if (NL_REQ_ATTR_CHECK(extack, NULL, tb, MDBA_SET_ENTRY)) {
NL_SET_ERR_MSG(extack, "Missing MDBA_SET_ENTRY attribute");
return -EINVAL;
}
if (!dev->netdev_ops->ndo_mdb_del) {
NL_SET_ERR_MSG(extack, "Device does not support MDB operations");
return -EOPNOTSUPP;
}
return dev->netdev_ops->ndo_mdb_del(dev, tb, extack);
}
/* Process one rtnetlink message. */
static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
@@ -6297,4 +6511,8 @@ void __init rtnetlink_init(void)
rtnl_register(PF_UNSPEC, RTM_GETSTATS, rtnl_stats_get, rtnl_stats_dump,
0);
rtnl_register(PF_UNSPEC, RTM_SETSTATS, rtnl_stats_set, NULL, 0);
rtnl_register(PF_BRIDGE, RTM_GETMDB, NULL, rtnl_mdb_dump, 0);
rtnl_register(PF_BRIDGE, RTM_NEWMDB, rtnl_mdb_add, NULL, 0);
rtnl_register(PF_BRIDGE, RTM_DELMDB, rtnl_mdb_del, NULL, 0);
}
@@ -81,6 +81,7 @@ TEST_GEN_FILES += sctp_hello
TEST_GEN_FILES += csum
TEST_GEN_FILES += nat6to4.o
TEST_GEN_FILES += ip_local_port_range
TEST_PROGS += test_vxlan_mdb.sh
TEST_FILES := settings
@@ -48,3 +48,4 @@ CONFIG_BAREUDP=m
CONFIG_IPV6_IOAM6_LWTUNNEL=y
CONFIG_CRYPTO_SM4_GENERIC=y
CONFIG_AMT=m
CONFIG_VXLAN=m