- 11 Oct, 2018 8 commits
-
-
David Ahern authored
ipv6_route_table_template is exported but there are no users outside of route.c. Make it static. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
The NLM_F_DUMP_PROPER_HDR netlink flag was replaced by a setsockopt. Update the comment in rtnl_stats_dump. Fixes: 841891ec ("rtnetlink: Update rtnl_stats_dump for strict data checking") Reported-by: Christian Brauner <christian@brauner.io> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Move setting of local variable ifm to after the message parsing in valid_fdb_dump_legacy. Avoid potential future use of unchecked variable. Fixes: 8dfbda19 ("rtnetlink: Move input checking for rtnl_fdb_dump to helper") Reported-by: Christian Brauner <christian@brauner.io> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Ido Schimmel says: ==================== mlxsw: selftests: Few small updates First patch fixes a typo in mlxsw. Second patch fixes a race in a recent test. Third patch makes a recent test executable. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
This is a self-standing test and as such should be itself executable. Fixes: b5638d46 ("selftests: mlxsw: Add a test for UC behavior under MC flood") Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Petr Machata authored
Immediately after mlxsw module is probed and lldpad started, added APP entries are briefly in "unknown" state before becoming "pending". That's the state that lldpad_app_wait_set() typically sees, and since there are no pending entries at that time, it bails out. However the entries have not been pushed to the kernel yet at that point, and thus the test case fails. Fix by waiting for both unknown and pending entries to disappear before proceeding. Fixes: d159261f ("selftests: mlxsw: Add test for trust-DSCP") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Nir Dotan authored
Signed-off-by: Nir Dotan <nird@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Colin Ian King authored
There are several variables being initialized that are being set later and hence the initialization is redundant and can be removed. Remove then. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 10 Oct, 2018 17 commits
-
-
David S. Miller authored
Sunil Goutham says: ==================== octeontx2-af: Add RVU Admin Function driver Resource virtualization unit (RVU) on Marvell's OcteonTX2 SOC maps HW resources from the network, crypto and other functional blocks into PCI-compatible physical and virtual functions. Each functional block again has multiple local functions (LFs) for provisioning to PCI devices. RVU supports multiple PCIe SRIOV physical functions (PFs) and virtual functions (VFs). PF0 is called the administrative / admin function (AF) and has privileges to provision RVU functional block's LFs to each of the PF/VF. RVU managed networking functional blocks - Network pool allocator (NPA) - Network interface controller (NIX) - Network parser CAM (NPC) - Schedule/Synchronize/Order unit (SSO) RVU managed non-networking functional blocks - Crypto accelerator (CPT) - Scheduled timers unit (TIM) - Schedule/Synchronize/Order unit (SSO) Used for both networking and non networking usecases - Compression (upcoming in future variants of the silicons) Resource provisioning examples - A PF/VF with NIX-LF & NPA-LF resources works as a pure network device - A PF/VF with CPT-LF resource works as a pure cyrpto offload device. This admin function driver neither receives any data nor processes it i.e no I/O, a configuration only driver. PF/VFs communicates with AF via a shared memory region (mailbox). Upon receiving requests from PF/VF, AF does resource provisioning and other HW configuration. AF is always attached to host, but PF/VFs may be used by host kernel itself, or attached to VMs or to userspace applications like DPDK etc. So AF has to handle provisioning/configuration requests sent by any device from any domain. This patch series adds logic for the following - RVU AF driver with functional blocks provisioning support. - Mailbox infrastructure for communication between AF and PFs. - CGX (MAC controller) driver which communicates with firmware for managing physical ethernet interfaces. AF collects info from this driver and forwards the same to the PF/VFs uaing these interfaces. This is the first set of patches out of 80+ patches. Changes from v8: 1 Removed unnecessary typecasts in entire series - Suggested by David Miller 2 Added COMPILE_TEST to AF driver - Suggested by Arnd Bergmann 3 Changed udelay() to usleep_range() in rvu_poll_reg - Suggested by Arnd Bergmann 4 MSIX vector base IOMMU mapping is done using dma_map_resource() API instead of dma_map_single() as it accepts physical address. - Issue pointed by Arnd Bergmann Changes from v7: 1 Removed unnecessary typecasts in mbox infra code. - Suggested by David Miller 2 Fixed MAINTAINERS patch - Suggested by Joe Perches Changes from v6: Fixed ordering of local variables from longest to shortest line. - Suggested by David Miller Changes from v5: Modified bitfield based command structures to bitmasks for communication with firmware, to address endianness issues. - Suggested by Arnd Bergmann Changes from v4: 1 Removed module author/version/description from CGX driver as it's now merged with AF driver module. - Suggested by Arnd Bergmann 2 Added big-endian bitfields for CGX's kernel <=> firmware communication command structures. - Suggested by Arnd Bergmann Changes from v3: Moved driver from drivers/soc to drivers/net/ethernet - Suggested by Arnd Bergmann https://patchwork.kernel.org/cover/10587635/ Changes from v2: No changes, submitted again with netdev mailing list in loop. - Suggested by Arnd Bergmann and Andrew Lunn Changes from v1: 1 Merged RVU admin function and CGX drivers into a single module - Suggested by Arnd Bergmann 2 Pulled mbox communication APIs into a separate module to remove admin function driver dependency in a VM where AF is not attached. - Suggested by Arnd Bergmann ==================== Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
Added maintainers entry for Marvell OcteonTX2 SOC's RVU admin function driver. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Linu Cherian authored
Added support in RVU AF driver to register for CGX LMAC link status change events from firmware and managing them. Processing part will be added in followup patches. - Introduced eventqueue for posting events from cgx lmac. Queueing mechanism will ensure that events can be posted and firmware can be acked immediately and hence event reception and processing are decoupled. - Events gets added to the queue by notification callback. Notification callback is expected to be atomic, since it is called from interrupt context. - Events are dequeued and processed in a worker thread. Signed-off-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Linu Cherian authored
CGX LMAC initialization, link status polling etc is done by low level secure firmware. For link management this patch adds a interface or communication mechanism between firmware and this kernel CGX driver. - Firmware interface specification is defined in cgx_fw_if.h. - Support to send/receive commands/events to/form firmware. - events/commands implemented * link up * link down * reading firmware version Signed-off-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Nithya Mani <nmani@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Linu Cherian authored
Each of the enabled CGX LMAC is considered a physical interface and RVU PFs are mapped to these. VFs of these SRIOV PFs will be virtual interfaces and share CGX LMAC along with PF. This mapping info will be used later on for Rx/Tx pkt steering. Signed-off-by: Linu Cherian <lcherian@marvell.com> Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
This patch adds basic template for Marvell OcteonTX2's CGX ethernet interface driver. Just the probe. RVU AF driver will use APIs exported by this driver for various things like PF to physical interface mapping, loopback mode, interface stats etc. Hence marged both drivers into a single module. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Geetha sowjanya authored
HW interprets RVU_AF_MSIXTR_BASE address as an IOVA, hence create a IOMMU mapping for the physcial address configured by firmware and reconfig RVU_AF_MSIXTR_BASE with IOVA. Signed-off-by: Geetha sowjanya <gakula@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
Firmware configures a certain number of MSIX vectors to each of enabled RVU PF/VF. When a block LF is attached to a PF/VF, number of MSIX vectors needed by that LF are set aside (out of PF/VF's total MSIX vectors) and LF's msix_offset is configured in HW. Also added support for a RVU PF/VF to retrieve that block LF's MSIX vector offset information from AF via mbox. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
Added support for a RVU PF/VF to request AF via mailbox to attach or detach NPA/NIX/SSO/SSOW/TIM/CPT block LFs. Also supports partial detachment and modifying current LF attached count of a certian block type. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
Scan all RVU blocks to find any 'LF to RVU PF/VF' mapping done by low level firmware. If found any, mark them as used in respective block's LF bitmap and also save mapped PF/VF's PF_FUNC info. This is done to avoid reattaching a block LF to a different RVU PF/VF. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Aleksey Makarov authored
With 10's of mailbox messages expected to be handled in future, checking for message id could become a lengthy switch case. Hence added a macro to auto generate the switch case for each msg id. Signed-off-by: Aleksey Makarov <amakarov@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
This patch adds support for mailbox interrupt and message handling. Mapped mailbox region and registered a workqueue for message handling. Enabled mailbox IRQ of RVU PFs and registered a interrupt handler. When IRQ is triggered work is added to the mbox workqueue for msgs to get processed. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Aleksey Makarov authored
This patch adds mailbox support infrastructure APIs. Each RVU device has a dedicated 64KB mailbox region shared with it's peer for communication. RVU AF has a separate mailbox region shared with each of RVU PFs and a RVU PF has a separate region shared with each of it's VF. These set of APIs are used by this driver (RVU AF) and other RVU PF/VF drivers eg netdev, crypto e.t.c. Signed-off-by: Aleksey Makarov <amakarov@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Lukasz Bartosik <lbartosik@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
This patch gathers NPA/NIX/SSO/SSOW/TIM/CPT RVU blocks's HW info like number of LFs. Important register offsets saved for later use to avoid code duplication for each block. A bitmap is allocated for each of the blocks which later on will be used to allocate a LF for a RVU PF/VF. Also added RVU NIX/NPA block registers and few registers of other blocks. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
Go through all BLKADDRs and check which ones are implemented on this silicon and do a HW reset of each implemented block. Also added all RVU AF and PF register offsets. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sunil Goutham authored
This patch adds basic template for Marvell OcteonTX2's resource virtualization unit (RVU) admin function (AF) driver. Just the driver registration and probe. Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sudarsana Reddy Kalluru authored
Currently driver registers to physical link notifications (of the device) from Management firmware (MFW). Driver doesn't get notified if there's a change in the virtual link e.g., link-flap on the peer PF interface. Virtual link indication from MFW reflects the per PF link status instead of the physical link. The patch adds driver support for, - Advertising the virtual link support to MFW. - Handling the virtual link notification from MFW. Please consider applying it to 'net-next'. Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Tomer Tayar <Tomer.Tayar@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 09 Oct, 2018 4 commits
-
-
Ganesh Goudar authored
Add thermal zone support to monitor ASIC's temperature. Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alaa Hleihel authored
When memory is limited (on kdump kernel), reduce size of rx and tx rings. Also reduce the number of rx rings. Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller authored
Alexei Starovoitov says: ==================== pull-request: bpf-next 2018-10-08 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer which allow BPF programs to perform lookups for sockets in a network namespace. This would allow programs to determine early on in processing whether the stack is expecting to receive the packet, and perform some action (eg drop, forward somewhere) based on this information. 2) per-cpu cgroup local storage from Roman Gushchin. Per-cpu cgroup local storage is very similar to simple cgroup storage except all the data is per-cpu. The main goal of per-cpu variant is to implement super fast counters (e.g. packet counters), which don't require neither lookups, neither atomic operations in a fast path. The example of these hybrid counters is in selftests/bpf/netcnt_prog.c 3) allow HW offload of programs with BPF-to-BPF function calls from Quentin Monnet 4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub Kicinski 5) rename of libbpf interfaces from Andrey Ignatov. libbpf is maturing as a library and should follow good practices in library design and implementation to play well with other libraries. This patch set brings consistent naming convention to global symbols. 6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov to let Apache2 projects use libbpf 7) various AF_XDP fixes from Björn and Magnus ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller authored
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for your net-next tree: 1) Support for matching on ipsec policy already set in the route, from Florian Westphal. 2) Split set destruction into deactivate and destroy phase to make it fit better into the transaction infrastructure, also from Florian. This includes a patch to warn on imbalance when setting the new activate and deactivate interfaces. 3) Release transaction list from the workqueue to remove expensive synchronize_rcu() from configuration plane path. This speeds up configuration plane quite a bit. From Florian Westphal. 4) Add new xfrm/ipsec extension, this new extension allows you to match for ipsec tunnel keys such as source and destination address, spi and reqid. From Máté Eckl and Florian Westphal. 5) Add secmark support, this includes connsecmark too, patches from Christian Gottsche. 6) Allow to specify remaining bytes in xt_quota, from Chenbo Feng. One follow up patch to calm a clang warning for this one, from Nathan Chancellor. 7) Flush conntrack entries based on layer 3 family, from Kristian Evensen. 8) New revision for cgroups2 to shrink the path field. 9) Get rid of obsolete need_conntrack(), as a result from recent demodularization works. 10) Use WARN_ON instead of BUG_ON, from Florian Westphal. 11) Unused exported symbol in nf_nat_ipv4_fn(), from Florian. 12) Remove superfluous check for timeout netlink parser and dump functions in layer 4 conntrack helpers. 13) Unnecessary redundant rcu read side locks in NAT redirect, from Taehee Yoo. 14) Pass nf_hook_state structure to error handlers, patch from Florian Westphal. 15) Remove ->new() interface from layer 4 protocol trackers. Place them in the ->packet() interface. From Florian. 16) Place conntrack ->error() handling in the ->packet() interface. Patches from Florian Westphal. 17) Remove unused parameter in the pernet initialization path, also from Florian. 18) Remove additional parameter to specify layer 3 protocol when looking up for protocol tracker. From Florian. 19) Shrink array of layer 4 protocol trackers, from Florian. 20) Check for linear skb only once from the ALG NAT mangling codebase, from Taehee Yoo. 21) Use rhashtable_walk_enter() instead of deprecated rhashtable_walk_init(), also from Taehee. 22) No need to flush all conntracks when only one single address is gone, from Tan Hu. 23) Remove redundant check for NAT flags in flowtable code, from Taehee Yoo. 24) Use rhashtable_lookup() instead of rhashtable_lookup_fast() from netfilter codebase, since rcu read lock side is already assumed in this path. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 08 Oct, 2018 11 commits
-
-
Arnd Bergmann authored
The newly added TCP and UDP handling fails to link when CONFIG_INET is disabled: net/core/filter.o: In function `sk_lookup': filter.c:(.text+0x7ff8): undefined reference to `tcp_hashinfo' filter.c:(.text+0x7ffc): undefined reference to `tcp_hashinfo' filter.c:(.text+0x8020): undefined reference to `__inet_lookup_established' filter.c:(.text+0x8058): undefined reference to `__inet_lookup_listener' filter.c:(.text+0x8068): undefined reference to `udp_table' filter.c:(.text+0x8070): undefined reference to `udp_table' filter.c:(.text+0x808c): undefined reference to `__udp4_lib_lookup' net/core/filter.o: In function `bpf_sk_release': filter.c:(.text+0x82e8): undefined reference to `sock_gen_put' Wrap the related sections of code in #ifdefs for the config option. Furthermore, sk_lookup() should always have been marked 'static', this also avoids a warning about a missing prototype when building with 'make W=1'. Fixes: 6acc9b43 ("bpf: Add helper to retrieve socket in BPF") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Joe Stringer <joe@wand.net.nz> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
-
Nathan Chancellor authored
Clang warns: net/netfilter/xt_quota.c:47:44: warning: 'aligned' attribute ignored when parsing type [-Wignored-attributes] BUILD_BUG_ON(sizeof(atomic64_t) != sizeof(__aligned_u64)); ^~~~~~~~~~~~~ Use 'sizeof(__u64)' instead, as the alignment doesn't affect the size of the type. Fixes: e9837e55 ("netfilter: xt_quota: fix the behavior of xt_quota module") Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
-
Ioana Ciocoi Radulescu authored
Until now, both Rx and Tx confirmation frames handled during NAPI poll were counted toward the NAPI budget. However, Tx confirmations are lighter to process than Rx frames, which can skew the amount of work actually done inside one NAPI cycle. Update the code to only count Rx frames toward the NAPI budget and set a separate threshold on how many Tx conf frames can be processed in one poll cycle. The NAPI poll routine stops when either the budget is consumed by Rx frames or when Tx confirmation frames reach this threshold. Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Fixes gcc '-Wunused-but-set-variable' warning: drivers/net/ethernet/mscc/ocelot_board.c: In function 'mscc_ocelot_probe': drivers/net/ethernet/mscc/ocelot_board.c:262:17: warning: variable 'phy_mode' set but not used [-Wunused-but-set-variable] enum phy_mode phy_mode; It never used since introduction in commit 71e32a20 ("net: mscc: ocelot: make use of SerDes PHYs for handling their configuration") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Sabrina Dubroca says: ==================== selftests: add more PMTU tests The current selftests for PMTU cover VTI tunnels, but there's nothing about the generation and handling of PMTU exceptions by intermediate routers. This series adds and improves existing helpers, then adds IPv4 and IPv6 selftests with a setup involving an intermediate router. Joint work with Stefano Brivio. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sabrina Dubroca authored
Commit d1f1b9cb ("selftests: net: Introduce first PMTU test") and follow-ups introduced some PMTU tests, but they all rely on tunneling, and, particularly, on VTI. These new tests use simple routing to exercise the generation and update of PMTU exceptions in IPv4 and IPv6. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Sabrina Dubroca authored
The mtu_parse helper introduced in commit f2c929fe ("selftests: pmtu: Factor out MTU parsing helper") can only handle "mtu 1234", but not "mtu lock 1234". Extend it, so that we can do IPv4 tests with PMTU smaller than net.ipv4.route.min_pmtu Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Stefano Brivio authored
Introduce and use a function that checks PMTU values against expected values and logs error messages, to remove some clutter. Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Gustavo A. R. Silva authored
Notice that in this particular case, I replaced the "--v-- fall through --v--" comment with a proper "fall through", which is what GCC is expecting to find. This fix is part of the ongoing efforts to enabling -Wimplicit-fallthrough Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
David Ahern says: ==================== rtnetlink: Add support for rigid checking of data in dump request There are many use cases where a user wants to influence what is returned in a dump for some rtnetlink command: one is wanting data for a different namespace than the one the request is received and another is limiting the amount of data returned in the dump to a specific set of interest to userspace, reducing the cpu overhead of both kernel and userspace. Unfortunately, the kernel has historically not been strict with checking for the proper header or checking the values passed in the header. This lenient implementation has allowed iproute2 and other packages to pass any struct or data in the dump request as long as the family is the first byte. For example, ifinfomsg struct is used by iproute2 for all generic dump requests - links, addresses, routes and rules when it is really only valid for link requests. There is 1 is example where the kernel deals with the wrong struct: link dumps after VF support was added. Older iproute2 was sending rtgenmsg as the header instead of ifinfomsg so a patch was added to try and detect old userspace vs new: e5eca6d4 ("rtnetlink: fix userspace API breakage for iproute2 < v3.9.0") The latest example is Christian's patch set wanting to return addresses for a target namespace. It guesses the header struct is an ifaddrmsg and if it guesses wrong a netlink warning is generated in the kernel log on every address dump which is unacceptable. Another example where the kernel is a bit lenient is route dumps: iproute2 can send either a request with either ifinfomsg or a rtmsg as the header struct, yet the kernel always treats the header as an rtmsg (see inet_dump_fib and rtm_flags check). The header inconsistency impacts the ability to add kernel side filters for route dumps - a necessary feature for scale setups with 100k+ routes. How to resolve the problem of not breaking old userspace yet be able to move forward with new features such as kernel side filtering which are crucial for efficient operation at high scale? This patch set addresses the problem by adding a new socket flag, NETLINK_DUMP_STRICT_CHK, that userspace can use with setsockopt to request strict checking of headers and attributes on dump requests and hence unlock the ability to use kernel side filters as they are added. Kernel side, the dump handlers are updated to verify the message contains at least the expected header struct: RTM_GETLINK: ifinfomsg RTM_GETADDR: ifaddrmsg RTM_GETMULTICAST: ifaddrmsg RTM_GETANYCAST: ifaddrmsg RTM_GETADDRLABEL: ifaddrlblmsg RTM_GETROUTE: rtmsg RTM_GETSTATS: if_stats_msg RTM_GETNEIGH: ndmsg RTM_GETNEIGHTBL: ndtmsg RTM_GETNSID: rtgenmsg RTM_GETRULE: fib_rule_hdr RTM_GETNETCONF: netconfmsg RTM_GETMDB: br_port_msg And then every field in the header struct should be 0 with the exception of the family. There are a few exceptions to this rule where the kernel already influences the data returned by values in the struct. Next the message should not contain attributes unless the kernel implements filtering for it. Any unexpected data causes the dump to fail with EINVAL. If the new flag is honored by the kernel and the dump contents adjusted by any data passed in the request, the dump handler can set the NLM_F_DUMP_FILTERED flag in the netlink message header. For old userspace on new kernel there is no impact as all checks are wrapped in a check on the new strict flag. For new userspace on old kernel, the data in the headers and any appended attributes are silently ignored though the setsockopt failing is the clue to userspace the feature is not supported. New userspace on new kernel gets the requested data dump. iproute2 patches can be found here: https://github.com/dsahern/iproute2 dump-enhancements Major changes since v1 - inner header is supposed to be 4-bytes aligned. So for dumps that should not have attributes appended changed the check to use: if (nlmsg_attrlen(nlh, sizeof(hdr))) Only impacts patches with headers that are not multiples of 4-bytes (rtgenmsg, netconfmsg), but applied the change to all patches not calling nlmsg_parse for consistency. - Added nlmsg_parse_strict and nla_parse_strict for tighter control on attribute parsing. There should be no unknown attribute types or extra bytes. - Moved validation to a helper in most cases Changes since rfc-v2 - dropped the NLM_F_DUMP_FILTERED flag from target nsid dumps per Jiri's objections - changed the opt-in uapi from a netlink message flag to a socket flag. setsockopt provides an api for userspace to definitively know if the kernel supports strict checking on dumps. - re-ordered patches to peel off the extack on dumps if needed to keep this set size within limits - misc cleanups in patches based on testing ==================== Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Update rtnl_fdb_dump for strict data checking. If the flag is set, the dump request is expected to have an ndmsg struct as the header potentially followed by one or more attributes. Any data passed in the header or as an attribute is taken as a request to influence the data returned. Only values supported by the dump handler are allowed to be non-0 or set in the request. At the moment only the NDA_IFINDEX and NDA_MASTER attributes are supported. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Christian Brauner <christian@brauner.io> Signed-off-by: David S. Miller <davem@davemloft.net>
-