Commits · b2309a71c1f2fc841feb184195b2e46b2e139bf4 · Kirill Smelkov / linux

09 Feb, 2022 3 commits

net: add dev->dev_registered_tracker · b2309a71

Eric Dumazet authored Feb 07, 2022

Convert one dev_hold()/dev_put() pair in register_netdevice()
and unregister_netdevice_many() to dev_hold_track()
and dev_put_track().

This would allow to detect a rogue dev_put() a bit earlier.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220207184107.1401096-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b2309a71

et131x: support arbitrary MAX_SKB_FRAGS · 99f5a5f2

Eric Dumazet authored Feb 07, 2022

This NIC does not support TSO, it is very unlikely it would
have to send packets with many fragments.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220208004855.1887345-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

99f5a5f2

Merge branch 'iwl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux · a501ab3f

Jakub Kicinski authored Feb 08, 2022

Nguyen, Anthony L says:

====================
iwl-next Intel Wired LAN Driver Updates 2022-02-07

Dave adds support for ice driver to provide DSCP QoS mappings to irdma
driver.

[1] https://lore.kernel.org/netdev/20220202191921.1638-1-shiraz.saleem@intel.com/

* 'iwl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux:
  ice: add support for DSCP QoS for IDC
====================

Link: https://lore.kernel.org/r/20220207235921.1303522-1-anthony.l.nguyen@intel.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

a501ab3f

08 Feb, 2022 7 commits

Merge branch 'inet-separate-dscp-from-ecn-bits-using-new-dscp_t-type' · c3e676b9

Jakub Kicinski authored Feb 07, 2022

Guillaume Nault says:

====================
inet: Separate DSCP from ECN bits using new dscp_t type

The networking stack currently doesn't clearly distinguish between DSCP
and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or
structure fields), and each part of the stack handles them in their own
way, using different macros. This has created several bugs in the past
and some uncommon code paths are still unfixed.

Such bugs generally manifest by selecting invalid routes because of ECN
bits interfering with FIB routes and rules lookups (more details in the
LPC 2021 talk[1] and in the RFC of this series[2]).

This patch series aims at preventing the introduction of such bugs (and
detecting existing ones), by introducing a dscp_t type, representing
"sanitised" DSCP values (that is, with no ECN information), as opposed
to plain u8 values that contain both DSCP and ECN information. dscp_t
makes it clear for the reader what we're working on, and Sparse can
flag invalid interactions between dscp_t and plain u8.

This series converts only a few variables and structures:

  * Patch 1 converts the tclass field of struct fib6_rule. It
    effectively forbids the use of ECN bits in the tos/dsfield option
    of ip -6 rule. Rules now match packets solely based on their DSCP
    bits, so ECN doesn't influence the result any more. This contrasts
    with the previous behaviour where all 8 bits of the Traffic Class
    field were used. It is believed that this change is acceptable as
    matching ECN bits wasn't usable for IPv4, so only IPv6-only
    deployments could be depending on it. Also the previous behaviour
    made DSCP-based ip6-rules fail for packets with both a DSCP and an
    ECN mark, which is another reason why any such deploy is unlikely.

  * Patch 2 converts the tos field of struct fib4_rule. This one too
    effectively forbids defining ECN bits, this time in ip -4 rule.
    Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
    rejected. But even when accepted, the rule would never match, as
    the packets would have their ECN bits cleared before doing the
    rule lookup.

  * Patch 3 converts the fc_tos field of struct fib_config. This is
    equivalent to patch 2, but for IPv4 routes. Routes using a
    tos/dsfield option with any ECN bit set is now rejected. Before
    this patch, they were accepted but, as with ip4 rules, these routes
    couldn't match any packet, since their ECN bits are cleared before
    the lookup.

  * Patch 4 converts the fa_tos field of struct fib_alias. This one is
    pure internal u8 to dscp_t conversion. While patches 1-3 had user
    facing consequences, this patch shouldn't have any side effect and
    is there to give an overview of what future conversion patches will
    look like. Conversions are quite mechanical, but imply some code
    churn, which is the price for the extra clarity a possibility of
    type checking.

To summarise, all the behaviour changes required for the dscp_t type
approach to work should be contained in patches 1-3. These changes are
edge cases of ip-route and ip-rule that don't currently work properly.
So they should be safe. Also, a kernel selftest is added for each of
them.

Finally, this work also paves the way for allowing the usage of the 3
high order DSCP bits in IPv4 (a few call paths already handle them, but
in general the stack clears them before IPv4 rule and route lookups).

References:
  [1] LPC 2021 talk:
        - https://linuxplumbersconf.org/event/11/contributions/943/
        - Direct link to slide deck:
            https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
  [2] RFC version of this series:
      - https://lore.kernel.org/netdev/cover.1638814614.git.gnault@redhat.com/
====================

Link: https://lore.kernel.org/r/cover.1643981839.git.gnault@redhat.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

c3e676b9

ipv4: Use dscp_t in struct fib_alias · 32ccf110

Guillaume Nault authored Feb 04, 2022

Use the new dscp_t type to replace the fa_tos field of fib_alias. This
ensures ECN bits are ignored and makes the field compatible with the
fc_dscp field of struct fib_config.

Converting old *tos variables and fields to dscp_t allows sparse to
flag incorrect uses of DSCP and ECN bits. This patch is entirely about
type annotation and shouldn't change any existing behaviour.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

32ccf110

ipv4: Reject routes specifying ECN bits in rtm_tos · f55fbb6a

Guillaume Nault authored Feb 04, 2022

Use the new dscp_t type to replace the fc_tos field of fib_config, to
ensure IPv4 routes aren't influenced by ECN bits when configured with
non-zero rtm_tos.

Before this patch, IPv4 routes specifying an rtm_tos with some of the
ECN bits set were accepted. However they wouldn't work (never match) as
IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB
lookup (although a few buggy code paths don't).

After this patch, IPv4 routes specifying an rtm_tos with any ECN bit
set is rejected.

Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted,
but treated as if it were 0.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

f55fbb6a

ipv4: Stop taking ECN bits into account in fib4-rules · 563f8e97

Guillaume Nault authored Feb 04, 2022

Use the new dscp_t type to replace the tos field of struct fib4_rule,
so that fib4-rules consistently ignore ECN bits.

Before this patch, fib4-rules did accept rules with the high order ECN
bit set (but not the low order one). Also, it relied on its callers
masking the ECN bits of ->flowi4_tos to prevent those from influencing
the result. This was brittle and a few call paths still do the lookup
without masking the ECN bits first.

After this patch fib4-rules only compare the DSCP bits. ECN can't
influence the result anymore, even if the caller didn't mask these
bits. Also, fib4-rules now must have both ECN bits cleared or they will
be rejected.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

563f8e97

ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules · a410a0cf

Guillaume Nault authored Feb 04, 2022

Define a dscp_t type and its appropriate helpers that ensure ECN bits
are not taken into account when handling DSCP.

Use this new type to replace the tclass field of struct fib6_rule, so
that fib6-rules don't get influenced by ECN bits anymore.

Before this patch, fib6-rules didn't make any distinction between the
DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield
options in iproute2) stopped working as soon a packets had at least one
of its ECN bits set (as a work around one could create four rules for
each DSCP value to match, one for each possible ECN value).

After this patch fib6-rules only compare the DSCP bits. ECN doesn't
influence the result anymore. Also, fib6-rules now must have the ECN
bits cleared or they will be rejected.
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

a410a0cf

net: stmmac: optimize locking around PTP clock reads · 642436a1

Yannick Vignon authored Feb 04, 2022

Reading the PTP clock is a simple operation requiring only 3 register
reads. Under a PREEMPT_RT kernel, protecting those reads by a spin_lock is
counter-productive: if the 2nd task preempting the 1st has a higher prio
but needs to read time as well, it will require 2 context switches, which
will pretty much always be more costly than just disabling preemption for
the duration of the reads. Moreover, with the code logic recently added
to get_systime(), disabling preemption is not even required anymore:
reads and writes just need to be protected from each other, to prevent a
clock read while the clock is being updated.

Improve the above situation by replacing the PTP spinlock by a rwlock, and
using read_lock for PTP clock reads so simultaneous reads do not block
each other.
Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com>
Link: https://lore.kernel.org/r/20220204135545.2770625-1-yannick.vignon@oss.nxp.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

642436a1

net: typhoon: include <net/vxlan.h> · d1d5bd64

Eric Dumazet authored Feb 07, 2022

We need this to get vxlan_features_check() definition.

Fixes: d2692eee ("net: typhoon: implement ndo_features_check method")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220208003502.1799728-1-eric.dumazet@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d1d5bd64

07 Feb, 2022 20 commits

net: dsa: mv88e6xxx: Unlock on error in mv88e6xxx_port_bridge_join() · ff624338

Dan Carpenter authored Feb 07, 2022

Call mv88e6xxx_reg_unlock(chip) before returning on this error path.

Fixes: 7af4a361 ("net: dsa: mv88e6xxx: Improve isolation of standalone ports")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ff624338

net: dsa: mv88e6xxx: Fix off by in one in mv88e6185_phylink_get_caps() · dde41a69

Dan Carpenter authored Feb 07, 2022

The <= ARRAY_SIZE() needs to be < ARRAY_SIZE() to prevent an out of
bounds error.

Fixes: d4ebf12b ("net: dsa: mv88e6xxx: populate supported_interfaces and mac_capabilities")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

dde41a69

net: hns3: add support for TX push mode · 87a9b2fd

Yufeng Mo authored Feb 07, 2022

For the device that supports the TX push capability, the BD can
be directly copied to the device memory. However, due to hardware
restrictions, the push mode can be used only when there are no
more than two BDs, otherwise, the doorbell mode based on device
memory is used.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

87a9b2fd

net: asix: add proper error handling of usb read errors · 920a9fa2

Pavel Skripkin authored Feb 06, 2022

Syzbot once again hit uninit value in asix driver. The problem still the
same -- asix_read_cmd() reads less bytes, than was requested by caller.

Since all read requests are performed via asix_read_cmd() let's catch
usb related error there and add __must_check notation to be sure all
callers actually check return value.

So, this patch adds sanity check inside asix_read_cmd(), that simply
checks if bytes read are not less, than was requested and adds missing
error handling of asix_read_cmd() all across the driver code.

Fixes: d9fe64e5 ("net: asix: Add in_pm parameter")
Reported-and-tested-by: syzbot+6ca9f7867b77c2d316ac@syzkaller.appspotmail.com
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

920a9fa2

r8169: factor out redundant RTL8168d PHY config functionality to rtl8168d_1_common() · b845bac8

Heiner Kallweit authored Feb 06, 2022

rtl8168d_2_hw_phy_config() shares quite some functionality with
rtl8168d_1_hw_phy_config(), so let's factor out the common part to a
new function rtl8168d_1_common(). In addition improve the code a little.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b845bac8

ip6mr: fix use-after-free in ip6mr_sk_done() · 7d9b1b57

Eric Dumazet authored Feb 06, 2022

Apparently addrconf_exit_net() is called before igmp6_net_exit()
and ndisc_net_exit() at netns dismantle time:

 net_namespace: call ip6table_mangle_net_exit()
 net_namespace: call ip6_tables_net_exit()
 net_namespace: call ipv6_sysctl_net_exit()
 net_namespace: call ioam6_net_exit()
 net_namespace: call seg6_net_exit()
 net_namespace: call ping_v6_proc_exit_net()
 net_namespace: call tcpv6_net_exit()
 ip6mr_sk_done sk=ffffa354c78a74c0
 net_namespace: call ipv6_frags_exit_net()
 net_namespace: call addrconf_exit_net()
 net_namespace: call ip6addrlbl_net_exit()
 net_namespace: call ip6_flowlabel_net_exit()
 net_namespace: call ip6_route_net_exit_late()
 net_namespace: call fib6_rules_net_exit()
 net_namespace: call xfrm6_net_exit()
 net_namespace: call fib6_net_exit()
 net_namespace: call ip6_route_net_exit()
 net_namespace: call ipv6_inetpeer_exit()
 net_namespace: call if6_proc_net_exit()
 net_namespace: call ipv6_proc_exit_net()
 net_namespace: call udplite6_proc_exit_net()
 net_namespace: call raw6_exit_net()
 net_namespace: call igmp6_net_exit()
 ip6mr_sk_done sk=ffffa35472b2a180
 ip6mr_sk_done sk=ffffa354c78a7980
 net_namespace: call ndisc_net_exit()
 ip6mr_sk_done sk=ffffa35472b2ab00
 net_namespace: call ip6mr_net_exit()
 net_namespace: call inet6_net_exit()

This was fine because ip6mr_sk_done() would not reach the point decreasing
net->ipv6.devconf_all->mc_forwarding until my patch in ip6mr_sk_done().

To fix this without changing struct pernet_operations ordering,
we can clear net->ipv6.devconf_dflt and net->ipv6.devconf_all
when they are freed from addrconf_exit_net()

BUG: KASAN: use-after-free in instrument_atomic_read include/linux/instrumented.h:71 [inline]
BUG: KASAN: use-after-free in atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
BUG: KASAN: use-after-free in ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
Read of size 4 at addr ffff88801ff08688 by task kworker/u4:4/963

CPU: 0 PID: 963 Comm: kworker/u4:4 Not tainted 5.17.0-rc2-syzkaller-00650-g5a8fb33e #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: netns cleanup_net
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
 print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
 __kasan_report mm/kasan/report.c:442 [inline]
 kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
 check_region_inline mm/kasan/generic.c:183 [inline]
 kasan_check_range+0x13d/0x180 mm/kasan/generic.c:189
 instrument_atomic_read include/linux/instrumented.h:71 [inline]
 atomic_read include/linux/atomic/atomic-instrumented.h:27 [inline]
 ip6mr_sk_done+0x11b/0x410 net/ipv6/ip6mr.c:1578
 rawv6_close+0x58/0x80 net/ipv6/raw.c:1201
 inet_release+0x12e/0x280 net/ipv4/af_inet.c:428
 inet6_release+0x4c/0x70 net/ipv6/af_inet6.c:478
 __sock_release net/socket.c:650 [inline]
 sock_release+0x87/0x1b0 net/socket.c:678
 inet_ctl_sock_destroy include/net/inet_common.h:65 [inline]
 igmp6_net_exit+0x6b/0x170 net/ipv6/mcast.c:3173
 ops_exit_list+0xb0/0x170 net/core/net_namespace.c:168
 cleanup_net+0x4ea/0xb00 net/core/net_namespace.c:600
 process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
 worker_thread+0x657/0x1110 kernel/workqueue.c:2454
 kthread+0x2e9/0x3a0 kernel/kthread.c:377
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
 </TASK>

Fixes: f2f2325e ("ip6mr: ip6mr_sk_done() can exit early in common cases")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7d9b1b57

caif: cleanup double word in comment · 0812beb7

Tom Rix authored Feb 06, 2022

Replace the second 'so' with 'free'.
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0812beb7

Merge branch 'mlxsw-dip-sip-mangling' · f485da3c

David S. Miller authored Feb 07, 2022

Ido Schimmel says:

====================
mlxsw: Add SIP and DIP mangling support

Danielle says:

On Spectrum-2 onwards, it is possible to overwrite SIP and DIP address
of an IPv4 or IPv6 packet in the ACL engine. That corresponds to pedit
munges of, respectively, ip src and ip dst fields, and likewise for ip6.
Offload these munges on the systems where they are supported.

Patchset overview:
Patch #1: introduces SIP_DIP_ACTION and its fields.
Patch #2-#3: adds the new pedit fields, and dispatches on them on
	     Spectrum-2 and above.
Patch #4 adds a selftest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

f485da3c

selftests: forwarding: Add a test for pedit munge SIP and DIP · 92ad3828

Danielle Ratson authored Feb 06, 2022

Add a test that checks that pedit adjusts source and destination
addresses of IPv4 and IPv6 packets.

Output example:

$ ./pedit_ip.sh
TEST: ping                                                          [ OK ]
TEST: ping6                                                         [ OK ]
TEST: dev swp2 ingress pedit ip src set 198.51.100.1                [ OK ]
TEST: dev swp3 egress pedit ip src set 198.51.100.1                 [ OK ]
TEST: dev swp2 ingress pedit ip dst set 198.51.100.1                [ OK ]
TEST: dev swp3 egress pedit ip dst set 198.51.100.1                 [ OK ]
TEST: dev swp2 ingress pedit ip6 src set 2001:db8:2::1              [ OK ]
TEST: dev swp3 egress pedit ip6 src set 2001:db8:2::1               [ OK ]
TEST: dev swp2 ingress pedit ip6 dst set 2001:db8:2::1              [ OK ]
TEST: dev swp3 egress pedit ip6 dst set 2001:db8:2::1               [ OK ]
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

92ad3828

mlxsw: Support FLOW_ACTION_MANGLE for SIP and DIP IPv6 addresses · 463e1ab8

Danielle Ratson authored Feb 06, 2022

Spectrum-2 supports an ACL action SIP_DIP, which allows IPv4 and IPv6
source and destination addresses change. Offload suitable mangles to
the IPv6 address change action.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

463e1ab8

mlxsw: Support FLOW_ACTION_MANGLE for SIP and DIP IPv4 addresses · d7809b62

Danielle Ratson authored Feb 06, 2022

Spectrum-2 supports an ACL action SIP_DIP, which allows IPv4 and IPv6
source and destination addresses change. Offload suitable mangles to
the IPv4 address change action.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d7809b62

mlxsw: core_acl_flex_actions: Add SIP_DIP_ACTION · e3541022

Danielle Ratson authored Feb 06, 2022

Add fields related to SIP_DIP_ACTION, which is used for changing of SIP
and DIP addresses.
Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e3541022

Merge branch 'ipv6-kfree_skb_reason' · 47c1a9a4

David S. Miller authored Feb 07, 2022

Menglong Dong says:

====================
net: use kfree_skb_reason() for ip/udp packet receive

In this series patches, kfree_skb() is replaced with kfree_skb_reason()
during ipv4 and udp4 packet receiving path, and following drop reasons
are introduced:

SKB_DROP_REASON_SOCKET_FILTER
SKB_DROP_REASON_NETFILTER_DROP
SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR
SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST
SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO
SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM

TCP is more complex, so I left it in the next series.

I just figure out how __print_symbolic() works. It doesn't base on the
array index, but searching for symbols by loop. So I'm a little afraid
it's performance.

Changes since v3:
- fix some small problems in the third patch (net: ipv4: use
  kfree_skb_reason() in ip_rcv_core()), as David Ahern said

Changes since v2:
- use SKB_DROP_REASON_PKT_TOO_SMALL for a path in ip_rcv_core()

Changes since v1:
- add document for all drop reasons, as David advised
- remove unreleated cleanup
- remove EARLY_DEMUX and IP_ROUTE_INPUT drop reason
- replace {UDP, TCP}_FILTER with SOCKET_FILTER
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

47c1a9a4

net: udp: use kfree_skb_reason() in __udp_queue_rcv_skb() · 08d4c037

Menglong Dong authored Feb 05, 2022

Replace kfree_skb() with kfree_skb_reason() in __udp_queue_rcv_skb().
Following new drop reasons are introduced:

SKB_DROP_REASON_SOCKET_RCVBUFF
SKB_DROP_REASON_PROTO_MEM
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

08d4c037

net: udp: use kfree_skb_reason() in udp_queue_rcv_one_skb() · 1379a92d

Menglong Dong authored Feb 05, 2022

Replace kfree_skb() with kfree_skb_reason() in udp_queue_rcv_one_skb().
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1379a92d

net: ipv4: use kfree_skb_reason() in ip_protocol_deliver_rcu() · 10580c47

Menglong Dong authored Feb 05, 2022

Replace kfree_skb() with kfree_skb_reason() in ip_protocol_deliver_rcu().
Following new drop reasons are introduced:

SKB_DROP_REASON_XFRM_POLICY
SKB_DROP_REASON_IP_NOPROTO
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

10580c47

net: ipv4: use kfree_skb_reason() in ip_rcv_finish_core() · c1f166d1

Menglong Dong authored Feb 05, 2022

Replace kfree_skb() with kfree_skb_reason() in ip_rcv_finish_core(),
following drop reasons are introduced:

SKB_DROP_REASON_IP_RPFILTER
SKB_DROP_REASON_UNICAST_IN_L2_MULTICAST
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c1f166d1

net: ipv4: use kfree_skb_reason() in ip_rcv_core() · 33cba429

Menglong Dong authored Feb 05, 2022

Replace kfree_skb() with kfree_skb_reason() in ip_rcv_core(). Three new
drop reasons are introduced:

SKB_DROP_REASON_OTHERHOST
SKB_DROP_REASON_IP_CSUM
SKB_DROP_REASON_IP_INHDR
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

33cba429

net: netfilter: use kfree_drop_reason() for NF_DROP · 2df3041b

Menglong Dong authored Feb 05, 2022

Replace kfree_skb() with kfree_skb_reason() in nf_hook_slow() when
skb is dropped by reason of NF_DROP. Following new drop reasons
are introduced:

SKB_DROP_REASON_NETFILTER_DROP
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2df3041b

net: skb_drop_reason: add document for drop reasons · 88590b36

Menglong Dong authored Feb 05, 2022

Add document for following existing drop reasons:

SKB_DROP_REASON_NOT_SPECIFIED
SKB_DROP_REASON_NO_SOCKET
SKB_DROP_REASON_PKT_TOO_SMALL
SKB_DROP_REASON_TCP_CSUM
SKB_DROP_REASON_SOCKET_FILTER
SKB_DROP_REASON_UDP_CSUM
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

88590b36

06 Feb, 2022 3 commits

ref_tracker: remove filter_irq_stacks() call · c2d1e3df

Eric Dumazet authored Feb 05, 2022

After commit e9400660 ("lib/stackdepot: always do filter_irq_stacks()
in stack_depot_save()") it became unnecessary to filter the stack
before calling stack_depot_save().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2d1e3df

net: initialize init_net earlier · 9c1be193

Eric Dumazet authored Feb 05, 2022

While testing a patch that will follow later
("net: add netns refcount tracker to struct nsproxy")
I found that devtmpfs_init() was called before init_net
was initialized.

This is a bug, because devtmpfs_setup() calls
ksys_unshare(CLONE_NEWNS);

This has the effect of increasing init_net refcount,
which will be later overwritten to 1, as part of setup_net(&init_net)

We had too many prior patches [1] trying to work around the root cause.

Really, make sure init_net is in BSS section, and that net_ns_init()
is called earlier at boot time.

Note that another patch ("vfs: add netns refcount tracker
to struct fs_context") also will need net_ns_init() being called
before vfs_caches_init()

As a bonus, this patch saves around 4KB in .data section.

[1]

f8c46cb3 ("netns: do not call pernet ops for not yet set up init_net namespace")
b5082df8 ("net: Initialise init_net.count to 1")
734b6541 ("net: Statically initialize init_net.dev_base_head")

v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9c1be193

net: hsr: use hlist_head instead of list_head for mac addresses · 4acc45db

Juhee Kang authored Feb 05, 2022

Currently, HSR manages mac addresses of known HSR nodes by using list_head.
It takes a lot of time when there are a lot of registered nodes due to
finding specific mac address nodes by using linear search. We can be
reducing the time by using hlist. Thus, this patch moves list_head to
hlist_head for mac addresses and this allows for further improvement of
network performance.

    Condition: registered 10,000 known HSR nodes
    Before:
    # iperf3 -c 192.168.10.1 -i 1 -t 10
    Connecting to host 192.168.10.1, port 5201
    [  5] local 192.168.10.2 port 59442 connected to 192.168.10.1 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.49   sec  3.75 MBytes  21.1 Mbits/sec    0    158 KBytes
    [  5]   1.49-2.05   sec  1.25 MBytes  18.7 Mbits/sec    0    166 KBytes
    [  5]   2.05-3.06   sec  2.44 MBytes  20.3 Mbits/sec   56   16.9 KBytes
    [  5]   3.06-4.08   sec  1.43 MBytes  11.7 Mbits/sec   11   38.0 KBytes
    [  5]   4.08-5.00   sec   951 KBytes  8.49 Mbits/sec    0   56.3 KBytes

    After:
    # iperf3 -c 192.168.10.1 -i 1 -t 10
    Connecting to host 192.168.10.1, port 5201
    [  5] local 192.168.10.2 port 36460 connected to 192.168.10.1 port 5201
    [ ID] Interval           Transfer     Bitrate         Retr  Cwnd
    [  5]   0.00-1.00   sec  7.39 MBytes  62.0 Mbits/sec    3    130 KBytes
    [  5]   1.00-2.00   sec  5.06 MBytes  42.4 Mbits/sec   16    113 KBytes
    [  5]   2.00-3.00   sec  8.58 MBytes  72.0 Mbits/sec   42   94.3 KBytes
    [  5]   3.00-4.00   sec  7.44 MBytes  62.4 Mbits/sec    2    131 KBytes
    [  5]   4.00-5.07   sec  8.13 MBytes  63.5 Mbits/sec   38   92.9 KBytes
Signed-off-by: Juhee Kang <claudiajkang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4acc45db

05 Feb, 2022 7 commits

skmsg: convert struct sk_msg_sg::copy to a bitmap · 5a8fb33e

Eric Dumazet authored Feb 04, 2022

We have plans for increasing MAX_SKB_FRAGS, but sk_msg_sg::copy
is currently an unsigned long, limiting MAX_SKB_FRAGS to 30 on 32bit arches.

Convert it to a bitmap, as Jakub suggested.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5a8fb33e

net: typhoon: implement ndo_features_check method · d2692eee

Eric Dumazet authored Feb 04, 2022

Instead of disabling TSO at compile time if MAX_SKB_FRAGS > 32,
implement ndo_features_check() method for this driver for
a more dynamic handling.

If skb has more than 32 frags and is a GSO packet, force
software segmentation.

Most locally generated packets will use a small number
of fragments anyway.

For forwarding workloads, we can limit gro_max_size at ingress,
we might also implement gro_max_segs if needed.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d2692eee

net: sundance: Replace one-element array with non-array object · 5f215513

Gustavo A. R. Silva authored Feb 04, 2022

It seems this one-element array is not actually being used as an
array of variable size, so we can just replace it with just a
non-array object of type struct desc_frag and refactor a bit the
rest of the code.

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

This issue was found with the help of Coccinelle and audited and fixed,
manually.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5f215513

bnx2x: Replace one-element array with flexible-array member · 76ad950c

Gustavo A. R. Silva authored Feb 04, 2022

There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use “flexible array members”[1] for these cases. The older
style of one-element or zero-length arrays should no longer be used[2].

This helps with the ongoing efforts to globally enable -Warray-bounds
and get us closer to being able to tighten the FORTIFY_SOURCE routines
on memcpy().

This issue was found with the help of Coccinelle and audited and fixed,
manually.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/79Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

76ad950c

Merge branch 'net-mana-next' · b58ef6b7

David S. Miller authored Feb 05, 2022

Haiyang Zhang says:

====================
net: mana: Add handling of CQE_RX_TRUNCATED and a cleanup

Add handling of CQE_RX_TRUNCATED and a cleanup patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

b58ef6b7

net: mana: Remove unnecessary check of cqe_type in mana_process_rx_cqe() · 68f83135

Haiyang Zhang authored Feb 04, 2022

The switch statement already ensures cqe_type == CQE_RX_OKAY at that
point.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

68f83135

net: mana: Add handling of CQE_RX_TRUNCATED · e4b76219

Haiyang Zhang authored Feb 04, 2022

The proper way to drop this kind of CQE is advancing rxq tail
without indicating the packet to the upper network layer.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e4b76219