Commits · 71fa6887eeca7b631528f9c7a39815498de8028c · Kirill Smelkov / linux

03 Nov, 2022 11 commits

net: mana: Assign interrupts to CPUs based on NUMA nodes · 71fa6887

Saurabh Sengar authored Oct 31, 2022

In large VMs with multiple NUMA nodes, network performance is usually
best if network interrupts are all assigned to the same virtual NUMA
node. This patch assigns online CPU according to a numa aware policy,
local cpus are returned first, followed by non-local ones, then it wraps
around.
Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Link: https://lore.kernel.org/r/1667282761-11547-1-git-send-email-ssengar@linux.microsoft.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

71fa6887

net: fec: add initial XDP support · 6d6b39f1

Shenwei Wang authored Oct 31, 2022

This patch adds the initial XDP support to Freescale driver. It supports
XDP_PASS, XDP_DROP and XDP_REDIRECT actions. Upcoming patches will add
support for XDP_TX and Zero Copy features.

As the patch is rather large, the part of codes to collect the
statistics is separated and will prepare a dedicated patch for that
part.

I just tested with the application of xdpsock.
  -- Native here means running command of "xdpsock -i eth0"
  -- SKB-Mode means running command of "xdpsock -S -i eth0"

The following are the testing result relating to XDP mode:

root@imx8qxpc0mek:~/bpf# ./xdpsock -i eth0
 sock0@eth0:0 rxdrop xdp-drv
                   pps            pkts           1.00
rx                 371347         2717794
tx                 0              0

root@imx8qxpc0mek:~/bpf# ./xdpsock -S -i eth0
 sock0@eth0:0 rxdrop xdp-skb
                   pps            pkts           1.00
rx                 202229         404528
tx                 0              0

root@imx8qxpc0mek:~/bpf# ./xdp2 eth0
proto 0:     496708 pkt/s
proto 0:     505469 pkt/s
proto 0:     505283 pkt/s
proto 0:     505443 pkt/s
proto 0:     505465 pkt/s

root@imx8qxpc0mek:~/bpf# ./xdp2 -S eth0
proto 0:          0 pkt/s
proto 17:     118778 pkt/s
proto 17:     118989 pkt/s
proto 0:          1 pkt/s
proto 17:     118987 pkt/s
proto 0:          0 pkt/s
proto 17:     118943 pkt/s
proto 17:     118976 pkt/s
proto 0:          1 pkt/s
proto 17:     119006 pkt/s
proto 0:          0 pkt/s
proto 17:     119071 pkt/s
proto 17:     119092 pkt/s
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/20221031185350.2045675-1-shenwei.wang@nxp.comSigned-off-by: Paolo Abeni <pabeni@redhat.com>

6d6b39f1

net: tun: bump the link speed from 10Mbps to 10Gbps · 598d2982

Ilya Maximets authored Oct 31, 2022

The 10Mbps link speed was set in 2004 when the ethtool interface was
initially added to the tun driver. It might have been a good
assumption 18 years ago, but CPUs and network stack came a long way
since then.

Other virtual ports typically report much higher speeds. For example,
veth reports 10Gbps since its introduction in 2007.

Some userspace applications rely on the current link speed in
certain situations. For example, Open vSwitch is using link speed
as an upper bound for QoS configuration if user didn't specify the
maximum rate. Advertised 10Mbps doesn't match reality in a modern
world, so users have to always manually override the value with
something more sensible to avoid configuration issues, e.g. limiting
the traffic too much. This also creates additional confusion among
users.

Bump the advertised speed to at least match the veth.

Alternative might be to explicitly report UNKNOWN and let the user
decide on a right value for them. And it is indeed "the right way"
of fixing the problem. However, that may cause issues with bonding
or with some userspace applications that may rely on speed value to
be reported (even though they should not). Just changing the speed
value should be a safer option.

Users can still override the speed with ethtool, if necessary.

RFC discussion is linked below.

Link: https://lore.kernel.org/lkml/20221021114921.3705550-1-i.maximets@ovn.org/
Link: https://mail.openvswitch.org/pipermail/ovs-discuss/2022-July/051958.htmlSigned-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Link: https://lore.kernel.org/r/20221031173953.614577-1-i.maximets@ovn.orgSigned-off-by: Paolo Abeni <pabeni@redhat.com>

598d2982

Merge branch 'rocker-two-small-changes' · d3a47063

Jakub Kicinski authored Nov 02, 2022

Ido Schimmel says:

====================
rocker: Two small changes

Patch #1 avoids allocating and scheduling a work item when it is not
doing any work.

Patch #2 aligns rocker with other switchdev drivers to explicitly mark
FDB entries as offloaded. Needed for upcoming MAB offload [1].

[1] https://lore.kernel.org/netdev/20221025100024.1287157-1-idosch@nvidia.com/
====================

Link: https://lore.kernel.org/r/20221101123936.1900453-1-idosch@nvidia.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

d3a47063

rocker: Explicitly mark learned FDB entries as offloaded · 386b4174

Ido Schimmel authored Nov 01, 2022

Currently, FDB entries that are notified to the bridge driver via
'SWITCHDEV_FDB_ADD_TO_BRIDGE' are always marked as offloaded by the
bridge. With MAB enabled, this will no longer be universally true.
Device drivers will report locked FDB entries to the bridge to let it
know that the corresponding hosts required authorization, but it does
not mean that these entries are necessarily programmed in the underlying
hardware.

We would like to solve it by having the bridge driver determine the
offload indication based of the 'offloaded' bit in the FDB notification
[1].

Prepare for that change by having rocker explicitly mark learned FDB
entries as offloaded. This is consistent with all the other switchdev
drivers.

[1] https://lore.kernel.org/netdev/20221025100024.1287157-4-idosch@nvidia.com/Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

386b4174

rocker: Avoid unnecessary scheduling of work item · 42e51de9

Ido Schimmel authored Nov 01, 2022

The work item function ofdpa_port_fdb_learn_work() does not do anything
when 'OFDPA_OP_FLAG_LEARNED' is not set in the work item's flags.

Therefore, do not allocate and do not schedule the work item when the
flag is not set.
Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

42e51de9

sfc (gcc13): synchronize ef100_enqueue_skb()'s return type · 3319dbb3

Jiri Slaby (SUSE) authored Oct 31, 2022

ef100_enqueue_skb() generates a valid warning with gcc-13:
drivers/net/ethernet/sfc/ef100_tx.c:370:5: error: conflicting types for 'ef100_enqueue_skb' due to enum/integer mismatch; have 'int(struct efx_tx_queue *, struct sk_buff *)'
drivers/net/ethernet/sfc/ef100_tx.h:25:13: note: previous declaration of 'ef100_enqueue_skb' with type 'netdev_tx_t(struct efx_tx_queue *, struct sk_buff *)'

I.e. the type of the ef100_enqueue_skb()'s return value in the declaration is
int, while the definition spells enum netdev_tx_t. Synchronize them to the
latter.

Cc: Martin Liska <mliska@suse.cz>
Cc: Edward Cree <ecree.xilinx@gmail.com>
Cc: Martin Habets <habetsm.xilinx@gmail.com>
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221031114440.10461-1-jirislaby@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

3319dbb3

bonding (gcc13): synchronize bond_{a,t}lb_xmit() types · 777fa87c

Jiri Slaby (SUSE) authored Oct 31, 2022

Both bond_alb_xmit() and bond_tlb_xmit() produce a valid warning with
gcc-13:
drivers/net/bonding/bond_alb.c:1409:13: error: conflicting types for 'bond_tlb_xmit' due to enum/integer mismatch; have 'netdev_tx_t(struct sk_buff *, struct net_device *)' ...
include/net/bond_alb.h:160:5: note: previous declaration of 'bond_tlb_xmit' with type 'int(struct sk_buff *, struct net_device *)'

drivers/net/bonding/bond_alb.c:1523:13: error: conflicting types for 'bond_alb_xmit' due to enum/integer mismatch; have 'netdev_tx_t(struct sk_buff *, struct net_device *)' ...
include/net/bond_alb.h:159:5: note: previous declaration of 'bond_alb_xmit' with type 'int(struct sk_buff *, struct net_device *)'

I.e. the return type of the declaration is int, while the definitions
spell netdev_tx_t. Synchronize both of them to the latter.

Cc: Martin Liska <mliska@suse.cz>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221031114409.10417-1-jirislaby@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

777fa87c

qed (gcc13): use u16 for fid to be big enough · 7d841182

Jiri Slaby (SUSE) authored Oct 31, 2022

gcc 13 correctly reports overflow in qed_grc_dump_addr_range():
In file included from drivers/net/ethernet/qlogic/qed/qed.h:23,
                 from drivers/net/ethernet/qlogic/qed/qed_debug.c:10:
drivers/net/ethernet/qlogic/qed/qed_debug.c: In function 'qed_grc_dump_addr_range':
include/linux/qed/qed_if.h:1217:9: error: overflow in conversion from 'int' to 'u8' {aka 'unsigned char'} changes value from '(int)vf_id << 8 | 128' to '128' [-Werror=overflow]

We do:
  u8 fid;
  ...
  fid = vf_id << 8 | 128;

Since fid is 16bit (and the stored value above too), fid should be u16,
not u8. Fix that.

Cc: Martin Liska <mliska@suse.cz>
Cc: Ariel Elior <aelior@marvell.com>
Cc: Manish Chopra <manishc@marvell.com>
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Link: https://lore.kernel.org/r/20221031114354.10398-1-jirislaby@kernel.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

7d841182

net: broadcom: bcm4908_enet: report queued and transmitted bytes · 471ef777

Rafał Miłecki authored Oct 31, 2022

This allows BQL to operate avoiding buffer bloat and reducing latency.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
Link: https://lore.kernel.org/r/20221031104856.32388-1-zajec5@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

471ef777

octeontx2-af: Allow mkex profile without DMAC and add L2M/L2B header extraction support · 2cee6401

Suman Ghosh authored Oct 31, 2022

1. It is possible to have custom mkex profiles which do not extract
DMAC at all into the key. Hence allow mkex profiles which do not
have DMAC to be loaded into MCAM hardware. This patch also adds
debugging prints needed to identify profiles with wrong
configuration.

2. If a mkex profile set "l2l3mb" field for Rx interface,
then Rx multicast and broadcast entry should be configured.
Signed-off-by: Suman Ghosh <sumang@marvell.com>
Link: https://lore.kernel.org/r/20221031090856.1404303-1-sumang@marvell.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

2cee6401

02 Nov, 2022 17 commits

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next · b54a0d40

Jakub Kicinski authored Nov 02, 2022

Daniel Borkmann says:

====================
bpf-next 2022-11-02

We've added 70 non-merge commits during the last 14 day(s) which contain
a total of 96 files changed, 3203 insertions(+), 640 deletions(-).

The main changes are:

1) Make cgroup local storage available to non-cgroup attached BPF programs
   such as tc BPF ones, from Yonghong Song.

2) Avoid unnecessary deadlock detection and failures wrt BPF task storage
   helpers, from Martin KaFai Lau.

3) Add LLVM disassembler as default library for dumping JITed code
   in bpftool, from Quentin Monnet.

4) Various kprobe_multi_link fixes related to kernel modules,
   from Jiri Olsa.

5) Optimize x86-64 JIT with emitting BMI2-based shift instructions,
   from Jie Meng.

6) Improve BPF verifier's memory type compatibility for map key/value
   arguments, from Dave Marchevsky.

7) Only create mmap-able data section maps in libbpf when data is exposed
   via skeletons, from Andrii Nakryiko.

8) Add an autoattach option for bpftool to load all object assets,
   from Wang Yufen.

9) Various memory handling fixes for libbpf and BPF selftests,
   from Xu Kuohai.

10) Initial support for BPF selftest's vmtest.sh on arm64,
    from Manu Bretelle.

11) Improve libbpf's BTF handling to dedup identical structs,
    from Alan Maguire.

12) Add BPF CI and denylist documentation for BPF selftests,
    from Daniel Müller.

13) Check BPF cpumap max_entries before doing allocation work,
    from Florian Lehner.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (70 commits)
  samples/bpf: Fix typo in README
  bpf: Remove the obsolte u64_stats_fetch_*_irq() users.
  bpf: check max_entries before allocating memory
  bpf: Fix a typo in comment for DFS algorithm
  bpftool: Fix spelling mistake "disasembler" -> "disassembler"
  selftests/bpf: Fix bpftool synctypes checking failure
  selftests/bpf: Panic on hard/soft lockup
  docs/bpf: Add documentation for new cgroup local storage
  selftests/bpf: Add test cgrp_local_storage to DENYLIST.s390x
  selftests/bpf: Add selftests for new cgroup local storage
  selftests/bpf: Fix test test_libbpf_str/bpf_map_type_str
  bpftool: Support new cgroup local storage
  libbpf: Support new cgroup local storage
  bpf: Implement cgroup storage available to non-cgroup-attached bpf progs
  bpf: Refactor some inode/task/sk storage functions for reuse
  bpf: Make struct cgroup btf id global
  selftests/bpf: Tracing prog can still do lookup under busy lock
  selftests/bpf: Ensure no task storage failure for bpf_lsm.s prog due to deadlock detection
  bpf: Add new bpf_task_storage_delete proto with no deadlock detection
  bpf: bpf_task_storage_delete_recur does lookup first before the deadlock check
  ...
====================

Link: https://lore.kernel.org/r/20221102062120.5724-1-daniel@iogearbox.netSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b54a0d40

Merge branch 'renesas-eswitch' · ef2dd61a

David S. Miller authored Nov 02, 2022

Yoshihiro Shimoda says:

====================
net: ethernet: renesas: Add support for "Ethernet Switch"

This patch series is based on next-20221027.

Add initial support for Renesas "Ethernet Switch" device of R-Car S4-8.
The hardware has features about forwarding for an ethernet switch
device. But, for now, it acts as ethernet controllers so that any
forwarding offload features are not supported. So, any switchdev
header files and DSA framework are not used.

Notes that this driver requires some special settings on marvell10g,
Especially host mactype and host speed. And, I need further investigation
to modify the marvell10g driver for upstream. But, the special settings
are applied, this rswitch driver can work correcfly without any changes
of this rswitch driver. So, I believe the rswitch driver can go for
upstream.

Changes from v6:
https://lore.kernel.org/all/20221028065458.2417293-1-yoshihiro.shimoda.uh@renesas.com/
 - Add Reviewed-by tag in the patch [1/3].
 - Fix ordering of initialization because NFS root can start mounting
   the filesystem before register_netdev() even returns

Changes from v5:
https://lore.kernel.org/all/20221027134034.2343230-1-yoshihiro.shimoda.uh@renesas.com/
 - Add maxItems for the ethernet-port/port/reg property.

Changes from v4:
https://lore.kernel.org/all/20221019083518.933070-1-yoshihiro.shimoda.uh@renesas.com/
 - Rebased on next-20221027.
 - Drop some unneeded properties on the dt-bindings doc.
 - Change the subject and commit descriptions on the patch [2/3].
 - Use phylink instead of phylib.
 - Modify struct rswitch_*_desc to remove similar functions ([gs]et_dptr).

====================
Signed-off-by: David S. Miller <davem@davemloft.net>

ef2dd61a

net: ethernet: renesas: rswitch: Add R-Car Gen4 gPTP support · 6c6fa1a0

Yoshihiro Shimoda authored Oct 31, 2022

Add R-Car Gen4 gPTP support into the rswitch driver.
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6c6fa1a0

net: ethernet: renesas: Add support for "Ethernet Switch" · 3590918b

Yoshihiro Shimoda authored Oct 31, 2022

Add initial support for Renesas "Ethernet Switch" device of R-Car S4-8.
The hardware has features about forwarding for an ethernet switch
device. But, for now, it acts as ethernet controllers so that any
forwarding offload features are not supported. So, any switchdev
header files and DSA framework are not used.
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

3590918b

dt-bindings: net: renesas: Document Renesas Ethernet Switch · f9edd827

Yoshihiro Shimoda authored Oct 31, 2022

Document Renesas Etherent Switch for R-Car S4-8 (r8a779f0).
Signed-off-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

f9edd827

Merge branch 'txgbe' · d312bad4

David S. Miller authored Nov 02, 2022

Mengyuan Lou says:

====================
net: WangXun txgbe/ngbe ethernet driver

This patch series adds support for WangXun NICS, to initialize
interface from software to firmware.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d312bad4

net: ngbe: Initialize sw info and register netdev · 02338c48

Mengyuan Lou authored Oct 31, 2022

Initialize ngbe mac/phy type.
Check whether the firmware is initialized.
Initialize ngbe hw and register netdev.
Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

02338c48

net: txgbe: Add operations to interact with firmware · 049fe536

Jiawen Wu authored Oct 31, 2022

Add firmware interaction to get EEPROM information.
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

049fe536

net: libwx: Implement interaction with firmware · 1efa9bfe

Jiawen Wu authored Oct 31, 2022

Add mailbox commands to interact with firmware.
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1efa9bfe

veth: Avoid drop packets when xdp_redirect performs · 2e0de636

Heng Qi authored Oct 31, 2022

In the current processing logic, when xdp_redirect occurs, it transmits
the xdp frame based on napi.

If napi of the peer veth is not ready, the veth will drop the packets.
This doesn't meet our expectations.

In this context, we enable napi of the peer veth automatically when the
veth loads the xdp. Then if the veth unloads the xdp, we need to
correctly judge whether to disable napi of the peer veth, because the peer
veth may have loaded xdp, or even the user has enabled GRO.
Signed-off-by: Heng Qi <henqqi@linux.alibaba.com>
Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2e0de636

nfc: Add KCOV annotations · 7e8cdc97

Dmitry Vyukov authored Oct 30, 2022

Add remote KCOV annotations for NFC processing that is done
in background threads. This enables efficient coverage-guided
fuzzing of the NFC subsystem.

The intention is to add annotations to background threads that
process skb's that were allocated in syscall context
(thus have a KCOV handle associated with the current fuzz test).
This includes nci_recv_frame() that is called by the virtual nci
driver in the syscall context.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Bongsu Jeon <bongsu.jeon@samsung.com>
Cc: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

7e8cdc97

gve: Reduce alloc and copy costs in the GQ rx path · 82fd151d

Shailend Chand authored Oct 29, 2022

Previously, even if just one of the many fragments of a 9k packet
required a copy, we'd copy the whole packet into a freshly-allocated
9k-sized linear SKB, and this led to performance issues.

By having a pool of pages to copy into, each fragment can be
independently handled, leading to a reduced incidence of
allocation and copy.
Signed-off-by: Shailend Chand <shailend@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

82fd151d

net: wwan: iosm: add rpc interface for xmm modems · d08b0f8f

Shane Parslow authored Oct 29, 2022

Add a new iosm wwan port that connects to the modem rpc interface. This
interface provides a configuration channel, and in the case of the 7360, is
the only way to configure the modem (as it does not support mbim).

The new interface is compatible with existing software, such as
open_xdatachannel.py from the xmm7360-pci project [1].

[1] https://github.com/xmm7360/xmm7360-pciSigned-off-by: Shane Parslow <shaneparslow808@gmail.com>
Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d08b0f8f

net: wwan: t7xx: Add port for modem logging · 3349e4a4

M Chetan Kumar authored Oct 28, 2022

The Modem Logging (MDL) port provides an interface to collect modem
logs for debugging purposes. MDL is supported by the relay interface,
and the mtk_t7xx port infrastructure. MDL allows user-space apps to
control logging via mbim command and to collect logs via the relay
interface, while port infrastructure facilitates communication between
the driver and the modem.
Signed-off-by: Moises Veleta <moises.veleta@linux.intel.com>
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Signed-off-by: Devegowda Chandrashekar <chandrashekar.devegowda@intel.com>
Acked-by: Ricardo Martinez <ricardo.martinez@linux.intel.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3349e4a4

net: wwan: t7xx: use union to group port type specific data · fece7a8c

M Chetan Kumar authored Oct 28, 2022

Use union inside t7xx_port to group port type specific data members.
Signed-off-by: M Chetan Kumar <m.chetan.kumar@linux.intel.com>
Reviewed-by: Sergey Ryazanov <ryazanov.s.a@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fece7a8c

tcp: refine tcp_prune_ofo_queue() logic · b0e01253

Eric Dumazet authored Nov 01, 2022

After commits 36a6503f ("tcp: refine tcp_prune_ofo_queue()
to not drop all packets") and 72cd43ba
("tcp: free batches of packets in tcp_prune_ofo_queue()")
tcp_prune_ofo_queue() drops a fraction of ooo queue,
to make room for incoming packet.

However it makes no sense to drop packets that are
before the incoming packet, in sequence space.

In order to recover from packet losses faster,
it makes more sense to only drop ooo packets
which are after the incoming packet.

Tested:
packetdrill test:
   0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3800], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 0>
  +.1 < . 1:1(0) ack 1 win 1024
   +0 accept(3, ..., ...) = 4

 +.01 < . 200:300(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 200:300>

 +.01 < . 400:500(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 400:500 200:300>

 +.01 < . 600:700(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 600:700 400:500 200:300>

 +.01 < . 800:900(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 800:900 600:700 400:500 200:300>

 +.01 < . 1000:1100(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 1000:1100 800:900 600:700 400:500>

 +.01 < . 1200:1300(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>

// this packet is dropped because we have no room left.
 +.01 < . 1400:1500(100) ack 1 win 1024

 +.01 < . 1:200(199) ack 1 win 1024
// Make sure kernel did not drop 200:300 sequence
   +0 > . 1:1(0) ack 300 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
// Make room, since our RCVBUF is very small
   +0 read(4, ..., 299) = 299

 +.01 < . 300:400(100) ack 1 win 1024
   +0 > . 1:1(0) ack 500 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>

 +.01 < . 500:600(100) ack 1 win 1024
   +0 > . 1:1(0) ack 700 <nop,nop, sack 1200:1300 1000:1100 800:900>

   +0 read(4, ..., 400) = 400

 +.01 < . 700:800(100) ack 1 win 1024
   +0 > . 1:1(0) ack 900 <nop,nop, sack 1200:1300 1000:1100>

 +.01 < . 900:1000(100) ack 1 win 1024
   +0 > . 1:1(0) ack 1100 <nop,nop, sack 1200:1300>

 +.01 < . 1100:1200(100) ack 1 win 1024
// This checks that 1200:1300 has not been removed from ooo queue
   +0 > . 1:1(0) ack 1300
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20221101035234.3910189-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b0e01253

net: core: inet[46]_pton strlen len types · 44827016

Dr. David Alan Gilbert authored Oct 29, 2022

inet[46]_pton check the input length against
a sane length limit (INET[6]_ADDRSTRLEN), but
the strlen value gets truncated due to being stored in an int,
so there's a theoretical potential for a >4G string to pass
the limit test.
Use size_t since that's what strlen actually returns.

I've had a hunt for callers that could hit this, but
I've not managed to find anything that doesn't get checked with
some other limit first; but it's possible that I've missed
something in the depth of the storage target paths.
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://lore.kernel.org/r/20221029014604.114024-1-linux@treblig.orgSigned-off-by: Jakub Kicinski <kuba@kernel.org>

44827016

01 Nov, 2022 12 commits

samples/bpf: Fix typo in README · 3a07dcf8

Kang Minchul authored Oct 31, 2022

Fix 'cofiguration' typo in BPF samples README.
Signed-off-by: Kang Minchul <tegongkang@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20221030180254.34138-1-tegongkang@gmail.com

3a07dcf8

Merge branch 'inet-add-drop-monitor-support' · 6f1a298b

Jakub Kicinski authored Oct 31, 2022

Eric Dumazet says:

====================
inet: add drop monitor support

I recently tried to analyse flakes in ip_defrag selftest.
This failed miserably.

IPv4 and IPv6 reassembly units are causing false kfree_skb()
notifications. It is time to deal with this issue.

First two patches are changing core networking to better
deal with eventual skb frag_list chains, in respect
of kfree_skb/consume_skb status.

Last three patches are adding three new drop reasons,
and make sure skbs that have been reassembled into
a large datagram are no longer viewed as dropped ones.

After this, understanding why ip_defrag selftest is flaky
is possible using standard drop monitoring tools.
====================

Link: https://lore.kernel.org/r/20221029154520.2747444-1-edumazet@google.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

6f1a298b

net: dropreason: add SKB_DROP_REASON_FRAG_TOO_FAR · 3bdfb04f

Eric Dumazet authored Oct 29, 2022

IPv4 reassembly unit can decide to drop frags based on
/proc/sys/net/ipv4/ipfrag_max_dist sysctl.

Add a specific drop reason to track this specific
and weird case.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

3bdfb04f

net: dropreason: add SKB_DROP_REASON_FRAG_REASM_TIMEOUT · 77adfd3a

Eric Dumazet authored Oct 29, 2022

Used to track skbs freed after a timeout happened
in a reassmbly unit.

Passing a @reason argument to inet_frag_rbtree_purge()
allows to use correct consumed status for frags
that have been successfully re-assembled.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

77adfd3a

net: dropreason: add SKB_DROP_REASON_DUP_FRAG · 4ecbb1c2

Eric Dumazet authored Oct 29, 2022

This is used to track when a duplicate segment received by various
reassembly units is dropped.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

4ecbb1c2

net: dropreason: propagate drop_reason to skb_release_data() · 511a3eda

Eric Dumazet authored Oct 29, 2022

When an skb with a frag list is consumed, we currently
pretend all skbs in the frag list were dropped.

In order to fix this, add a @reason argument to skb_release_data()
and skb_release_all().
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

511a3eda

net: dropreason: add SKB_CONSUMED reason · 0e84afe8

Eric Dumazet authored Oct 29, 2022

This will allow to simply use in the future:

	kfree_skb_reason(skb, reason);

Instead of repeating sequences like:

	if (dropped)
	    kfree_skb_reason(skb, reason);
	else
	    consume_skb(skb);

For instance, following patch in the series is adding
@reason to skb_release_data() and skb_release_all(),
so that we can propagate a meaningful @reason whenever
consume_skb()/kfree_skb() have to take care of a potential frag_list.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

0e84afe8

net: systemport: Add support for RDMA overflow statistic counter · b98deb2f

Florian Fainelli authored Oct 28, 2022

RDMA overflows can happen if the Ethernet controller does not have
enough bandwidth allocated at the memory controller level, report RDMA
overflows and deal with saturation, similar to the RBUF overflow
counter.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20221028222141.3208429-1-f.fainelli@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

b98deb2f

net: ftmac100: allow increasing MTU to make most use of single-segment buffers · 37c84890

Sergei Antonov authored Oct 28, 2022

If the FTMAC100 is used as a DSA master, then it is expected that frames
which are MTU sized on the wire facing the external switch port (1500
octets in L2 payload, plus L2 header) also get a DSA tag when seen by
the host port.

This extra tag increases the length of the packet as the host port sees
it, and the FTMAC100 is not prepared to handle frames whose length
exceeds 1518 octets (including FCS) at all.

Only a minimal rework is needed to support this configuration. Since
MTU-sized DSA-tagged frames still fit within a single buffer (RX_BUF_SIZE),
we just need to optimize the resource management rather than implement
multi buffer RX.

In ndo_change_mtu(), we toggle the FTMAC100_MACCR_RX_FTL bit to tell the
hardware to drop (or not) frames with an L2 payload length larger than
1500. We need to replicate the MACCR configuration in ftmac100_start_hw()
as well, since there is a hardware reset there which clears previous
settings.

The advantage of dynamically changing FTMAC100_MACCR_RX_FTL is that when
dev->mtu is at the default value of 1500, large frames are automatically
dropped in hardware and we do not spend CPU cycles dropping them.
Suggested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20221028183220.155948-3-saproj@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

37c84890

net: ftmac100: report the correct maximum MTU of 1500 · 30f837b7

Vladimir Oltean authored Oct 28, 2022

The driver uses the MAX_PKT_SIZE (1518) for both MTU reporting and for
TX. However, the 2 places do not measure the same thing.

On TX, skb->len measures the entire L2 packet length (without FCS, which
software does not possess). So the comparison against 1518 there is
correct.

What is not correct is the reporting of dev->max_mtu as 1518. Since MTU
measures L2 *payload* length (excluding L2 overhead) and not total L2
packet length, it means that the correct max_mtu supported by this
device is the standard 1500. Anything higher than that will be dropped
on RX currently.

To fix this, subtract VLAN_ETH_HLEN from MAX_PKT_SIZE when reporting the
max_mtu, since that is the difference between L2 payload length and
total L2 length as seen by software.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20221028183220.155948-2-saproj@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

30f837b7

net: ftmac100: prepare data path for receiving single segment packets > 1514 · 55f6f3db

Vladimir Oltean authored Oct 28, 2022

Eliminate one check in the data path and move it elsewhere, to where our
real limitation is. We'll want to start processing "too long" frames in
the driver (currently there is a hardware MAC setting which drops
theses).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Sergei Antonov <saproj@gmail.com>
Link: https://lore.kernel.org/r/20221028183220.155948-1-saproj@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

55f6f3db

net: dsa: mv88e6xxx: Add RGMII delay to 88E6320 · 91e87045

Steffen Bätz authored Oct 28, 2022

Currently, the .port_set_rgmii_delay hook is missing for the 88E6320
family, which causes failure to retrieve an IP address via DHCP.

Add mv88e6320_port_set_rgmii_delay() that allows applying the RGMII
delay for ports 2, 5, and 6, which are the only ports that can be used
in RGMII mode.

Tested on a custom i.MX8MN board connected to an 88E6320 switch.

This change also applies safely to the 88E6321 variant.

The only difference between 88E6320 versus 88E6321 is the temperature
grade and pinout.

They share exactly the same MDIO register map for ports 2, 5, and 6,
which are the only ports that can be used in RGMII mode.
Signed-off-by: Steffen Bätz <steffen@innosonix.de>
[fabio: Improved commit log and extended it to mv88e6321_ops]
Signed-off-by: Fabio Estevam <festevam@denx.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20221028163158.198108-1-festevam@gmail.comSigned-off-by: Jakub Kicinski <kuba@kernel.org>

91e87045