Commits · ce1f352162828ba07470328828a32f47aa759020 · Kirill Smelkov / linux

18 Mar, 2020 8 commits

net: ena: fix incorrect setting of the number of msix vectors · ce1f3521

Arthur Kiyanovski authored Mar 17, 2020

Overview:
We don't frequently change the msix vectors throughout the life cycle of
the driver. We do so in two functions: ena_probe() and ena_restore().
ena_probe() is only called when the driver is loaded. ena_restore() on the
other hand is called during device reset / resume operations.

We use num_io_queues for calculating and allocating the number of msix
vectors. At ena_probe() this value is equal to max_num_io_queues and thus
this is not an issue, however ena_restore() might be called after the
number of io queues has changed.

A possible bug scenario is as follows:

* Change number of queues from 8 to 4.
  (num_io_queues = 4, max_num_io_queues = 8, msix_vecs = 9,)
* Trigger reset occurs -> ena_restore is called.
  (num_io_queues = 4, max_num_io_queues =8 , msix_vecs = 5)
* Change number of queues from 4 to 6.
  (num_io_queues = 6, max_num_io_queues = 8, msix_vecs = 5)
* The driver will reset due to failure of check_for_rx_interrupt_queue()

Fix:
This can be easily fixed by always using max_num_io_queues to init the
msix_vecs, since this number won't change as opposed to num_io_queues.

Fixes: 4d192660 ("net: ena: multiple queue creation related cleanups")
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ce1f3521

net: phy: mdio-mux-bcm-iproc: check clk_prepare_enable() return value · 872307ab

Rayagonda Kokatanur authored Mar 17, 2020

Check clk_prepare_enable() return value.

Fixes: 2c723044 ("net: phy: Add pm support to Broadcom iProc mdio mux driver")
Signed-off-by: Rayagonda Kokatanur <rayagonda.kokatanur@broadcom.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

872307ab

Merge branch 'net-bcmgenet-revisit-MAC-reset' · af4e6671

David S. Miller authored Mar 17, 2020

Doug Berger says:

====================
net: bcmgenet: revisit MAC reset

Commit 3a55402c ("net: bcmgenet: use RGMII loopback for MAC
reset") was intended to resolve issues with reseting the UniMAC
core within the GENET block by providing better control over the
clocks used by the UniMAC core. Unfortunately, it is not
compatible with all of the supported system configurations so an
alternative method must be applied.

This commit set provides such an alternative. The first commit
reverts the previous change and the second commit provides the
alternative reset sequence that addresses the concerns observed
with the previous implementation.

This replacement implementation should be applied to the stable
branches wherever commit 3a55402c ("net: bcmgenet: use RGMII
loopback for MAC reset") has been applied.

Unfortunately, reverting that commit may conflict with some
restructuring changes introduced by commit 4f8d81b7 ("net:
bcmgenet: Refactor register access in bcmgenet_mii_config").
The first commit in this set has been manually edited to
resolve the conflict on net/master. I would be happy to help
stable maintainers with resolving any such conflicts if they
occur. However, I do not expect that commit to have been
backported to stable branch so hopefully the revert can be
applied cleanly.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

af4e6671

net: bcmgenet: keep MAC in reset until PHY is up · 88f6c8bf

Doug Berger authored Mar 16, 2020

As noted in commit 28c2d1a7 ("net: bcmgenet: enable loopback
during UniMAC sw_reset") the UniMAC must be clocked at least 5
cycles while the sw_reset is asserted to ensure a clean reset.

That commit enabled local loopback to provide an Rx clock from the
GENET sourced Tx clk. However, when connected in MII mode the Tx
clk is sourced by the PHY so if an EPHY is not supplying clocks
(e.g. when the link is down) the UniMAC does not receive the
necessary clocks.

This commit extends the sw_reset window until the PHY reports that
the link is up thereby ensuring that the clocks are being provided
to the MAC to produce a clean reset.

One consequence is that if the system attempts to enter a Wake on
LAN suspend state when the PHY link has not been active the MAC
may not have had a chance to initialize cleanly. In this case, we
remove the sw_reset and enable the WoL reception path as normal
with the hope that the PHY will provide the necessary clocks to
drive the WoL blocks if the link becomes active after the system
has entered suspend.

Fixes: 1c1008c7 ("net: bcmgenet: add main driver file")
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

88f6c8bf

Revert "net: bcmgenet: use RGMII loopback for MAC reset" · 612eb1c3

Doug Berger authored Mar 16, 2020

This reverts commit 3a55402c.

This is not a good solution when connecting to an external switch
that may not support the isolation of the TXC signal resulting in
output driver contention on the pin.

A different solution is necessary.
Signed-off-by: Doug Berger <opendmb@gmail.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

612eb1c3

Merge branch 'net-mvmdio-avoid-error-message-for-optional-IRQ' · d36963b8

David S. Miller authored Mar 17, 2020

Chris Packham says:

====================
net: mvmdio: avoid error message for optional IRQ

I've gone ahead an sent a revert. This is the same as the original v1 except
I've added Andrew's review to the commit message.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d36963b8

net: mvmdio: avoid error message for optional IRQ · fa2632f7

Chris Packham authored Mar 16, 2020

Per the dt-binding the interrupt is optional so use
platform_get_irq_optional() instead of platform_get_irq(). Since
commit 7723f4c5 ("driver core: platform: Add an error message to
platform_get_irq*()") platform_get_irq() produces an error message

  orion-mdio f1072004.mdio: IRQ index 0 not found

which is perfectly normal if one hasn't specified the optional property
in the device tree.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa2632f7

Revert "net: mvmdio: avoid error message for optional IRQ" · 028fd76b

Chris Packham authored Mar 16, 2020

This reverts commit e1f550dc.
platform_get_irq_optional() will still return -ENXIO when no interrupt
is provided so the additional error handling caused the driver prone to
fail when no interrupt was specified. Revert the change so we can apply
the correct minimal fix.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

028fd76b

17 Mar, 2020 1 commit

net: ip_gre: Accept IFLA_INFO_DATA-less configuration · 32ca98fe

Petr Machata authored Mar 16, 2020

The fix referenced below causes a crash when an ERSPAN tunnel is created
without passing IFLA_INFO_DATA. Fix by validating passed-in data in the
same way as ipgre does.

Fixes: e1f8f78f ("net: ip_gre: Separate ERSPAN newlink / changelink callbacks")
Reported-by: syzbot+1b4ebf4dae4e510dd219@syzkaller.appspotmail.com
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

32ca98fe

16 Mar, 2020 23 commits

net: mvneta: Fix the case where the last poll did not process all rx · 065fd83e

Jisheng Zhang authored Mar 16, 2020

For the case where the last mvneta_poll did not process all
RX packets, we need to xor the pp->cause_rx_tx or port->cause_rx_tx
before claculating the rx_queue.

Fixes: 2dcf75e2 ("net: mvneta: Associate RX queues with each CPU")
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

065fd83e

net: vxge: fix wrong __VA_ARGS__ usage · b317538c

Zheng Wei authored Mar 16, 2020

printk in macro vxge_debug_ll uses __VA_ARGS__ without "##" prefix,
it causes a build error when there is no variable
arguments(e.g. only fmt is specified.).
Signed-off-by: Zheng Wei <wei.zheng@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b317538c

Merge branch 'QorIQ-DPAA-ARM-RDBs-need-internal-delay-on-RGMII' · 83d00106

David S. Miller authored Mar 16, 2020

Madalin Bucur says:

====================
QorIQ DPAA ARM RDBs need internal delay on RGMII

v2: used phy_interface_mode_is_rgmii() to identify RGMII

The QorIQ DPAA 1 based RDB boards require internal delay on
both Tx and Rx to be set. The patch set ensures all RGMII
modes are treated correctly by the FMan driver and sets the
phy-connection-type to "rgmii-id" to restore functionality.
Previously Rx internal delay was set by board pull-ups and
was left untouched by the PHY driver. Since commit
1b3047b5 ("net: phy: realtek: add support for
configuring the RX delay on RTL8211F") the Realtek 8211F PHY
driver has control over the RGMII RX delay and it is
disabling it for other modes than RGMII_RXID and RGMII_ID.

Please note that u-boot in particular performs a fix-up of
the PHY connection type and will overwrite the values from
the Linux device tree. Another patch set was sent for u-boot
and one needs to apply that [1] to the boot loader, to ensure
this fix is complete, unless a different bootloader is used.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

83d00106

arm64: dts: ls1046ardb: set RGMII interfaces to RGMII_ID mode · d79e9d7c

Madalin Bucur authored Mar 16, 2020

The correct setting for the RGMII ports on LS1046ARDB is to
enable delay on both Rx and Tx so the interface mode used must
be PHY_INTERFACE_MODE_RGMII_ID.

Since commit 1b3047b5 ("net: phy: realtek: add support for
configuring the RX delay on RTL8211F") the Realtek 8211F PHY driver
has control over the RGMII RX delay and it is disabling it for
RGMII_TXID. The LS1046ARDB uses two such PHYs in RGMII_ID mode but
in the device tree the mode was described as "rgmii".

Changing the phy-connection-type to "rgmii-id" to address the issue.

Fixes: 3fa395d2 ("arm64: dts: add LS1046A DPAA FMan nodes")
Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d79e9d7c

arm64: dts: ls1043a-rdb: correct RGMII delay mode to rgmii-id · 4022d808

Madalin Bucur authored Mar 16, 2020

The correct setting for the RGMII ports on LS1043ARDB is to
enable delay on both Rx and Tx so the interface mode used must
be PHY_INTERFACE_MODE_RGMII_ID.

Since commit 1b3047b5 ("net: phy: realtek: add support for
configuring the RX delay on RTL8211F") the Realtek 8211F PHY driver
has control over the RGMII RX delay and it is disabling it for
RGMII_TXID. The LS1043ARDB uses two such PHYs in RGMII_ID mode but
in the device tree the mode was described as "rgmii_txid".
This issue was not apparent at the time as the PHY driver took the
same action for RGMII_TXID and RGMII_ID back then but it became
visible (RX no longer working) after the above patch.

Changing the phy-connection-type to "rgmii-id" to address the issue.

Fixes: bf02f2ff ("arm64: dts: add LS1043A DPAA FMan support")
Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4022d808

net: fsl/fman: treat all RGMII modes in memac_adjust_link() · 0fe15680

Madalin Bucur authored Mar 16, 2020

Treat all internal delay variants the same as RGMII.
Signed-off-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fe15680

Merge branch 'ethtool-fail-with-error-if-request-has-unknown-flags' · 23c39481

David S. Miller authored Mar 16, 2020

Michal Kubecek says:

====================
ethtool: fail with error if request has unknown flags

Jakub Kicinski pointed out that if unrecognized flags are set in netlink
header request, kernel shoud fail with an error rather than silently
ignore them so that we have more freedom in future flags semantics.

To help userspace with handling such errors, inform the client which
flags are supported by kernel. For that purpose, we need to allow
passing cookies as part of extack also in case of error (they can be
only passed on success now).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

23c39481

ethtool: reject unrecognized request flags · 2363d73a

Michal Kubecek authored Mar 15, 2020

As pointed out by Jakub Kicinski, we ethtool netlink code should respond
with an error if request head has flags set which are not recognized by
kernel, either as a mistake or because it expects functionality introduced
in later kernel versions.

To avoid unnecessary roundtrips, use extack cookie to provide the
information about supported request flags.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

2363d73a

netlink: add nl_set_extack_cookie_u32() · f1388ec4

Michal Kubecek authored Mar 15, 2020

Similar to existing nl_set_extack_cookie_u64(), add new helper
nl_set_extack_cookie_u32() which sets extack cookie to a u32 value.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

f1388ec4

netlink: allow extack cookie also for error messages · fe2a31d7

Michal Kubecek authored Mar 15, 2020

Commit ba0dc5f6 ("netlink: allow sending extended ACK with cookie on
success") introduced a cookie which can be sent to userspace as part of
extended ack message in the form of NLMSGERR_ATTR_COOKIE attribute.
Currently the cookie is ignored if error code is non-zero but there is
no technical reason for such limitation and it can be useful to provide
machine parseable information as part of an error message.

Include NLMSGERR_ATTR_COOKIE whenever the cookie has been set,
regardless of error code.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

fe2a31d7

net_sched: cls_route: remove the right filter from hashtable · ef299cc3

Cong Wang authored Mar 13, 2020

route4_change() allocates a new filter and copies values from
the old one. After the new filter is inserted into the hash
table, the old filter should be removed and freed, as the final
step of the update.

However, the current code mistakenly removes the new one. This
looks apparently wrong to me, and it causes double "free" and
use-after-free too, as reported by syzbot.

Reported-and-tested-by: syzbot+f9b32aaacd60305d9687@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+2f8c233f131943d6056d@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+9c2df9fd5e9445b74e01@syzkaller.appspotmail.com
Fixes: 1109c005 ("net: sched: RCU cls_route")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ef299cc3

Merge branch 'hsr-fix-several-bugs-in-generic-netlink-callback' · 4ae649e8

David S. Miller authored Mar 16, 2020

Taehee Yoo says:

====================
hsr: fix several bugs in generic netlink callback

This patchset is to fix several bugs they are related in
generic netlink callback in hsr module.

1. The first patch is to add missing rcu_read_lock() in
hsr_get_node_{list/status}().
The hsr_get_node_{list/status}() are not protected by RTNL because
they are callback functions of generic netlink.
But it calls __dev_get_by_index() without acquiring RTNL.
So, it would use unsafe data.

2. The second patch is to avoid failure of hsr_get_node_list().
hsr_get_node_list() is a callback of generic netlink and
it is used to get node information in userspace.
But, if there are so many nodes, it fails because of buffer size.
So, in this patch, restart routine is added.

3. The third patch is to set .netnsok flag to true.
If .netnsok flag is false, non-init_net namespace is not allowed to
operate generic netlink operations.
So, currently, non-init_net namespace has no way to get node information
because .netnsok is false in the current hsr code.

Change log:
v1->v2:
 - Preserve reverse christmas tree variable ordering in the second patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

4ae649e8

hsr: set .netnsok flag · 09e91dbe

Taehee Yoo authored Mar 13, 2020

The hsr module has been supporting the list and status command.
(HSR_C_GET_NODE_LIST and HSR_C_GET_NODE_STATUS)
These commands send node information to the user-space via generic netlink.
But, in the non-init_net namespace, these commands are not allowed
because .netnsok flag is false.
So, there is no way to get node information in the non-init_net namespace.

Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09e91dbe

hsr: add restart routine into hsr_get_node_list() · ca19c70f

Taehee Yoo authored Mar 13, 2020

The hsr_get_node_list() is to send node addresses to the userspace.
If there are so many nodes, it could fail because of buffer size.
In order to avoid this failure, the restart routine is added.

Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca19c70f

hsr: use rcu_read_lock() in hsr_get_node_{list/status}() · 173756b8

Taehee Yoo authored Mar 13, 2020

hsr_get_node_{list/status}() are not under rtnl_lock() because
they are callback functions of generic netlink.
But they use __dev_get_by_index() without rtnl_lock().
So, it would use unsafe data.
In order to fix it, rcu_read_lock() and dev_get_by_index_rcu()
are used instead of __dev_get_by_index().

Fixes: f421436a ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

173756b8

Merge branch 'net-Use-scnprintf-for-avoiding-potential-buffer-overflow' · dcadaec2

David S. Miller authored Mar 15, 2020

Takashi Iwai says:

====================
net: Use scnprintf() for avoiding potential buffer overflow

here is a respin of trivial patch series just to convert suspicious
snprintf() usages with the more safer one, scnprintf().

v1->v2: Align the remaining lines to the open parenthesis
        Excluded i40e patch that was already queued
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

dcadaec2

net: netdevsim: Use scnprintf() for avoiding potential buffer overflow · 2da222f6

Takashi Iwai authored Mar 15, 2020

Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit.  Fix it by replacing with scnprintf().

Cc: "David S . Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

2da222f6

net: sfc: Use scnprintf() for avoiding potential buffer overflow · 5e892880

Takashi Iwai authored Mar 15, 2020

Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit.  Fix it by replacing with scnprintf().

Cc: "David S . Miller" <davem@davemloft.net>
Cc: Edward Cree <ecree@solarflare.com>
Cc: Martin Habets <mhabets@solarflare.com>
Cc: Solarflare linux maintainers <linux-net-drivers@solarflare.com>
Cc: netdev@vger.kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e892880

net: ionic: Use scnprintf() for avoiding potential buffer overflow · 38e0f746

Takashi Iwai authored Mar 15, 2020

Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit.  Fix it by replacing with scnprintf().
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Acked-by: Shannon Nelson <snelson@pensando.io>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: oss-drivers@netronome.com
Cc: netdev@vger.kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

38e0f746

net: nfp: Use scnprintf() for avoiding potential buffer overflow · 413ae546

Takashi Iwai authored Mar 15, 2020

Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit.  Fix it by replacing with scnprintf().
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: oss-drivers@netronome.com
To: netdev@vger.kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

413ae546

net: mlx4: Use scnprintf() for avoiding potential buffer overflow · 4a348601

Takashi Iwai authored Mar 15, 2020

Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit.  Fix it by replacing with scnprintf().

Cc: "David S . Miller" <davem@davemloft.net>
Cc: Tariq Toukan <tariqt@mellanox.com>
To: netdev@vger.kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

4a348601

net: caif: Use scnprintf() for avoiding potential buffer overflow · 13bde56c

Takashi Iwai authored Mar 15, 2020

Since snprintf() returns the would-be-output size instead of the
actual output size, the succeeding calls may go beyond the given
buffer limit.  Fix it by replacing with scnprintf().

Cc: "David S . Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

13bde56c

mlxsw: reg: Increase register field length to 31 bits · cb851c01

Ido Schimmel authored Mar 15, 2020

The cited commit set a value of 2^31-1 in order to "disable" the shaper
on a given a port. However, the length of the maximum shaper rate field
was not updated from 28 bits to 31 bits, which means ports are still
limited to ~268Gbps despite supporting speeds of 400Gbps.

Fix this by increasing the field's length.

Fixes: 92afbfed ("mlxsw: reg: Increase MLXSW_REG_QEEC_MAS_DIS")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cb851c01

15 Mar, 2020 8 commits

geneve: move debug check after netdev unregister · 0fda7600

Florian Westphal authored Mar 14, 2020

The debug check must be done after unregister_netdevice_many() call --
the list_del() for this is done inside .ndo_stop.

Fixes: 2843a253 ("geneve: speedup geneve tunnels dismantle")
Reported-and-tested-by: <syzbot+68a8ed58e3d17c700de5@syzkaller.appspotmail.com>
Cc: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fda7600

net/packet: tpacket_rcv: avoid a producer race condition · 61fad681

Willem de Bruijn authored Mar 13, 2020

PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large, slots (e.g., GSO packets).

Synchronize kernel thread ownership of rx ring slots with a bitmap.

Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.

Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.

This is the one of a variety of discussed options (see Link below):

* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.

* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.

* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But, legacy applications may interpret all non-zero
tp_status as owned by the user. As libpcap does. So this is possible
only opt-in by newer processes. It can be added as an optional mode.

* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.

The additional locking cost on release adds contention, no different
than scaling on multicore or multiqueue h/w. In practice, below
reproducer nor small packet tcpdump showed a noticeable change in
perf report in cycles spent in spinlock. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.

Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.

Based on an earlier patchset by Jon Rosen. See links below.

I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.

Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.htmlSuggested-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

61fad681

net: ip_gre: Separate ERSPAN newlink / changelink callbacks · e1f8f78f

Petr Machata authored Mar 13, 2020

ERSPAN shares most of the code path with GRE and gretap code. While that
helps keep the code compact, it is also error prone. Currently a broken
userspace can turn a gretap tunnel into a de facto ERSPAN one by passing
IFLA_GRE_ERSPAN_VER. There has been a similar issue in ip6gretap in the
past.

To prevent these problems in future, split the newlink and changelink code
paths. Split the ERSPAN code out of ipgre_netlink_parms() into a new
function erspan_netlink_parms(). Extract a piece of common logic from
ipgre_newlink() and ipgre_changelink() into ipgre_newlink_encap_setup().
Add erspan_newlink() and erspan_changelink().

Fixes: 84e54fe0 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e1f8f78f

cxgb4: fix delete filter entry fail in unload path · 46ea929b

Shahjada Abul Husain authored Mar 13, 2020

Currently, the hardware TID index is assumed to start from index 0.
However, with the following changeset,

commit c2193999 ("cxgb4: add support for high priority filters")

hardware TID index can start after the high priority region, which
has introduced a regression resulting in remove filters entry
failure for cxgb4 unload path. This patch fix that.

Fixes: c2193999 ("cxgb4: add support for high priority filters")
Signed-off-by: Shahjada Abul Husain <shahjada@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

46ea929b

net: stmmac: platform: Fix misleading interrupt error msg · fc191af1

Markus Fuchs authored Mar 06, 2020

Not every stmmac based platform makes use of the eth_wake_irq or eth_lpi
interrupts. Use the platform_get_irq_byname_optional variant for these
interrupts, so no error message is displayed, if they can't be found.
Rather print an information to hint something might be wrong to assist
debugging on platforms which use these interrupts.
Signed-off-by: Markus Fuchs <mklntf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc191af1

net/bpfilter: fix dprintf usage for /dev/kmsg · 13d0f7b8

Bruno Meneguele authored Mar 12, 2020

The bpfilter UMH code was recently changed to log its informative messages to
/dev/kmsg, however this interface doesn't support SEEK_CUR yet, used by
dprintf(). As result dprintf() returns -EINVAL and doesn't log anything.

However there already had some discussions about supporting SEEK_CUR into
/dev/kmsg interface in the past it wasn't concluded. Since the only user of
that from userspace perspective inside the kernel is the bpfilter UMH
(userspace) module it's better to correct it here instead waiting a conclusion
on the interface.

Fixes: 36c4357c ("net: bpfilter: print umh messages to /dev/kmsg")
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

13d0f7b8

net_sched: keep alloc_hash updated after hash allocation · 0d1c3530

Cong Wang authored Mar 11, 2020

In commit 599be01e ("net_sched: fix an OOB access in cls_tcindex")
I moved cp->hash calculation before the first
tcindex_alloc_perfect_hash(), but cp->alloc_hash is left untouched.
This difference could lead to another out of bound access.

cp->alloc_hash should always be the size allocated, we should
update it after this tcindex_alloc_perfect_hash().

Reported-and-tested-by: syzbot+dcc34d54d68ef7d2d53d@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c72da7b9ed57cde6fca2@syzkaller.appspotmail.com
Fixes: 599be01e ("net_sched: fix an OOB access in cls_tcindex")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d1c3530

net_sched: hold rtnl lock in tcindex_partial_destroy_work() · b1be2e8c

Cong Wang authored Mar 11, 2020

syzbot reported a use-after-free in tcindex_dump(). This is due to
the lack of RTNL in the deferred rcu work. We queue this work with
RTNL in tcindex_change(), later, tcindex_dump() is called:

        fh = tp->ops->get(tp, t->tcm_handle);
	...
        err = tp->ops->change(..., &fh, ...);
        tfilter_notify(..., fh, ...);

but there is nothing to serialize the pending
tcindex_partial_destroy_work() with tcindex_dump().

Fix this by simply holding RTNL in tcindex_partial_destroy_work(),
so that it won't be called until RTNL is released after
tc_new_tfilter() is completed.

Reported-and-tested-by: syzbot+653090db2562495901dc@syzkaller.appspotmail.com
Fixes: 3d210534 ("net_sched: fix a race condition in tcindex_destroy()")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b1be2e8c