Commits · 7d5ae47891929235c4a269b91996ab951cbf3c20 · Kirill Smelkov / linux

14 Apr, 2021 2 commits

net/mlx5: E-Switch, Skip querying SF enabled bits · 7d5ae478

Parav Pandit authored Mar 08, 2021

With VHCA events, the SF state is queried through the VHCA events. The
device no longer expects the SF bitmap in the query eswitch functions command.

Hence, remove it to simplify the code.
Signed-off-by: Parav Pandit <parav@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

7d5ae478

net/mlx5: E-Switch, let user to enable disable metadata · 7bf481d7

Parav Pandit authored Oct 30, 2020

Currently, each packet inserted into the eswitch is tagged with internal
metadata that indicates the source vport. Metadata tagging is not always
needed. Metadata insertion is needed for multi-port RoCE, failover
between representors, and stacked devices. In many other cases,
metadata enablement is not needed.

Metadata insertion slows down the packet processing rate of the E-switch
when it is in switchdev mode.

The table below shows the performance gain with metadata disabled for VXLAN
offload rules in both SMFS and DMFS steering modes on a ConnectX-5 device.

----------------------------------------------
| steering | metadata | pkt size | rx pps    |
| mode     |          |          | (million) |
----------------------------------------------
| smfs     | disabled | 128Bytes | 42        |
----------------------------------------------
| smfs     | enabled  | 128Bytes | 36        |
----------------------------------------------
| dmfs     | disabled | 128Bytes | 42        |
----------------------------------------------
| dmfs     | enabled  | 128Bytes | 36        |
----------------------------------------------

Hence, allow the user to disable metadata using a driver-specific devlink
parameter. The metadata setting of the eswitch applies only to
switchdev mode.

Example to show and disable metadata before changing eswitch mode:
$ devlink dev param show pci/0000:06:00.0 name esw_port_metadata
pci/0000:06:00.0:
  name esw_port_metadata type driver-specific
    values:
      cmode runtime value true

$ devlink dev param set pci/0000:06:00.0 \
	  name esw_port_metadata value false cmode runtime

$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
---
changelog:
v1->v2:
 - added performance numbers in commit log
 - updated commit log and documentation for switchdev mode
 - added explicit note on when user can disable metadata in
   documentation
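
As a worked check of the table in the commit message above, the relative gain
from disabling metadata can be computed directly from the reported figures.
This is only arithmetic on the published numbers; the dictionary and the
function name are illustrative and not part of the driver.

```python
# Throughput figures (million packets per second) copied from the table in
# the commit message; both steering modes report the same numbers.
RX_MPPS = {
    ("smfs", "disabled"): 42,
    ("smfs", "enabled"): 36,
    ("dmfs", "disabled"): 42,
    ("dmfs", "enabled"): 36,
}


def gain_percent(mode: str) -> float:
    """Relative rx pps gain from disabling metadata, in percent."""
    enabled = RX_MPPS[(mode, "enabled")]
    disabled = RX_MPPS[(mode, "disabled")]
    return (disabled - enabled) / enabled * 100.0


for mode in ("smfs", "dmfs"):
    print(f"{mode}: {gain_percent(mode):.1f}% more packets per second")
```

For both steering modes this works out to roughly a one-sixth improvement
(42 versus 36 million packets per second at 128-byte packets).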

7bf481d7

12 Apr, 2021 23 commits

net: ethernet: ravb: Enable optional refclk · 8ef7adc6

Adam Ford authored Apr 12, 2021

For devices that use a programmable clock for the AVB reference clock,
the driver may need to enable it.  Add code to find the optional clock
and enable it when available.
Signed-off-by: Adam Ford <aford173@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

8ef7adc6

dt-bindings: net: renesas,etheravb: Add additional clocks · 6f43735b

Adam Ford authored Apr 12, 2021

The AVB driver assumes there is an external crystal, but it could
be clocked by other means.  In order to enable a programmable
clock, it needs to be added to the clocks list and enabled in the
driver.  Since there is currently only one clock, there is no
clock-names list either.

Update bindings to add the additional optional clock, and explicitly
name both of them.
Signed-off-by: Adam Ford <aford173@gmail.com>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Rob Herring <robh@kernel.org>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6f43735b

Merge branch 'enetc-ptp' · d27139c5

David S. Miller authored Apr 12, 2021

Yangbo Lu says:

====================
enetc: support PTP Sync packet one-step timestamping

This patch set adds support for PTP Sync packet one-step timestamping.
Since the ENETC single-step register has to be configured dynamically per
packet for the correctionField offset and UDP checksum update, a
one-step timestamping packet can be sent only after the previous one
has finished transmitting on hardware. So, on TX, this patch handles
one-step timestamping packets as follows:

- Transmit the packet immediately if no other one is in transfer, or queue
  it on the skb queue if one is already in transfer.
  test_and_set_bit_lock() is used here to lock and check the state.
- Start a work item when the transfer completes on hardware, to release the
  bit lock and to send one skb from the skb queue, if there is one.

Changes for v2:
	- Rebased.
	- Fixed issues from patchwork checks.
	- netif_tx_lock for one-step timestamping packet sending.
Changes for v3:
	- Used system workqueue.
	- Set the bit lock when transmitting a one-step packet, and scheduled
	  work on completion. The worker clears the bit lock and
	  transmits one skb from the skb queue, if any, instead of looping.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d27139c5

enetc: support PTP Sync packet one-step timestamping · 7294380c

Yangbo Lu authored Apr 12, 2021

This patch adds support for PTP Sync packet one-step timestamping.
Since the ENETC single-step register has to be configured dynamically per
packet for the correctionField offset and UDP checksum update, a
one-step timestamping packet can be sent only after the previous one
has finished transmitting on hardware. So, on TX, this patch handles
one-step timestamping packets as follows:

- Transmit the packet immediately if no other one is in transfer, or queue
  it on the skb queue if one is already in transfer.
  test_and_set_bit_lock() is used here to lock and check the state.
- Start a work item when the transfer completes on hardware, to release the
  bit lock and to send one skb from the skb queue, if there is one.

The configuration for one-step timestamping on ENETC before
transmitting is:

- Set the one-step timestamping flag in the extension BD.
- Write 30 bits of the current timestamp to the tstamp field of the extension BD.
- Update the PTP Sync packet originTimestamp field with the current timestamp.
- Configure the single-step register for the correctionField offset and UDP
  checksum update.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
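
The TX handling described above (send at once if nothing is in flight,
otherwise queue, and send the next queued packet from the completion work) can
be sketched in user-space Python. This is a simplified model, not the driver
code: the class `OneStepTxQueue`, the `hw_send` callback and the `tx_complete()`
method are hypothetical names, and a boolean guarded by a `threading.Lock`
stands in for the bit lock taken with test_and_set_bit_lock().

```python
import threading
from collections import deque


class OneStepTxQueue:
    """Model of 'at most one one-step packet in flight' (illustrative only)."""

    def __init__(self, hw_send):
        self._hw_send = hw_send          # hypothetical: hands a packet to hardware
        self._in_flight = False          # stands in for the driver's bit lock
        self._guard = threading.Lock()   # makes check-and-set atomic in this model
        self._pending = deque()          # stands in for the skb queue

    def xmit(self, pkt):
        """Send now if the hardware is idle, otherwise queue the packet."""
        with self._guard:
            if self._in_flight:
                self._pending.append(pkt)
                return
            self._in_flight = True
        self._hw_send(pkt)

    def tx_complete(self):
        """Completion work: release the flag, then send one queued packet."""
        with self._guard:
            self._in_flight = False
            if not self._pending:
                return
            pkt = self._pending.popleft()
            self._in_flight = True
        self._hw_send(pkt)


sent = []
queue = OneStepTxQueue(sent.append)
for name in ("sync-1", "sync-2", "sync-3"):
    queue.xmit(name)
print(sent)            # only the first packet has reached the hardware
queue.tx_complete()
print(sent)            # the completion released exactly one queued packet
```

Each completion releases exactly one queued packet rather than draining the
queue in a loop, which matches the v3 change note in the cover letter.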

7294380c

enetc: mark TX timestamp type per skb · f768e751

Yangbo Lu authored Apr 12, 2021

Mark the TX timestamp type per skb in skb->cb[0], instead of
in a global variable shared by all skbs. This is preparation for
one-step timestamp support.

With one-step timestamping enabled, both one-step and two-step
PTP messages will be transferred. An skb queue is needed for
one-step PTP messages to make sure that sending the current
message starts only after the previous one has completed on
hardware. (The ENETC single-step register has to be dynamically
configured per message.) So, marking the TX timestamp type per
skb is required.
Signed-off-by: Yangbo Lu <yangbo.lu@nxp.com>
Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f768e751

Merge branch 'ibmvnic-errors' · 8043edee

David S. Miller authored Apr 12, 2021

Lijun Pan says:

====================
ibmvnic: improve error printing

Patch 1 prints reset reason as a string.
Patch 2 prints adapter state as a string.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8043edee

ibmvnic: print adapter state as a string · 0666ef7f

Lijun Pan authored Apr 12, 2021

Adapter states can be added or removed between versions of the
source code, so print a string instead of a number.
Signed-off-by: Lijun Pan <lijunp213@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
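
The idea of printing a name instead of a raw number can be illustrated with a
small lookup. The state names and values below are an illustrative subset
chosen for this sketch, not the driver's authoritative list, and `state_name()`
is a hypothetical helper.

```python
from enum import IntEnum


class AdapterState(IntEnum):
    """Illustrative subset of adapter states (values are assumptions)."""
    PROBING = 1
    PROBED = 2
    OPENING = 3
    OPEN = 4
    CLOSING = 5
    CLOSED = 6


def state_name(value: int) -> str:
    """Map a numeric state to its name, falling back to UNKNOWN."""
    try:
        return AdapterState(value).name
    except ValueError:
        return "UNKNOWN"


# A log line stays readable even if the numbering changes between versions.
print(f"adapter state: {state_name(4)}")
print(f"adapter state: {state_name(42)}")
```

The fallback matters because a number seen in an old log may not correspond to
any state in the version of the source being read.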

0666ef7f

ibmvnic: print reset reason as a string · caee7bf5

Lijun Pan authored Apr 12, 2021

Reset reasons can be added or removed between versions of the
source code, so print a string instead of a number.
Signed-off-by: Lijun Pan <lijunp213@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

caee7bf5

ibmvnic: clean up the remaining debugfs data structures · c82eaa40

Lijun Pan authored Apr 12, 2021

Commit e704f043 ("ibmvnic: Remove debugfs support") did not
clean up everything. Remove the remaining code.
Signed-off-by: Lijun Pan <lijunp213@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c82eaa40

Merge branch 'netns-sysctl-isolation' · 645b34a7

David S. Miller authored Apr 12, 2021

Jonathon Reinhart says:

====================
Ensuring net sysctl isolation

This patchset is the result of an audit of /proc/sys/net to prove that
it is safe to be mounted read-write in a container when a net namespace
is in use. See [1].

The first commit adds code to detect sysctls which are not netns-safe,
and can "leak" changes to other net namespaces.

My manual audit found, and the above feature confirmed, that there are
two nf_conntrack sysctls which are in fact not netns-safe.

I considered sending the latter to netfilter-devel, but I think it's
better to have both together on net-next: Adding only the former causes
undesirable warnings in the kernel log.

[1]: https://github.com/opencontainers/runc/issues/2826
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

645b34a7

netfilter: conntrack: Make global sysctls readonly in non-init netns · 2671fa4d

Jonathon Reinhart authored Apr 12, 2021

These sysctls point to global variables:
- NF_SYSCTL_CT_MAX (&nf_conntrack_max)
- NF_SYSCTL_CT_EXPECT_MAX (&nf_ct_expect_max)
- NF_SYSCTL_CT_BUCKETS (&nf_conntrack_htable_size_user)

Because their data pointers are not updated to point to per-netns
structures, they must be marked read-only in a non-init_net ns.
Otherwise, changes in any net namespace are reflected in (leaked into)
all other net namespaces. This problem has existed since the
introduction of net namespaces.

The current logic marks them read-only only if the net namespace is
owned by an unprivileged user (other than init_user_ns).

Commit d0febd81 ("netfilter: conntrack: re-visit sysctls in
unprivileged namespaces") "exposes all sysctls even if the namespace is
unpriviliged." Since we need to mark them readonly in any case, we can
forego the unprivileged user check altogether.

Fixes: d0febd81 ("netfilter: conntrack: re-visit sysctls in unprivileged namespaces")
Signed-off-by: Jonathon Reinhart <Jonathon.Reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2671fa4d

net: Ensure net namespace isolation of sysctls · 31c4d2f1

Jonathon Reinhart authored Apr 12, 2021

This adds an ensure_safe_net_sysctl() check during register_net_sysctl()
to validate that sysctl table entries for a non-init_net netns are
sufficiently isolated. To be netns-safe, an entry must adhere to at
least (and usually exactly) one of these rules:

1. It is marked read-only inside the netns.
2. Its data pointer does not point to kernel/module global data.

An entry which fails both of these checks is indicative of a bug,
whereby a child netns can affect global net sysctl values.

If such an entry is found, this code will issue a warning to the kernel
log, and force the entry to be read-only to prevent a leak.

To test, simply create a new netns:

    $ sudo ip netns add dummy

As it sits now, this patch will WARN for two sysctls which will be
addressed in a subsequent patch:
- /proc/sys/net/netfilter/nf_conntrack_max
- /proc/sys/net/netfilter/nf_conntrack_expect_max
Signed-off-by: Jonathon Reinhart <Jonathon.Reinhart@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
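
The two rules from the commit message above can be expressed as a small check.
This is a user-space sketch of the decision logic only: `SysctlEntry`, its
`data_is_global` field and `ensure_safe()` are hypothetical stand-ins, and how
the kernel decides that a pointer targets global data is not modelled here.

```python
from dataclasses import dataclass

WRITE_BITS = 0o222  # owner/group/other write permission bits


@dataclass
class SysctlEntry:
    """Hypothetical stand-in for one sysctl table entry."""
    procname: str
    mode: int             # permission bits, e.g. 0o644
    data_is_global: bool  # True if the data pointer targets kernel/module globals


def ensure_safe(entries, warnings):
    """Force unsafe entries read-only and record a warning for each."""
    for entry in entries:
        if not (entry.mode & WRITE_BITS):
            continue  # rule 1: already read-only inside the netns
        if not entry.data_is_global:
            continue  # rule 2: data pointer is not global data
        warnings.append(f"sysctl {entry.procname} is not netns-safe; forcing read-only")
        entry.mode &= ~WRITE_BITS


table = [
    SysctlEntry("per_netns_knob", 0o644, data_is_global=False),
    SysctlEntry("readonly_global", 0o444, data_is_global=True),
    SysctlEntry("leaky_global", 0o644, data_is_global=True),
]
log = []
ensure_safe(table, log)
for entry in table:
    print(f"{entry.procname}: {oct(entry.mode)}")
print(log)
```

Only the entry that fails both rules is touched: it is reported and its write
bits are cleared, while the other two entries keep their original mode.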

31c4d2f1

nfc: pn533: remove redundant assignment · a115d24a

wengjianfeng authored Apr 12, 2021

In many places, a value is first assigned to a variable and the variable
is then returned, which is redundant; return the value directly.
In the pn533_rf_field() function, rc is already returned inside the if
statement, so replace the last return rc with return 0.
Signed-off-by: wengjianfeng <wengjianfeng@yulong.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a115d24a

Merge branch 'bnxt_en-error-recovery' · 5711ffd3

David S. Miller authored Apr 12, 2021

Michael Chan says:

====================
bnxt_en: Error recovery fixes.

This series adds some fixes and enhancements to the error recovery
logic.  The health register logic is improved and we also add missing
code to free and re-create VF representors in the firmware after
error recovery.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5711ffd3

bnxt_en: Free and allocate VF-Reps during error recovery. · ac797ced

Sriharsha Basavapatna authored Apr 11, 2021

During firmware recovery, VF-Rep configuration in the firmware is lost.
Fix it by freeing and (re)allocating VF-Reps in FW at relevant points
during the error recovery process.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ac797ced

bnxt_en: Refactor __bnxt_vf_reps_destroy(). · 90f4fd02

Michael Chan authored Apr 11, 2021

Add a new helper function __bnxt_free_one_vf_rep() to free one VF rep.
We also reinitialize the VF rep fields to proper initial values so that
the function can be used without freeing the VF rep data structure.  This
will be used in subsequent patches to free and recreate VF reps after
error recovery.
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Reviewed-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

90f4fd02

bnxt_en: Refactor bnxt_vf_reps_create(). · ea2d37b2

Sriharsha Basavapatna authored Apr 11, 2021

Add a new function bnxt_alloc_vf_rep() to allocate a VF representor.
This function will be needed in subsequent patches to recreate the
VF reps after error recovery.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ea2d37b2

bnxt_en: Invalidate health register mapping at the end of probe. · 190eda1a

Vasundhara Volam authored Apr 11, 2021

After probe is successful, the interface may not be brought up in all
cases, and the health register mapping could be invalid if the firmware
undergoes a reset. Fix it by invalidating the health register at the
end of probe. It will be remapped during ifup.

Fixes: 43a440c4 ("bnxt_en: Improve the status_reliable flag in bp->fw_health.")
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

190eda1a

bnxt_en: Treat health register value 0 as valid in bnxt_try_reover_fw(). · 17e1be34

Michael Chan authored Apr 11, 2021

The retry loop in bnxt_try_recover_fw() should not abort when the
health register value is 0.  It is a valid value that indicates the
firmware is booting up.

Fixes: 861aae78 ("bnxt_en: Enhance retry of the first message to the firmware.")
Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17e1be34

net: seg6: trivial fix of a spelling mistake in comment · 0d770360

Andrea Mayer authored Apr 10, 2021

There is a comment spelling mistake "interfarence" -> "interference" in
function parse_nla_action(). Fix it.
Signed-off-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Signed-off-by: David S. Miller <davem@davemloft.net>

0d770360

net: hns3: Fix potential null pointer defererence of null ae_dev · d0494135

Colin Ian King authored Apr 09, 2021

The reset_prepare and reset_done calls have a null pointer check
on ae_dev; however, ae_dev is dereferenced before that check in the call
to hns3_is_phys_func() with the ae_dev->pdev argument. Fix this by
performing the null pointer check on ae_dev first, which short-circuits
the dereference of ae_dev in the call to hns3_is_phys_func().

Addresses-Coverity: ("Dereference before null check")
Fixes: 715c58e9 ("net: hns3: add suspend and resume pm_ops")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d0494135

net: thunderx: Fix unintentional sign extension issue · e701a258

Colin Ian King authored Apr 09, 2021

The u8 value rq->caching is promoted to a 32-bit signed int before
it is shifted left by 26 bits, and the result is then
sign-extended to a u64. If rq->caching is greater than 0x1f,
all of the upper 32 bits of the u64 end up set because of
the int sign extension. Fix this by casting the u8 value to a
u64 before the 26-bit left shift.

Addresses-Coverity: ("Unintended sign extension")
Fixes: 4863dea3 ("net: Adding support for Cavium ThunderX network controller")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
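
The effect described above can be reproduced numerically. Python integers do
not overflow, so the sketch below emulates C's behaviour with explicit masks;
it assumes a typical two's-complement target where shifting into the sign bit
of an int yields a negative value. Both function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def shift_as_int_then_widen(value_u8: int, shift: int) -> int:
    """Emulate `u64 x = u8_value << shift;` with the shift done in a 32-bit int."""
    low32 = (value_u8 << shift) & MASK32      # result truncated to 32 bits
    if low32 & 0x80000000:
        # Bit 31 is set, so the int is negative and widening to 64 bits
        # fills the upper 32 bits with ones (sign extension).
        return low32 | 0xFFFFFFFF00000000
    return low32


def widen_then_shift(value_u8: int, shift: int) -> int:
    """Emulate `u64 x = (u64)u8_value << shift;` (the fixed form)."""
    return (value_u8 << shift) & MASK64


for value in (0x1F, 0x20):
    buggy = shift_as_int_then_widen(value, 26)
    fixed = widen_then_shift(value, 26)
    print(f"value={value:#x} buggy={buggy:#018x} fixed={fixed:#018x}")
```

With a 26-bit shift, 0x1f is the largest value that leaves bit 31 clear; 0x20
sets bit 31, which is where the two forms start to differ.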

e701a258

cxgb4: Fix unintentional sign extension issues · dd2c7967

Colin Ian King authored Apr 09, 2021

The u8 values f->fs.nat_lip[] are promoted to 32-bit signed ints
before they are shifted left by 24 bits, and the result is then
sign-extended to a u64. If the top bit of a u8 is set, all of
the upper 32 bits of the u64 end up set because of the
sign extension. Fix this by casting the u8 values to u64
before the 24-bit left shift.

Addresses-Coverity: ("Unintended sign extension")
Fixes: 12b276fb ("cxgb4: add support to create hash filters")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dd2c7967

11 Apr, 2021 15 commits

Merge branch 'ipa-next' · 5b489fea

David S. Miller authored Apr 11, 2021

Alex Elder says:

====================
net: ipa: support two more platforms

This series adds IPA support for two more Qualcomm SoCs.

The first patch updates the DT binding to add compatible strings.

The second temporarily disables checksum offload support for IPA
version 4.5 and above.  Changes are required to the RMNet driver
to support the "inline" checksum offload used for IPA v4.5+, and
once those are present this capability will be enabled for IPA.

The third and fourth patches add configuration data for IPA versions
4.5 (used for the SDX55 SoC) and 4.11 (used for the SC7280 SoC).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

5b489fea

net: ipa: add IPA v4.11 configuration data · 927c5043

Alex Elder authored Apr 09, 2021

Add support for the SC7280 SoC, which includes IPA version 4.11.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

927c5043

net: ipa: add IPA v4.5 configuration data · fbb763e7

Alex Elder authored Apr 09, 2021

Add support for the SDX55 SoC, which includes IPA version 4.5.

Starting with IPA v4.5, a few of the memory regions have a different
number of "canary" values; update the comments where the region
identifiers are defined to accurately reflect that.

I'll note three differences in SDX55 versus the other two existing
platforms (SDM845 and SC7180):
  - SDX55 uses a 32-bit Linux kernel
  - SDX55 has four interconnects rather than three
  - SDX55 uses IPA v4.5, which uses inline checksum offload
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

fbb763e7

net: ipa: disable checksum offload for IPA v4.5+ · c88c34fc

Alex Elder authored Apr 09, 2021

Checksum offload for IPA v4.5+ is implemented differently, using
"inline" offload (which uses a common header format for both upload
and download offload).

The IPA hardware must be programmed to enable MAP checksum offload,
but the RMNet driver is responsible for interpreting checksum
metadata supplied with messages.

Currently, the RMNet driver does not support inline checksum offload.
This support is imminent, but until it is available, do not allow
newer versions of IPA to specify checksum offload for endpoints.
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c88c34fc

dt-bindings: net: qcom,ipa: add some compatible strings · c3264fee

Alex Elder authored Apr 09, 2021

Add existing supported platform "qcom,sc7180-ipa" to the set of IPA
compatible strings.  Also add newly-supported "qcom,sdx55-ipa",
"qcom,sc7280-ipa".
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

c3264fee

ehea: add missing MODULE_DEVICE_TABLE · 95291ced

Qiheng Lin authored Apr 09, 2021

This patch adds the missing MODULE_DEVICE_TABLE definition, which generates
the correct modalias for automatic loading of this driver when it is built
as an external module.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Qiheng Lin <linqiheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

95291ced

Merge branch 'veth-gro' · 23cfa4d4

David S. Miller authored Apr 11, 2021

Paolo Abeni says:

====================
veth: allow GRO even without XDP

This series allows the user-space to enable GRO/NAPI on a veth
device even without attaching an XDP program.

It does not change the default veth behavior (no NAPI, no GRO),
except that the GRO feature bit on top of this series will be
effectively off by default on veth devices. Note that currently
the GRO bit is on by default, but GRO never takes place in
absence of XDP.

On top of this series, setting the GRO feature bit enables NAPI
and allows the GRO to take place. The TSO features on the peer
device are preserved.

The main goal is improving UDP forwarding performance for
containers in a typical virtual network setup:

(container) veth -> veth peer -> bridge/ovs -> vxlan -> NIC

Enabling NAPI threaded mode, GRO, and the NETIF_F_GRO_UDP_FWD
feature on the veth peer improves UDP stream performance
with a non-empty netfilter configuration by a factor of 2, with no
measurable overhead for TCP traffic: a heuristic ensures
that TCP will not go through the additional NAPI/GRO layer.

Some self-tests are added to check the expected behavior in
the default configuration, with XDP and with plain GRO enabled.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

23cfa4d4

self-tests: add veth tests · 1c3cadbe

Paolo Abeni authored Apr 09, 2021

Add some basic veth tests that verify the expected flags and
aggregation with different setups (default, XDP, etc.).
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1c3cadbe

veth: refine napi usage · 47e550e0

Paolo Abeni authored Apr 09, 2021

After the previous patch, when enabling GRO, locally generated
TCP traffic experiences some measurable overhead, as it traverses
the GRO engine without any chance of aggregation.

This change refines the NAPI receive path admission test, to avoid
unnecessary GRO overhead in most scenarios, when GRO is enabled
on a veth peer.

Only skbs that are eligible for aggregation enter the GRO layer,
the others will go through the traditional receive path.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

47e550e0

veth: allow enabling NAPI even without XDP · d3256efd

Paolo Abeni authored Apr 09, 2021

Currently the veth device has the GRO feature bit set, even if
no GRO aggregation is possible with the default configuration,
as the veth device does not hook into the GRO engine.

Flipping the GRO feature bit from user-space is a no-op, unless
XDP is enabled. In such a scenario GRO could actually take place, but
TSO is forced to off on the peer device.

This change allows user-space to really control the GRO feature, with
no need for an XDP program.

The GRO feature bit is now cleared by default - so that there are no
user-visible behavior changes with the default configuration.

When the GRO bit is set, the per-queue NAPI instances are initialized
and registered. On xmit, when napi instances are available, we try
to use them.

Some additional checks are in place to ensure we initialize/delete NAPIs
only when needed in case of overlapping XDP and GRO configuration
changes.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d3256efd

veth: use skb_orphan_partial instead of skb_orphan · c75fb320

Paolo Abeni authored Apr 09, 2021

As described by commit 9c4c3252 ("skbuff: preserve sock
reference when scrubbing the skb."), orphaning an skb
in the TX path will cause out-of-order (OoO) delivery.

Let's use skb_orphan_partial() instead of skb_orphan(), so
that we keep the sk around for the sake of queue selection and we
still avoid the problem fixed with commit 4bf9ffa0 ("veth:
Orphan skb before GRO")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c75fb320

Merge branch 'ethtool-eeprom' · 7dc85b59

David S. Miller authored Apr 11, 2021

Moshe Shemesh says:

====================
ethtool: Extend module EEPROM dump API

Ethtool supports module EEPROM dumps via the `ethtool -m <dev>` command.
But in its current state its functionality is limited: the offset and length
parameters, which are used to specify a linear region of EEPROM
data to dump, are not enough, considering the emergence of complex module
EEPROM layouts such as CMIS 4.0.
Moreover, CMIS 4.0 extends the number of pages that may be accessible by
introducing another parameter for page addressing: banks.

Besides, module EEPROM is currently represented as a chunk of
concatenated pages, where the lower 128 bytes of all pages, except page 00h,
are omitted. Offset and length are used to address parts of this fake
linear memory. But in practice, drivers that implement the
get_module_info() and get_module_eeprom() ethtool ops still calculate
the page number and set the I2C address on their own.

This series tackles these issues by adding an ethtool op that allows
passing the page number, bank number and I2C address to the driver in
addition to the offset and length parameters, adds the corresponding
netlink infrastructure, and implements the new interface in the mlx5 driver.

This allows extending the userspace 'ethtool -m' CLI with new
parameters: page, bank and i2c. New command line format:
 ethtool -m <dev> [hex on|off] [raw on|off] [offset N] [length N] [page N] [bank N] [i2c N]

As a consequence of this series, it is possible to dump an arbitrary EEPROM
page at a time, in contrast to dumps of concatenated pages. Therefore,
offset and length change their semantics and may be used only to specify
a part of the data within a half-page boundary, whose size is currently
limited to 128 bytes.

As for drivers that support the legacy get_module_info() and
get_module_eeprom() pair, the series addresses them by implementing a
fallback mechanism. As mentioned earlier, such drivers derive a page
number from the 'global' offset, so the reverse can be done without
their involvement thanks to standardization. If the kernel netlink handler
of the 'ethtool -m' command detects that the new ethtool op is not supported
by the driver, it calculates the offset from the given page number and page
offset and calls the old ndos, if they are available.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

7dc85b59

ethtool: wire in generic SFP module access · c97a31f6

Andrew Lunn authored Apr 09, 2021

If the device has a sfp bus attached, call its
sfp_get_module_eeprom_by_page() function, otherwise use the ethtool op
for the device. This follows how the IOCTL works.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

c97a31f6

phy: sfp: add netlink SFP support to generic SFP code · d740513f

Andrew Lunn authored Apr 09, 2021

The new netlink API for reading SFP data requires a new op to be
implemented. The idea of the new netlink SFP code is that userspace is
responsible for parsing the EEPROM data and requesting pages, rather
than having the kernel decide which pages are interesting and return
them. This allows greater flexibility for newer formats.

Currently the generic SFP code only supports simple SFPs. Allow i2c
addresses 0x50 and 0x51 to be accessed; page and bank must always be
0. This interface will later be extended when, for example, QSFP support
is added.
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d740513f

ethtool: Add fallback to get_module_eeprom from netlink command · 96d971e3

Vladyslav Tarasiuk authored Apr 09, 2021

If the netlink get_module_eeprom_by_page() callback is not implemented
by the driver, try to call the old get_module_info() and get_module_eeprom()
pair. Recalculate the offset and len parameters of get_module_eeprom() using
the page number and the page sizes. Return an error if this can't be done.
Signed-off-by: Vladyslav Tarasiuk <vladyslavt@nvidia.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
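
The recalculation mentioned above can be sketched as follows. It assumes the
legacy layout described in the series cover letter (page 00h in full, followed
by the upper 128 bytes of each later page), which gives a linear offset of
page * 128 + offset. The function name, the bounds check and the use of
ValueError are assumptions of this sketch, not the kernel's code, and special
handling of the I2C address is left out.

```python
HALF_PAGE_LEN = 128  # bytes; the half-page size named in the cover letter


def fallback_params(page: int, offset: int, length: int, eeprom_len: int):
    """Translate (page, in-page offset, length) into a legacy linear request.

    eeprom_len is the total dump size the legacy get_module_info() call
    would report. Raises ValueError if the request does not fit.
    """
    linear = page * HALF_PAGE_LEN + offset
    if linear + length > eeprom_len:
        raise ValueError("requested region lies outside the legacy EEPROM dump")
    return linear, length


# Lower half of page 0 starts the dump; the upper half of page 1
# (in-page offsets 128..255) follows the 256 bytes of page 0.
print(fallback_params(page=0, offset=0, length=128, eeprom_len=512))
print(fallback_params(page=1, offset=128, length=128, eeprom_len=512))
```

A request for a page that the legacy dump does not cover fails in this sketch,
which corresponds to the "return an error if this can't be done" case.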

96d971e3