Commits · f24711fddc36aa7286af724393ef7334b92c5702 · Kirill Smelkov / linux

15 Nov, 2019 40 commits

net: mscc: ocelot: export a constant for the tag length in bytes · f24711fd

Vladimir Oltean authored Nov 14, 2019

This constant will be used in a future patch to increase the MTU on NPI
ports, and will also be used in the tagger driver for Felix.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f24711fd

net: mscc: ocelot: create a helper for changing the port MTU · fa914e9c

Vladimir Oltean authored Nov 14, 2019

Since in an NPI/DSA setup, not all ports will have the same MTU, we need
to make sure the watermarks for pause frames and/or tail dropping logic
that existed in the driver is still coherent for the new MTU values.

We need to do this because the NPI (aka external CPU) port needs an
increased MTU for the DSA tag. This will be done in a future patch.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fa914e9c

net: mscc: ocelot: move invariant configs out of adjust_link · 5bc9d2e6

Vladimir Oltean authored Nov 14, 2019

It doesn't make sense to rewrite all these registers every time the PHY
library notifies us about a link state change.

In a future patch we will customize the MTU for the CPU port, and since
the MTU was previously configured from adjust_link, if we don't make
this change, its value would have got overridden.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5bc9d2e6

net: mscc: ocelot: filter out ocelot SoC specific PCS config from common path · dc3de2a2

Claudiu Manoil authored Nov 14, 2019

The adjust_link routine should be generic enough to be (re)used by
any SoC that integrates a switch core compatible with the Ocelot
core switch driver.  Currently all configurations are generic except
for the PCS settings that are SoC specific.  Move these out to the
Ocelot SoC/board instance.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc3de2a2

net: mscc: ocelot: move resource ioremap and regmap init to common code · 259630e0

Claudiu Manoil authored Nov 14, 2019

Let's make this ioremap and regmap init code common.  It should not
be platform dependent as it should be usable by PCI devices too.
Use better names where necessary to avoid clashes.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

259630e0

Merge branch 'net-smc-improve-termination-handling-part-3' · e7be235f

David S. Miller authored Nov 15, 2019

Karsten Graul says:

====================
net/smc: improve termination handling (part 3)

Part 3 of the SMC termination patches improves the link group
termination processing and introduces the ability to immediately
terminate a link group.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

e7be235f

net/smc: immediate termination for SMCR link groups · 0b29ec64

Ursula Braun authored Nov 14, 2019

If the SMC module is unloaded or an IB device is thrown away, the
immediate link group freeing introduced for SMCD is exploited for SMCR
as well. That means SMCR-specifics are added to smc_conn_kill().
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b29ec64

net/smc: wait for tx completions before link freeing · 6a37ad3d

Ursula Braun authored Nov 14, 2019

Make sure all pending work requests are completed before freeing
a link.
Dismiss tx pending slots already when terminating a link group to
exploit termination shortcut in tx completion queue handler.

And kill the completion queue tasklets after destroy of the
completion queues, otherwise there is a time window for another
tasklet schedule of an already killed tasklet.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a37ad3d

net/smc: abnormal termination without orderly flag · 2c1d3e50

Ursula Braun authored Nov 14, 2019

For abnormal termination issue an LLC DELETE_LINK without the
orderly flag.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c1d3e50

net/smc: no WR buffer wait for terminating link group · 15e1b99a

Ursula Braun authored Nov 14, 2019

Avoid waiting for a free work request buffer, if the link group
is already terminating.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

15e1b99a

net/smc: introduce bookkeeping of SMCD link groups · 5edd6b9c

Ursula Braun authored Nov 14, 2019

If the ism module is unloaded return control from exit routine only,
if all link groups are freed.
If an IB device is thrown away return control from device removal only,
if all link groups belonging to this device are freed.
A counters for the total number of SMCD link groups per ISM device is
introduced. ism module unloading continues only if the total number of
SMCD link groups for all ISM devices is zero. ISM device
removal continues only it the total number of SMCD link groups per ISM
device has decreased to zero.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5edd6b9c

net/smc: abnormal termination of SMCD link groups · 5421ec28

Ursula Braun authored Nov 14, 2019

A final cleanup due to SMCD device removal means immediate freeing
of all link groups belonging to this device in interrupt context.

This patch introduces a separate SMCD link group termination routine,
which terminates all link groups of an SMCD device.

This new routine smcd_terminate_all ()is reused if the smc module is
unloaded.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5421ec28

net/smc: immediate termination for SMCD link groups · 42bfba9e

Ursula Braun authored Nov 14, 2019

SMCD link group termination is called when peer signals its shutdown
of its corresponding link group. For regular shutdowns no connections
exist anymore. For abnormal shutdowns connections must be killed and
their DMBs must be unregistered immediately. That means the SMCR method
to delay the link group freeing several seconds does not fit.

This patch adds immediate termination of a link group and its SMCD
connections and makes sure all SMCD link group related cleanup steps
are finished.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

42bfba9e

net/smc: fix final cleanup sequence for SMCD devices · 50c6b20e

Ursula Braun authored Nov 14, 2019

If peer announces shutdown, use the link group terminate worker for
local cleanup of link groups and connections to terminate link group
in proper context.

Make sure link groups are cleaned up first before destroying the
event queue of the SMCD device, because link group cleanup may
raise events.

Send signal shutdown only if peer has not done it already.

Send socket abort or close only, if peer has not already announced
shutdown.
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

50c6b20e

Merge branch 'net-stmmac-CPU-Performance-Improvements' · 43da44c8

David S. Miller authored Nov 15, 2019

Jose Abreu says:

====================
net: stmmac: CPU Performance Improvements

CPU Performance improvements for stmmac. Please check bellow for results
before and after the series.

Patch 1/7, allows RX Interrupt on Completion to be disabled and only use the
RX HW Watchdog.

Patch 2/7, setups the default RX coalesce settings instead of using the
minimum value.

Patch 3/7 and 4/7, removes the uneeded computations for RX Flow Control
activation/de-activation, on some cases.

Patch 5/7, tunes-up the default coalesce settings.

Patch 6/7, re-works the TX coalesce timer activation logic.

Patch 7/7, removes the now uneeded TBU interrupt.

NetPerf UDP Results:
--------------------

Socket  Message  Elapsed      Messages                   CPU      Service
Size    Size     Time         Okay Errors   Throughput   Util     Demand
bytes   bytes    secs            #      #   10^6bits/sec % SS     us/KB
--- XGMAC@2.5G: Before
212992    1400   10.00     2100620      0     2351.7     36.69    5.112
212992           10.00     2100539            2351.6     26.18    3.648
--- XGMAC@2.5G: After
212992    1400   10.00     2108972      0     2361.5     21.73    3.015
212992           10.00     2097038            2348.1     19.21    2.666

--- GMAC5@1G: Before
212992    1400   10.00      786000      0      880.2     34.71    12.923
212992           10.00      786000             880.2     23.42    8.719
--- GMAC5@1G: After
212992    1400   10.00      842648      0      943.7     14.12    4.903
212992           10.00      842648             943.7     12.73    4.418

Perf TCP Results on RX Path:
----------------------------
--- XGMAC@2.5G: Before
22.51%  swapper          [stmmac]           [k] dwxgmac2_dma_interrupt
10.82%  swapper          [stmmac]           [k] dwxgmac2_host_mtl_irq_status
 5.21%  swapper          [stmmac]           [k] dwxgmac2_host_irq_status
 4.67%  swapper          [stmmac]           [k] dwxgmac3_safety_feat_irq_status
 3.63%  swapper          [kernel.kallsyms]  [k] stack_trace_consume_entry
 2.74%  iperf3           [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
 2.52%  swapper          [kernel.kallsyms]  [k] update_stack_state
 1.94%  ksoftirqd/0      [stmmac]           [k] dwxgmac2_dma_interrupt
 1.45%  iperf3           [kernel.kallsyms]  [k] queued_spin_lock_slowpath
 1.26%  swapper          [kernel.kallsyms]  [k] create_object
--- XGMAC@2.5G: After
 7.43%  swapper          [kernel.kallsyms]   [k] stack_trace_consume_entry
 5.86%  swapper          [stmmac]            [k] dwxgmac2_dma_interrupt
 5.68%  swapper          [kernel.kallsyms]   [k] update_stack_state
 4.71%  iperf3           [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
 2.88%  swapper          [kernel.kallsyms]   [k] create_object
 2.69%  swapper          [stmmac]            [k] dwxgmac2_host_mtl_irq_status
 2.61%  swapper          [stmmac]            [k] stmmac_napi_poll_rx
 2.52%  swapper          [kernel.kallsyms]   [k] unwind_next_frame.part.4
 1.48%  swapper          [kernel.kallsyms]   [k] unwind_get_return_address
 1.38%  swapper          [kernel.kallsyms]   [k] arch_stack_walk

--- GMAC5@1G: Before
31.29%  swapper          [stmmac]           [k] dwmac4_dma_interrupt
14.57%  swapper          [stmmac]           [k] dwmac4_irq_mtl_status
10.66%  swapper          [stmmac]           [k] dwmac4_irq_status
 1.97%  swapper          [kernel.kallsyms]  [k] stack_trace_consume_entry
 1.73%  iperf3           [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
 1.59%  swapper          [kernel.kallsyms]  [k] update_stack_state
 1.15%  iperf3           [kernel.kallsyms]  [k] do_syscall_64
 1.01%  ksoftirqd/0      [stmmac]           [k] dwmac4_dma_interrupt
 0.89%  swapper          [kernel.kallsyms]  [k] __default_send_IPI_dest_field
 0.75%  swapper          [stmmac]           [k] stmmac_napi_poll_rx
--- GMAC5@1G: After
 6.70%  swapper          [kernel.kallsyms]   [k] stack_trace_consume_entry
 5.79%  swapper          [stmmac]            [k] dwmac4_dma_interrupt
 5.29%  swapper          [kernel.kallsyms]   [k] update_stack_state
 3.52%  iperf3           [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
 2.83%  swapper          [stmmac]            [k] dwmac4_irq_mtl_status
 2.62%  swapper          [kernel.kallsyms]   [k] create_object
 2.46%  swapper          [stmmac]            [k] stmmac_napi_poll_rx
 2.32%  swapper          [kernel.kallsyms]   [k] unwind_next_frame.part.4
 2.19%  swapper          [stmmac]            [k] dwmac4_irq_status
 1.39%  swapper          [kernel.kallsyms]   [k] unwind_get_return_address
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

43da44c8

net: stmmac: xgmac: Do not enable TBU interrupt · 8d07a793

Jose Abreu authored Nov 14, 2019

Now that TX Coalesce has been rewritten we no longer need this
additional interrupt enabled. This reduces CPU usage.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8d07a793

net: stmmac: Rework TX Coalesce logic · c2837423

Jose Abreu authored Nov 14, 2019

Coalesce logic currently increments the number of packets and sets the
IC bit when the coalesced packets have passed a given limit. This does
not reflect very well what coalesce was meant for as we can have a large
number of packets that are coalesced and then a single one, sent later
on that has the IC bit.

Rework the logic so that it coalesces only upon a limit of packets and
sets the IC bit for large number of packets.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2837423

net: stmmac: Tune-up default coalesce settings · da202451

Jose Abreu authored Nov 14, 2019

Tune-up the defalt coalesce settings for optimal values. This gives the
best performance in most of the use-cases.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

da202451

net: stmmac: xgmac: Remove uneeded computation for RFA/RFD · 52f96cd1

Jose Abreu authored Nov 14, 2019

RFA and RFD should not be dependent on FIFO size. In fact, the more FIFO
space we have, the later we can activate Flow Control. Let's use
hard-coded values for RFA and RFD for all FIFO sizes with the exception
of 4k, which is a special case.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

52f96cd1

net: stmmac: gmac4+: Remove uneeded computation for RFA/RFD · 854248e5

Jose Abreu authored Nov 14, 2019

RFA and RFD should not be dependent on FIFO size. In fact, the more FIFO
space we have, the later we can activate Flow Control. Let's use
hard-coded values for RFA and RFD for all FIFO sizes with the exception
of 4k, which is a special case.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

854248e5

net: stmmac: Setup a default RX Coalesce value instead of the minimum · 4e4337cc

Jose Abreu authored Nov 14, 2019

For performance reasons, sometimes using the minimum RX Coalesce value
is not optimal. Lets setup a default value that is optimal in most of
the use cases.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4e4337cc

net: stmmac: Do not set RX IC bit if RX Coalesce is zero · 09146abe

Jose Abreu authored Nov 14, 2019

We may only want to use the RX Watchdog so lets check if RX Coalesce
settings are non-zero and only set the RX Interrupt on Completion bit if
its not.
Signed-off-by: Jose Abreu <Jose.Abreu@synopsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

09146abe

mlxsw: spectrum_router: Allocate discard adjacency entry when needed · 983db619

Ido Schimmel authored Nov 14, 2019

Commit 0c3cbbf9 ("mlxsw: Add specific trap for packets routed via
invalid nexthops") allocated an adjacency entry during driver
initialization whose purpose is to discard packets hitting the route
pointing to it.

These adjacency entries are allocated from a resource called KVD linear
(KVDL). There are situations in which the user can decide to set the
size of this resource (via devlink-resource) to 0, in which case the
driver will not be able to load.

Therefore, instead of pre-allocating this adjacency entry, simply
allocate it only when needed. A variable indicating the validity of the
entry is added and is used to ensure it is only allocated and written
once and that it is freed after all the routes were flushed.

Fixes: 0c3cbbf9 ("mlxsw: Add specific trap for packets routed via invalid nexthops")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

983db619

net/tls: Fix unused function warning · d6649d78

YueHaibing authored Nov 14, 2019

If PROC_FS is not set, gcc warning this:

net/tls/tls_proc.c:23:12: warning:
 'tls_statistics_seq_show' defined but not used [-Wunused-function]

Use #ifdef to guard this.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6649d78

Merge branch 's390-next' · a98cdaf7

David S. Miller authored Nov 14, 2019

Julian Wiedmann says:

====================
s390/qeth: updates 2019-11-14

please apply the following qeth patches to net-next.
Along with the usual cleanups, this
(1) reduces collateral packet loss in the RX path when dealing with
    bad packets and/or allocation errors, and
(2) simplifies how the L3 driver deals with mcast IP addresses.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

a98cdaf7

s390/qeth: don't check drvdata in sysfs code · 0b81c6c6

Julian Wiedmann authored Nov 14, 2019

Given the way how the sysfs attributes are registered / unregistered,
the show/store helpers will never be called with a NULL drvdata.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0b81c6c6

s390/qeth: replace qeth_l3_get_addr_buffer() · b80c08ac

Julian Wiedmann authored Nov 14, 2019

The remaining usage effectively is a kmemdup() of the query object.
By not wrapping it, some of the callers can now use GFP_KERNEL for the
allocation.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b80c08ac

s390/qeth: remove VLAN tracking for L3 devices · 8659c189

Julian Wiedmann authored Nov 14, 2019

Use vlan_for_each() instead of tracking each registered VID internally.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8659c189

s390/qeth: consolidate L3 mcast registration code · 611abe51

Julian Wiedmann authored Nov 14, 2019

Current code processes each (VLAN) device twice - once to inspect the
IPv4 mcast addresses, and then a second time to walk the IPv6 mcast
addresses. Unify all this into a single helper, thus removing some
checks and a duplicated VLAN lookup.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

611abe51

s390/qeth: remove gratuitious RX modeset · 32a186c7

Julian Wiedmann authored Nov 14, 2019

Trust the IPv4/IPv6 code to properly remove its mcast addresses when a
VLAN device is unregistered, and then also trigger an RX modeset
whenever it's needed.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

32a186c7

s390/qeth: fine-tune L3 mcast locking · ddf28100

Julian Wiedmann authored Nov 14, 2019

Push the inet6_dev locking down into the helper that actually needs it
for walking the mc_list.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ddf28100

s390/qeth: clean up error path in qeth_core_probe_device() · 8311c7a2

Julian Wiedmann authored Nov 14, 2019

qeth_core_free_card() is meant to be the counterpart of
qeth_alloc_card() - but unfortunately was also picked as the place
to free the QDIO queues.

This gets messy when qeth_core_probe_device() fails during
qeth_add_dbf_entry(). At this point the card->qdio.state is not initialized
yet, so qeth_free_qdio_queues() ends up operating on uninitialized data.

Luckily for now, the whole qeth_card struct is zero-allocated and the value
of the QETH_QDIO_UNINITIALIZED enum is 0 as well. So there's no real impact
from this bug at the moment, it's just really fragile.

Clean this up by moving the qeth_free_qdio_queues() call up one level in
the hierarchy. This way it doesn't get called from the error path.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8311c7a2

s390/qeth: handle skb allocation error gracefully · 17caeaa4

Julian Wiedmann authored Nov 14, 2019

When current code fails to allocate an skb in the RX path, it drops the
whole RX buffer. Considering the large number of packets that a single
RX buffer might contain, this is quite drastic.

Skip over the packet instead, and try to extract the next packet from
the RX buffer.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

17caeaa4

s390/qeth: drop unwanted packets earlier in RX path · 7d4faee7

Julian Wiedmann authored Nov 14, 2019

Packets with an unexpected HW format are currently first extracted from
the RX buffer, passed upwards to the layer-specific driver and only then
finally dropped.

Enhance the RX path so that we can drop such packets before even
allocating an skb. For this, add some additional logic so that when a
packet is meant to be dropped, we can still walk along the packet's data
chunks in the RX buffer. This allows us to extract the following
packet(s) from the buffer.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7d4faee7

s390/qeth: support per-frame invalidation · 5fd3fcbb

Julian Wiedmann authored Nov 14, 2019

Each RX buffer may contain up to 64KB worth of data. In case the device
needs to discard a packet _after_ already having reserved space for it
in the buffer, the whole buffer gets set to ERROR state. As the buffer
might contain any number of good packets, this can result in collateral
packet loss.

qeth can provide relief by enabling per-frame invalidation. The RX
buffer is then presented as usual, we just need to spot & drop any
individual packet that was flagged as invalid.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5fd3fcbb

s390/qeth: gather more detailed RX dropped/error statistics · 845ef904

Julian Wiedmann authored Nov 14, 2019

Where available, use the fine-grained counters in rtnl_link_stats64 to
indicate different RX error causes. For drop reasons, use driver-private
ethtool counters.

In particular this patch allows us to keep track of driver-side drops due
to unknown/unsupported HW descriptor format.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

845ef904

Merge branch 'vsock-add-multi-transports-support' · 24df31f8

David S. Miller authored Nov 14, 2019

Stefano Garzarella says:

====================
vsock: add multi-transports support

Most of the patches are reviewed by Dexuan, Stefan, and Jorgen.
The following patches need reviews:
- [11/15] vsock: add multi-transports support
- [12/15] vsock/vmci: register vmci_transport only when VMCI guest/host
          are active
- [15/15] vhost/vsock: refuse CID assigned to the guest->host transport

RFC: https://patchwork.ozlabs.org/cover/1168442/
v1: https://patchwork.ozlabs.org/cover/1181986/

v1 -> v2:
- Patch 11:
    + vmci_transport: sent reset when vsock_assign_transport() fails
      [Jorgen]
    + fixed loopback in the guests, checking if the remote_addr is the
      same of transport_g2h->get_local_cid()
    + virtio_transport_common: updated space available while creating
      the new child socket during a connection request
- Patch 12:
    + removed 'features' variable in vmci_transport_init() [Stefan]
    + added a flag to register only once the host [Jorgen]
- Added patch 15 to refuse CID assigned to the guest->host transport in
  the vhost_transport

This series adds the multi-transports support to vsock, following
this proposal: https://www.spinics.net/lists/netdev/msg575792.html

With the multi-transports support, we can use VSOCK with nested VMs
(using also different hypervisors) loading both guest->host and
host->guest transports at the same time.
Before this series, vmci_transport supported this behavior but only
using VMware hypervisor on L0, L1, etc.

The first 9 patches are cleanups and preparations, maybe some of
these can go regardless of this series.

Patch 10 changes the hvs_remote_addr_init(). setting the
VMADDR_CID_HOST as remote CID instead of VMADDR_CID_ANY to make
the choice of transport to be used work properly.

Patch 11 adds multi-transports support.

Patch 12 changes a little bit the vmci_transport and the vmci driver
to register the vmci_transport only when there are active host/guest.

Patch 13 prevents the transport modules unloading while sockets are
assigned to them.

Patch 14 fixes an issue in the bind() logic discoverable only with
the new multi-transport support.

Patch 15 refuses CID assigned to the guest->host transport in the
vhost_transport.

I've tested this series with nested KVM (vsock-transport [L0,L1],
virtio-transport[L1,L2]) and with VMware (L0) + KVM (L1)
(vmci-transport [L0,L1], vhost-transport [L1], virtio-transport[L2]).

Dexuan successfully tested the RFC series on HyperV with a Linux guest.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

24df31f8

vhost/vsock: refuse CID assigned to the guest->host transport · ed8640a9

Stefano Garzarella authored Nov 14, 2019

In a nested VM environment, we have to refuse to assign to a nested
guest the same CID assigned to our guest->host transport.
In this way, the user can use the local CID for loopback.
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed8640a9

vsock: fix bind() behaviour taking care of CID · 36c5b48b

Stefano Garzarella authored Nov 14, 2019

When we are looking for a socket bound to a specific address,
we also have to take into account the CID.

This patch is useful with multi-transports support because it
allows the binding of the same port with different CID, and
it prevents a connection to a wrong socket bound to the same
port, but with different CID.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

36c5b48b

vsock: prevent transport modules unloading · 6a2c0962

Stefano Garzarella authored Nov 14, 2019

This patch adds 'module' member in the 'struct vsock_transport'
in order to get/put the transport module. This prevents the
module unloading while sockets are assigned to it.

We increase the module refcnt when a socket is assigned to a
transport, and we decrease the module refcnt when the socket
is destructed.
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a2c0962