Commits · 3f8c0f7efb4fcac11f31afa97584d06118c614bb · Kirill Smelkov / linux

23 Nov, 2015 1 commit

gianfar: use of_property_read_bool() · 3f8c0f7e

Saurabh Sengar authored Nov 20, 2015

use of_property_read_bool() for testing bool property
Signed-off-by: Saurabh Sengar <saurabh.truth@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3f8c0f7e

22 Nov, 2015 2 commits

bnx2x: Utilize FW 7.13.1.0. · 5e091e7a

Yuval Mintz authored Nov 22, 2015

Commit 46e8a249423ff "bnx2x: Add FW 7.13.1.0" added said .bin FW to
linux-firmware; This patch incorporates the FW in the bnx2x driver.

This introduces 2 fixes/enhancements:
 - In some management protocols there are outer-vlan configurations
that can be dynamically changed while device is running. This fixes
some corner cases where such a change did not take effect.

 - Prevent VFs from sending MAC control frames; FW would treat a VF
sending such a packet as malicious and block any further communication
done by the VF.
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Ariel Elior <Ariel.Elior@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e091e7a

net: IPv6 fib lookup tracepoint · b811580d

David Ahern authored Nov 19, 2015

Add tracepoint to show fib6 table lookups and result.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b811580d

20 Nov, 2015 26 commits

net: avoid NULL deref in napi_get_frags() · e2f9dc3b

Eric Dumazet authored Nov 19, 2015

napi_alloc_skb() can return NULL.
We should not crash should this happen.

Fixes: 93f93a44 ("net: move skb_mark_napi_id() into core networking stack")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e2f9dc3b

ravb: use clock rate as basis for GTI.TIV · b3d39a88

Simon Horman authored Nov 20, 2015

The GTI.TIV may be set to 2GHz^2 / rate, where rate is
that of the clock of the device. Rather than assuming a
rate of 130MHz use the actual rate of the clock.

The motivation for this is to use the correct rate on
the r8a7795/Salvator-X which is advertised as 133MHz but
may differ depending on the extal present on the Salvator-X.
Signed-off-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3d39a88

dl2k: Implement suspend · 1777ddb8

Ondrej Zary authored Nov 19, 2015

Add suspend/resume support to dl2k driver.
This requires RX/TX rings to be reset so split out the required
functionality from alloc_list() into new rio_reset_ring().

Tested on Asus NX1101 (IP1000A) and D-Link DGE-550T (DL-2000).
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

1777ddb8

dl2k: Reorder and cleanup initialization · 966e07f4

Ondrej Zary authored Nov 19, 2015

Move HW init and stop into separate functions.
Request IRQ only after the HW has been reset (so interrupts are
disabled and no stale interrupts are pending).
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

966e07f4

dl2k: Handle memory allocation errors in alloc_list · 39536ff8

Ondrej Zary authored Nov 19, 2015

If memory allocation fails in alloc_list(), free the already allocated
memory and return -ENOMEM. In rio_open(), call alloc_list() first and
abort if it fails. Move HW access (set RFDListPtr) out ot alloc_list().
Signed-off-by: Ondrej Zary <linux@rainbow-software.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

39536ff8

Merge branch 'tipc-cleanups-improvements' · 6b99c6d5

David S. Miller authored Nov 20, 2015

Jon Maloy says:

====================
tipc: some cleanups and improvements

This series mostly contains cleanups and cosmetic code changes.
The only real functional change is in #4 and #5, where we change the
locking structure for nodes and links in order to permit full
concurrency between links working in parallel on different interfaces.
Since the groundwork for this has been done in previous commit series,
this change constitutes only the final, small step to achieve that goal.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6b99c6d5

tipc: eliminate remnants of hungarian notation · 1a90632d

Jon Paul Maloy authored Nov 19, 2015

The number of variables with Hungarian notation (l_ptr, n_ptr etc.)
has been significantly reduced over the last couple of years.

We now root out the last traces of this practice.
There are no functional changes in this commit.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1a90632d

tipc: narrow down interface towards struct tipc_link · 38206d59

Jon Paul Maloy authored Nov 19, 2015

We move the definition of struct tipc_link from link.h to link.c in
order to minimize its exposure to the rest of the code.

When needed, we define new functions to make it possible for external
entities to access and set data in the link.

Apart from the above, there are no functional changes.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

38206d59

tipc: narrow down exposure of struct tipc_node · 5be9c086

Jon Paul Maloy authored Nov 19, 2015

In our effort to have less code and include dependencies between
entities such as node, link and bearer, we try to narrow down
the exposed interface towards the node as much as possible.

In this commit, we move the definition of struct tipc_node, along
with many of its associated function declarations, from node.h to
node.c. We also move some function definitions from link.c and
name_distr.c to node.c, since they access fields in struct tipc_node
that should not be externally visible. The moved functions are renamed
according to new location, and made static whenever possible.

There are no functional changes in this commit.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5be9c086

tipc: convert node lock to rwlock · 5405ff6e

Jon Paul Maloy authored Nov 19, 2015

According to the node FSM a node in state SELF_UP_PEER_UP cannot
change state inside a lock context, except when a TUNNEL_PROTOCOL
(SYNCH or FAILOVER) packet arrives. However, the node's individual
links may still change state.

Since each link now is protected by its own spinlock, we finally have
the conditions in place to convert the node spinlock to an rwlock_t.
If the node state and arriving packet type are rigth, we can let the
link directly receive the packet under protection of its own spinlock
and the node lock in read mode. In all other cases we use the node
lock in write mode. This enables full concurrent execution between
parallel links during steady-state traffic situations, i.e., 99+ %
of the time.

This commit implements this change.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5405ff6e

tipc: introduce per-link spinlock · 2312bf61

Jon Paul Maloy authored Nov 19, 2015

As a preparation to allow parallel links to work more independently
from each other we introduce a per-link spinlock, to be stored in the
struct nodes's link entry area. Since the node lock still is a regular
spinlock there is no increase in parallellism at this stage.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2312bf61

tipc: reduce code dependency between binding table and node layer · 1d7e1c25

Jon Paul Maloy authored Nov 19, 2015

The file name_distr.c currently contains three functions,
named_cluster_distribute(), tipc_publ_subcscribe() and
tipc_publ_unsubscribe() that all directly access fields in
struct tipc_node. We want to eliminate such dependencies, so
we move those functions to the file node.c and rename them to
tipc_node_broadcast(), tipc_node_subscribe() and tipc_node_unsubscribe()
respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1d7e1c25

tipc: small cleanup of function tipc_node_check_state() · 5c10e979

Jon Paul Maloy authored Nov 19, 2015

The function tipc_node_check_state() contains the core logics
for handling link synchronization and failover. For this reason,
it is important to keep it as comprehensible as possible.

In this commit, we make three small cleanups.

1) If the node is in state SELF_DOWN_PEER_LEAVING and the received
   packet confirms that the peer has lost contact, there will be no
   further action in this function. To make this clearer, we return
   from the function directly after the state change.

2) Since commit 0f8b8e28 ("tipc: eliminate risk of stalled
   link synchronization") only the logically first TUNNEL_PROTO/SYNCH
   packet can alter the link state and set the synch point,
   independently of arrival order. Hence, there is not any longer any
   need to adjust the synch value in case such packets arrive in
   disorder. We remove this adjustment.

3) It is the intention that any message arriving on any of the links
   may trig a check for and possible termination of a node SYNCH state.
   A redundant and unnoticed check for tipc_link_is_synching() obviously
   beats this purpose, with the effect that only packets arriving on the
   synching link may currently end the synch state. We remove this check.
   This change will further shorten the synchronization period between
   parallel links.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5c10e979

tipc: move linearization of buffers to generic code · c7cad0d6

Jon Paul Maloy authored Nov 19, 2015

In commit 5cbb28a4 ("tipc: linearize arriving NAME_DISTR
and LINK_PROTO buffers") we added linearization of NAME_DISTRIBUTOR,
LINK_PROTOCOL/RESET and LINK_PROTOCOL/ACTIVATE to the function
tipc_udp_recv(). The location of the change was selected in order
to make the commit easily appliable to 'net' and 'stable'.

We now move this linearization to where it should be done, in the
functions tipc_named_rcv() and tipc_link_proto_rcv() respectively.
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c7cad0d6

Merge branch 'bnx2x-stats' · 12ded5ca

David S. Miller authored Nov 20, 2015

Yuval Mintz says:

====================
bnx2x: Statistics patch series

This series contains 2 small statistics-related patches,
first adding a new SW statistics and the other exposing port stats
for multi-function devices.

Please consider applying this series to `net-next'.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

12ded5ca

bnx2x: Show port statistics in Multi-function · 3fb2d492

Yuval Mintz authored Nov 19, 2015

Today, port statistics are being presented when using `ethool -S' only
for single-function devices, but there are some port statistics which are
crucial for analyzing bottle-necks. E.g., HW Rx discards due to lack of
buffer space [when device isn't handling ingress traffic fast enough].

Judging the pros and cons, it was decided that in-order to better support
automatic dump-gathering tools, bnx2x should no longer hide those stats.
This leaves only VFs lacking the port statistics.
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3fb2d492

bnx2x: Add new SW stat 'tx_exhaustion_events' · 6a531198

Yuval Mintz authored Nov 19, 2015

Driver already has an internal counter for number of times a given queue
had to be stopped due to Tx ring exhaustion.
This add the counter to the statistics presented by driver, e.g., by using
`ethtool -S'.
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6a531198

Merge branch 'ppp-kill-zombie-state' · 7521cd43

David S. Miller authored Nov 20, 2015

Guillaume Nault says:

====================
ppp: Remove PPPOX_ZOMBIE socket state

Several issues have been found lately wrt. the PPPOX_ZOMBIE socket
state. This state is now only set upon reception of a PADT to stop
further transmissions. However this is redundant with the PADT
workqueue mechanism introduced by 287f3a94 ("pppoe: Use workqueue
to die properly when a PADT is received").

We can thus simplify pppox socket state handling by getting rid of
PPPOX_ZOMBIE entirely.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

7521cd43

ppp: remove PPPOX_ZOMBIE socket state · a8acce6a

Guillaume Nault authored Nov 19, 2015

PPPOX_ZOMBIE is never set anymore.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

a8acce6a

ppp: don't set sk_state to PPPOX_ZOMBIE in pppoe_disc_rcv() · 8734e485

Guillaume Nault authored Nov 19, 2015

Since 287f3a94 ("pppoe: Use workqueue to die properly when a PADT
is received"), pppoe_disc_rcv() disconnects the socket by scheduling
pppoe_unbind_sock_work(). This is enough to stop socket transmission
and makes the PPPOX_ZOMBIE state uncessary.
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

8734e485

Merge branch 'mlxsw-vlan' · bdc17fad

David S. Miller authored Nov 20, 2015

Jiri Pirko says:

====================
mlxsw: small driver update

Couple of VLAN-related patches.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

bdc17fad

mlxsw: spectrum: Add error paths to __mlxsw_sp_port_vlans_add · b07a966c

Ido Schimmel authored Nov 19, 2015

The operation of adding VLANs on a port via switchdev ops can fail and
we need to be prepared for it. If we do not rollback hardware operations
following a failure, hardware and software will remain in an
inconsistent state.

Solve that by adding suitable error paths to __mlxsw_sp_port_vlans_add.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b07a966c

mlxsw: spectrum: Unify setting of HW VLAN filters · 3b7ad5ec

Ido Schimmel authored Nov 19, 2015

When adding or deleting VLANs from a bridged port, HW VLAN filters must be
set accordingly. Instead of having the same code in both add and delete
functions, just wrap it in a function and call it with the appropriate
parameters.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Elad Raz <eladr@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b7ad5ec

mlxsw: spectrum: Use correct PVID value when removing VLANs · 06c071f6

Ido Schimmel authored Nov 19, 2015

When removing a range of VLANs in which PVID is a member we should use
the correct PVID value instead of some VLAN in the range.

Also, change two print statements to use 'dev' instead of
'mlxsw_sp_port->dev', as it's already used in other print statements in
the function.
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

06c071f6

bpf: add show_fdinfo handler for maps · f99bf205

Daniel Borkmann authored Nov 19, 2015

Add a handler for show_fdinfo() to be used by the anon-inodes
backend for eBPF maps, and dump the map specification there. Not
only useful for admins, but also it provides a minimal way to
compare specs from ELF vs pinned object.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f99bf205

net: encx24j600: move rev announcement to probe function · 7b5dc0dd

Jon Ringle authored Nov 18, 2015

When encx24j600 is open and closed many times due to userspace polling the
interface, the log gets noise with this log message.

Moving this to encx24j600_spi_probe function where it belongs.
Signed-off-by: Jon Ringle <jringle@gridpoint.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7b5dc0dd

18 Nov, 2015 11 commits

Merge branch 'net-generic-busy-polling' · 85c72ba1

David S. Miller authored Nov 18, 2015

Eric Dumazet says:

====================
net: extend busy polling support

This patch series extends busy polling range to tunnels devices,
and adds busy polling generic support to all NAPI drivers.

No need to provide ndo_busy_poll() method and extra synchronization
between ndo_busy_poll() and normal napi->poll() method.
This was proven very difficult and bug prone.

mlx5 driver is changed to support busy polling using this new method,
and a second mlx5 patch adds napi_complete_done() support and proper
SNMP accounting.

bnx2x and mlx4 drivers are converted to new infrastructure,
reducing kernel bloat and improving performance.

Latest patch, adding generic support, adds a new requirement :

 -free_netdev() and netif_napi_del() must be called from process context.

Since this might not be the case in some drivers, we might have to
either : fix the non conformant drivers (by disabling busy polling on them)
or revert this last patch.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

85c72ba1

net: provide generic busy polling to all NAPI drivers · 93d05d4a

Eric Dumazet authored Nov 18, 2015

NAPI drivers no longer need to observe a particular protocol
to benefit from busy polling (CONFIG_NET_RX_BUSY_POLL=y)

napi_hash_add() and napi_hash_del() are automatically called
from core networking stack, respectively from
netif_napi_add() and netif_napi_del()

This patch depends on free_netdev() and netif_napi_del() being
called from process context, which seems to be the norm.

Drivers might still prefer to call napi_hash_del() on their
own, since they might combine all the rcu grace periods into
a single one, knowing their NAPI structures lifetime, while
core networking stack has no idea of a possible combining.

Once this patch proves to not bring serious regressions,
we will cleanup drivers to either remove napi_hash_del()
or provide appropriate rcu grace periods combining.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

93d05d4a

net: napi_hash_del() returns a boolean status · 34cbe27e

Eric Dumazet authored Nov 18, 2015

napi_hash_del() will soon be used from both drivers (if they want)
or core networking stack.

Callers are responsibles to ensure an RCU grace period is respected
before freeing napi structure : napi_hash_del() can signal if
this RCU grace period is needed or not.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

34cbe27e

net: move napi_hash[] into read mostly section · 6180d9de

Eric Dumazet authored Nov 18, 2015

We do not often add/delete a napi context.
Moving napi_hash[] into read mostly section avoids potential false sharing.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6180d9de

net: add netif_tx_napi_add() · d64b5e85

Eric Dumazet authored Nov 18, 2015

netif_tx_napi_add() is a variant of netif_napi_add()

It should be used by drivers that use a napi structure
to exclusively poll TX.

We do not want to add this kind of napi in napi_hash[] in following
patches, adding generic busy polling to all NAPI drivers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d64b5e85

net: move skb_mark_napi_id() into core networking stack · 93f93a44

Eric Dumazet authored Nov 18, 2015

We would like to automatically provide busy polling support
to all NAPI drivers, without them having to implement anything.

skb_mark_napi_id() can be called from napi_gro_receive() and
napi_get_frags().

Few drivers are still calling skb_mark_napi_id() because
they use netif_receive_skb(). They should eventually call
napi_gro_receive() instead. I will leave this to drivers
maintainers.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

93f93a44

mlx4: remove mlx4_en_low_latency_recv() · 868fdb06

Eric Dumazet authored Nov 18, 2015

Busy polling can now be handled in generic NAPI poll infrastructure.
This removes complexity and fast path overhead :

mlx4 used two spin_lock()/spin_unlock() pair per napi->poll() call
in mlx4_en_cq_lock_napi()/mlx4_en_cq_unlock_napi()

Tested:

Without busy polling :

lpaa23:~# echo 0 >/proc/sys/net/core/busy_read
lpaa24:~# echo 0 >/proc/sys/net/core/busy_read
lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    47330.78

With busy polling :

lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    97643.55
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

868fdb06

bnx2x: remove bnx2x_low_latency_recv() support · b59768c6

Eric Dumazet authored Nov 18, 2015

Switch to native NAPI polling, as this reduces overhead and complexity.

Normal path is faster, since one cmpxchg() is not anymore requested,
and busy polling with the NAPI polling has same performance.

Tested:
lpk50:~# cat /proc/sys/net/core/busy_read
70
lpk50:~# nstat >/dev/null;./netperf -H lpk55 -t TCP_RR;nstat
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpk55.prod.google.com () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    40095.07
16384  87380
IpInReceives                    401062             0.0
IpInDelivers                    401062             0.0
IpOutRequests                   401079             0.0
TcpActiveOpens                  7                  0.0
TcpPassiveOpens                 3                  0.0
TcpAttemptFails                 3                  0.0
TcpEstabResets                  5                  0.0
TcpInSegs                       401036             0.0
TcpOutSegs                      401052             0.0
TcpOutRsts                      38                 0.0
UdpInDatagrams                  26                 0.0
UdpOutDatagrams                 27                 0.0
Ip6OutNoRoutes                  1                  0.0
TcpExtDelayedACKs               1                  0.0
TcpExtTCPPrequeued              98                 0.0
TcpExtTCPDirectCopyFromPrequeue 98                 0.0
TcpExtTCPHPHits                 4                  0.0
TcpExtTCPHPHitsToUser           98                 0.0
TcpExtTCPPureAcks               5                  0.0
TcpExtTCPHPAcks                 101                0.0
TcpExtTCPAbortOnData            6                  0.0
TcpExtBusyPollRxPackets         400832             0.0
TcpExtTCPOrigDataSent           400983             0.0
IpExtInOctets                   21273867           0.0
IpExtOutOctets                  21261254           0.0
IpExtInNoECTPkts                401064             0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b59768c6

mlx5: support napi_complete_done() · 44fb6fbb

Eric Dumazet authored Nov 18, 2015

A NAPI poll handler should return number of RX packets processed,
instead of 0 / budget.

This allows proper busy poll accounting through LINUX_MIB_BUSYPOLLRXPACKETS
SNMP counter.

napi_complete_done() allows /sys/class/net/ethX/gro_flush_timeout
to be used for finer GRO aggregation control.

Tested:

Enabled busy polling, and checked TcpExtBusyPollRxPackets counter is increasing.

echo 70 >/proc/sys/net/core/busy_read
nstat >/dev/null
netperf -H target -t TCP_RR >/dev/null
nstat | grep TcpExtBusyPollRxPackets
TcpExtBusyPollRxPackets         490958             0.0
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Eli Cohen <eli@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

44fb6fbb

mlx5: add busy polling support · 7ae92ae5

Eric Dumazet authored Nov 18, 2015

It is now easy to add busy polling support to a NAPI driver,
with very little impact on normal input path.

This patch serves as a reference implementation.

Note:

A followup patch will add proper napi_complete_done() in mlx5,
so that LINUX_MIB_BUSYPOLLRXPACKETS snmp counter is properly handled.

Tested:

Normal TCP_RR results without busy polling :

lpk51:~# echo 0 >/proc/sys/net/core/busy_read
lpk52:~# echo 0 >/proc/sys/net/core/busy_read

lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    53509.49
16384  87380

Now enable busy polling :

lpk51:~# echo 70 >/proc/sys/net/core/busy_read
lpk52:~# echo 70 >/proc/sys/net/core/busy_read

lpk51:~# ./netperf -H 192.168.4.52 -t TCP_RR -l 10
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.4.52 () port 0 AF_INET : first burst 0
Local /Remote
Socket Size   Request  Resp.   Elapsed  Trans.
Send   Recv   Size     Size    Time     Rate
bytes  Bytes  bytes    bytes   secs.    per sec

16384  87380  1        1       10.00    97530.92
16384  87380
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7ae92ae5

net: network drivers no longer need to implement ndo_busy_poll() · ce6aea93

Eric Dumazet authored Nov 18, 2015

Instead of having to implement complex ndo_busy_poll() method,
drivers can simply rely on NAPI poll logic.

Busy polling gains are mainly coming from polling itself,
not on exact details on how we poll the device.

ndo_busy_poll() if implemented can avoid touching
napi state, but it adds extra synchronization between
normal napi->poll() and busy poll handler, slowing down
the common path (non busy polling) with extra atomic operations.
In practice few drivers ever got busy poll because of the complexity.

We could go one step further, and make busy polling
available for all NAPI drivers, but this would require
that all netif_napi_del() calls are done in process context
so that we can call synchronize_rcu().
Full audit would be required.

Before this is done, a driver still needs to call :

- skb_mark_napi_id() for each skb provided to the stack.
- napi_hash_add() and napi_hash_del() to allocate a napi_id per napi struct.
- Make sure RCU grace period is respected after napi_hash_del() before
  memory containing napi structure is freed.

Followup patch implements busy poll for mlx5 driver as an example.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ce6aea93