Commits · 3b392ddba25a95dcf5fb30b33358961c49dd5cfc · Kirill Smelkov / linux

05 Jun, 2014 30 commits

MPLS: Use mpls_features to activate software MPLS GSO segmentation · 3b392ddb

Simon Horman authored Jun 04, 2014

If an MPLS packet requires segmentation then use mpls_features
to determine if the software implementation should be used.

As no driver advertises MPLS GSO segmentation this will always be
the case.

I had not noticed that this was necessary before as software MPLS GSO
segmentation was already being used in my test environment. I believe that
the reason for that is the skbs in question always had fragments and the
driver I used does not advertise NETIF_F_FRAGLIST (which seems to be the
case for most drivers). Thus software segmentation was activated by
skb_gso_ok().

This introduces the overhead of an extra call to skb_network_protocol()
in the case where where CONFIG_NET_MPLS_GSO is set and
skb->ip_summed == CHECKSUM_NONE.

Thanks to Jesse Gross for prompting me to investigate this.
Signed-off-by: Simon Horman <horms@verge.net.au>
Acked-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

3b392ddb

Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · d68de60f

David S. Miller authored Jun 05, 2014

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2014-06-05

This series contains updates to i40e and i40evf.

Jesse fixes an issue reported by Dave Jones where a couple of FD checks
ended up using bitwise OR where it should have been bitwise AND.

Neerav removes unused defines and macros for receive LRO.  Fix the driver
from allowing the user to set a larger MTU size that the hardware was
being configured to support.  Refactors send version which moves code in
two places into a small helper function.

Kamil modifies register diagnostics since register ranges can vary among
the different NVMs to avoid false test results.  So now we try to identify
the full range and use it for a register test and if we fail to define the
proper register range, we will only test the first register from that
group.  Then removes the check for large buffer since this was added in the
case this structure changed in the future, since the AQ definition is now
mature enough that this check is no longer necessary.

Mitch fixes i40evf driver to allocate descriptors in groups of 32 since the
hardware requires it.  Also fixes a crash when the ring size changed because
it would change the count before deallocating resources, causing the driver
to either free nonexistent buffers or leak leftover buffers.  Fixed the
driver to notify the VF for all types of resets so the VF can attempt a
graceful reinit.

Shannon refactors stats collection to create a unifying stats update routine
to call the various stat collection routines.  Removes rx_errors and
rx_missed stats since they were removed from the chip design.  Added
missing VSI statistics that the hardware offers but are not apart of the
standard netdev stats.

v2: dropped patch "i40e: Allow disabling of DCB via debugfs" from Neerav
    based on feedback from David Miller.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

d68de60f

i40e: add missing VSI statistics · 41a9e55c

Shannon Nelson authored Apr 23, 2014

Add a couple more statistics that the hardware offers but aren't part
of the standard netdev stats.

Change-ID: I201db2898f2c284aee3d9631470bc5edd349e9a5
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

41a9e55c

i40e/i40evf: remove rx_errors and rx_missed · 03da6f6a

Shannon Nelson authored Apr 23, 2014

The rx_errors (GLV_REPC) and rx_missed (GLV_RMPC) were removed
from the chip design.

Change-ID: Ifdeb69c90feac64ec95c36d3d32c75e3a06de3b7
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

03da6f6a

i40e: refactor stats collection · 7812fddc

Shannon Nelson authored Apr 23, 2014

Pull the PF stat collection out of the VSI collection routine, and
add a unifying stats update routine to call the various stat collection
routines.

Change-ID: I224192455bb3a6e5dc0a426935e67dffc123e306
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

7812fddc

i40e: refactor send version · 44033fac

Jesse Brandeburg authored Apr 23, 2014

This change moves some common code in two places into a
small helper function, and corrects a bug in one of the
two places in the process.

Change-ID: If3bba7152b240f13a7881eb0e8a781655fa66ce7
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

44033fac

i40e/i40evf: VEB structure added, GTIME macro update · 4f4e17bd

Kamil Krawczyk authored Apr 23, 2014

Structure for VEB context added. Update macro for
transition from ms to GTIME (us) time units.

Change-ID: Ib3a19587b4cf355348655df8f60c6f37bb1497a3
Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

4f4e17bd

i40e: notify VF of all types of resets · 263fc48f

Mitch Williams authored Apr 23, 2014

Currently, the PF driver only notifies the VFs for PF reset events.
Instead, notify the VFs for all types of resets, so they can attempt a
graceful reinit.

Change-ID: I03eb7afde25727198ef620f8b4e78bb667a11370
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

263fc48f

i40evf: fix crash when changing ring sizes · d732a184

Mitch Williams authored Apr 24, 2014

i40evf_set_ringparam was broken in several ways. First, it only changed
the size of the first ring, and second, changing the ring size would
often result in a panic because it would change the count before
deallocating resources, causing the driver to either free nonexistent
buffers, or leak leftover buffers.

Fix this by storing the descriptor count in the adapter structure, and
updating the count for each ring each time we allocate them. This
ensures that we always free the right size ring, and always end up with
the requested count when the device is (re)opened.

Change-ID: I298396cd3d452ba8509d9f2d33a93f25868a9a55
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d732a184

i40evf: set descriptor multiple to 32 · 337eb08e

Mitch Williams authored Apr 23, 2014

Hardware requires descriptors to be allocated in groups of 32.

Change-ID: I752ccc96769d1bd8d3018c004b8aeff464045bf2
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

337eb08e

i40e: clamp jumbo frame size · 61a46a4c

Jesse Brandeburg authored Apr 23, 2014

The driver was allowing the user to set larger size MTU
than the hardware was being configured to support.

The driver was already using VLAN_HLEN when setting the
hardware max receivable frame size, so just add it to the
netdev MTU set entry point as well.

Change-ID: Ie20e2a35d04f8c411253e255bea79ca69aaeaea3
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

61a46a4c

i40e/i40evf: remove unused RX_LRO define · d0277054

Jesse Brandeburg authored Apr 23, 2014

Remove unused defines and macros for RX_LRO.

Change-ID: I8ca6715edfa62b56837417a1c4ff68c2345dab6e
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d0277054

i40e: remove check for large buffer · c5cc0cfd

Kamil Krawczyk authored Apr 23, 2014

We introduced this check in case this structure changed in the future,
the AQ definition is now mature enough that this check is no longer necessary.

Change-ID: Ic66321d0a08557dc9d8cb84029185352cb534330
Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com>
Tested-by: Jim Young <jamesx.m.young@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

c5cc0cfd

i40e: Rework register diagnostic · 22dd9ae8

Kamil Krawczyk authored Apr 23, 2014

Register range, being subject to register diagnostic, can vary among
different NVMs. We will try to identify the full range and use it for
a register test. This is needed to avoid false test results. If we fail
to define the proper register range we will test only the first register
from that group.

Change-ID: Ieee7173c719733b61d3733177a94dc557eb7b3fd
Signed-off-by: Kamil Krawczyk <kamil.krawczyk@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

22dd9ae8

i40e: don't use OR to check a value · d3a90b70

Jesse Brandeburg authored Apr 16, 2014

A couple of FD checks ended up using bitwise OR to check
a value, which ends up always being evaluated to true.

This should fix the issue.  Thanks to DaveJ for noticing
and reporting the issue!

CC: Dave Jones <davej@redhat.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

d3a90b70

ipv4: use skb frags api in udp4_hwcsum() · ebbe495f

WANG Cong authored Jun 02, 2014

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ebbe495f

net: use the new API kvfree() · 4cb28970

WANG Cong authored Jun 02, 2014

It is available since v3.15-rc5.

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4cb28970

dns_resolver: Do not accept domain names longer than 255 chars · 9638f671

Manuel Schölling authored May 31, 2014

According to RFC1035 "[...] the total length of a domain name (i.e.,
label octets and label length octets) is restricted to 255 octets or
less."
Signed-off-by: Manuel Schölling <manuel.schoelling@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

9638f671

Merge branch 'isdn-capi' · 555878b9

David S. Miller authored Jun 04, 2014

Tilman Schmidt says:

====================
ISDN patches for net-next (v2)

Here's v2 of the series of patches for the ISDN CAPI subsystem
prepared by Paul Bolle and reviewed by yours truly. It reflects
GregKH's review, resulting in a substantial simplification.
Please merge via net-next.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

555878b9

isdn/capi: fix (middleware) device nodes · d1cadce1

Paul Bolle authored Jun 01, 2014

Since v2.4 the capi driver used the following device nodes if
"middleware" support was enabled:
    /dev/capi20
    /dev/capi/0
    /dev/capi/1
    [...]

/dev/capi20 is a character device node. /dev/capi/0 (and up) are tty
device nodes (with a different major).

This device node (naming) scheme is not documented anywhere, as far as I
know. It was originally provided by the capifs pseudo filesystem (before
udev became available). It is required for example by the pppd
capiplugin. It was supported until a few years ago. But a number of
developments broke it:
- v2.6.6 (May 2004) renamed /dev/capi20 to /dev/capi and removed the
  "/" from the name of capi's tty driver. The explanation of the patch
  that did this included two examples of udev rules "to restore the old
  namespace";
- either udev 154 (May 2010) or udev 179 (January 2012) stopped
  allowing to rename device nodes, and thus the ability to have
  /dev/capi20 appear instead of /dev/capi and /dev/capi/0 (and up)
  instead of /dev/capi0 (and up);
- v3.0 (July 2011) also removed capifs. That disabled another method to
  create the /dev/capi/0 (and up) device nodes.

So now users need to manually tweak their setup (eg, create /dev/capi/
and fill that with symlinks) to get things working. This is all rather
hacky and only discoverable by searching the web. Fix all this by
renaming /dev/capi back to /dev/capi20, and by setting the name of the
"capi_nc" tty driver to "capi!" so the tty device nodes appear as
/dev/capi/0 (and up).
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

d1cadce1

isdn/capi: Make verbose reporting depend on capidrv · a79f5d26

Paul Bolle authored Jun 01, 2014

The Kconfig symbol ISDN_DRV_AVMB1_VERBOSE_REASON is only used for
capi_info2str(). That function is only used in capidrv.c. So setting it
without setting ISDN_CAPI_CAPIDRV is pointless. Make it depend on
ISDN_CAPI_CAPIDRV, rename it to ISDN_CAPI_CAPIDRV_VERBOSE and put its
entry after ISDN_CAPI_CAPIDRV's entry.

Since this symbol seems to be primarily used for debugging, keep it off
by default. By now the last users of capidrv hopefully know all they
need to know about the reasons for disconnecting.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>

a79f5d26

isdn/capi: move capi_info2str to capidrv.c · ca05e3a7

Paul Bolle authored Jun 01, 2014

capi_info2str() is apparently meant to be of general utility. It is
actually only used in capidrv.c. So move it from capiutil.c to
capidrv.c and (obviously) stop exporting it.

And, since we're touching this, merge the two versions of this
function.
Signed-off-by: Paul Bolle <pebolle@tiscali.nl>
Signed-off-by: Tilman Schmidt <tilman@imap.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>

ca05e3a7

Merge branch 'inet_csums' · 00d115fc

David S. Miller authored Jun 04, 2014

Tom Herbert says:

====================
net: Support checksum in UDP

This patch series adds support for using checksums in UDP tunnels. With
this it is possible that two or more checksums may be set within the
same packet and we would like to do that efficiently.

This series also creates some new helper functions to be used by various
tunnel protocol implementations.

v2: Fixed indentation in tcp_v6_send_check arguments.
v3: Move udp_set_csum and udp6_set_csum to be not inlined
    Also have this functions call with a nocheck boolean argument
    instead of passing a sock structure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

00d115fc

vxlan: Add support for UDP checksums (v4 sending, v6 zero csums) · 359a0ea9

Tom Herbert authored Jun 04, 2014

Added VXLAN link configuration for sending UDP checksums, and allowing
TX and RX of UDP6 checksums.

Also, call common iptunnel_handle_offloads and added GSO support for
checksums.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

359a0ea9

gre: Call gso_make_checksum · 4749c09c

Tom Herbert authored Jun 04, 2014

Call gso_make_checksum. This should have the benefit of using a
checksum that may have been previously computed for the packet.

This also adds NETIF_F_GSO_GRE_CSUM to differentiate devices that
offload GRE GSO with and without the GRE checksum offloaed.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4749c09c

net: Add GSO support for UDP tunnels with checksum · 0f4f4ffa

Tom Herbert authored Jun 04, 2014

Added a new netif feature for GSO_UDP_TUNNEL_CSUM. This indicates
that a device is capable of computing the UDP checksum in the
encapsulating header of a UDP tunnel.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0f4f4ffa

tcp: Call gso_make_checksum · e9c3a24b

Tom Herbert authored Jun 04, 2014

Call common gso_make_checksum when calculating checksum for a
TCP GSO segment.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e9c3a24b

net: Support for multiple checksums with gso · 7e2b10c1

Tom Herbert authored Jun 04, 2014

When creating a GSO packet segment we may need to set more than
one checksum in the packet (for instance a TCP checksum and
UDP checksum for VXLAN encapsulation). To be efficient, we want
to do checksum calculation for any part of the packet at most once.

This patch adds csum_start offset to skb_gso_cb. This tracks the
starting offset for skb->csum which is initially set in skb_segment.
When a protocol needs to compute a transport checksum it calls
gso_make_checksum which computes the checksum value from the start
of transport header to csum_start and then adds in skb->csum to get
the full checksum. skb->csum and csum_start are then updated to reflect
the checksum of the resultant packet starting from the transport header.

This patch also adds a flag to skbuff, encap_hdr_csum, which is set
in *gso_segment fucntions to indicate that a tunnel protocol needs
checksum calculation
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7e2b10c1

l2tp: call udp{6}_set_csum · 77157e19

Tom Herbert authored Jun 04, 2014

Call common functions to set checksum for UDP tunnel.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

77157e19

udp: Generic functions to set checksum · af5fcba7

Tom Herbert authored Jun 04, 2014

Added udp_set_csum and udp6_set_csum functions to set UDP checksums
in packets. These are for simple UDP packets such as those that might
be created in UDP tunnels.
Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af5fcba7

04 Jun, 2014 10 commits

Merge branch 'bonding-macvlan' · 6579867c

David S. Miller authored Jun 04, 2014

Vlad Yasevich says:

====================
Fix support for macvlan devices on top bonding

Currently, macvlan devices do not work well over bond interfaces.
Everything works well, untill a failover is triggered in the bond
device and then macvlan becomes unreachble untill arp entries
are flushed.   This series adds needed functionality to
handle correct notifications and update switches with mac addresses
assigned to macvlans.

The first patch simply addes IFF_UNICAST_FLT flag to bonds since they
already correctly manage the unicast filter list of the slaves, so
we might as well prevent the bond from needlessly going into promiscuous
mode.

The second patch adds notifier handler to macvlan to trigger correct
ARP notifications.

The third patch adds handling for TLB and RLB modes that use special
ETH_P_LOOPBACK type packets to teach switch about mac addresses.
It also allow ARPs for the macvlan mac addresses to be handled by
RLB mode.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

6579867c

bonding: Support macvlans on top of tlb/rlb mode bonds · 14af9963

Vlad Yasevich authored Jun 04, 2014

To make TLB mode work, the patch allows learning packets
to be sent using mac addresses assigned to macvlan devices,
also taking into an account vlans that may be between the
bond and macvlan device.

To make RLB work, all we have to do is accept ARP packets
for addresses added to the bond dev->uc list.  Since RLB
mode will take care to update the peers directly with
correct mac addresses, learning packets for these addresses
do not have be send to switch.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14af9963

macvlan: Support bonding events · 4c991255

Vlad Yasevich authored Jun 04, 2014

Bonding and team drivers generate specific events during failover
that trigger switch updates.  When a macvlan device is configured
on top of bonding, we want switches to learn about the macvlan
devices as well.   This patch adds a handler to macvlan driver to
propagate these events to all macvlan devices.  We let the generic
inetdev event handler do the work.

This allows macvlan to operated correctly over active-backup
mode bond.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4c991255

bonding: Turn on IFF_UNICAST_FLT on bond devices · c565b488

Vlad Yasevich authored Jun 04, 2014

Bonding devices manage the unicast filters of the underlying
interfaces, but do not turn on IFF_UNICAST_FLT flag.  Thus
anytime a unicast address is added to the bond, the bond is
places in promiscuous mode.

Turn on IFF_UNICAST_FLT on the bond device so that the bond does
not go into promiscuous mode needlesly.  If an underlying device
does not support unicast filtering, that device will automaticall
enter promiscuous mode already.
Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c565b488

net: Revert "fib_trie: use seq_file_net rather than seq->private" · f830b022

Sasha Levin authored Jun 04, 2014

This reverts commit 30f38d2f.

fib_triestat is surrounded by a big lie: while it claims that it's a
seq_file (fib_triestat_seq_open, fib_triestat_seq_show), it isn't:

	static const struct file_operations fib_triestat_fops = {
	        .owner  = THIS_MODULE,
	        .open   = fib_triestat_seq_open,
	        .read   = seq_read,
	        .llseek = seq_lseek,
	        .release = single_release_net,
	};

Yes, fib_triestat is just a regular file.

A small detail (assuming CONFIG_NET_NS=y) is that while for seq_files
you could do seq_file_net() to get the net ptr, doing so for a regular
file would be wrong and would dereference an invalid pointer.

The fib_triestat lie claimed a victim, and trying to show the file would
be bad for the kernel. This patch just reverts the issue and fixes
fib_triestat, which still needs a rewrite to either be a seq_file or
stop claiming it is.
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f830b022

trivial: drivers/net/ethernet/nvidia/forcedeth.c: fix typo s/SUBSTRACT1/SUBTRACT1/ · cef33c81

Antonio Ospite authored Jun 04, 2014

Signed-off-by: Antonio Ospite <ao2@ao2.it>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

cef33c81

gianfar: Fix the section mismatch warnings. · 898157ed

Xiubo Li authored Jun 04, 2014

Building with CONFIG_DEBUG_SECTION_MISMATCH enabled, the following
WARNING is occured:

  LD      drivers/net/built-in.o
WARNING: drivers/net/built-in.o(.text+0xcd4c): Section mismatch in
reference from the function gfar_probe() to the function
.init.text:gfar_init_addr_hash_table()
The function gfar_probe() references
the function __init gfar_init_addr_hash_table().
This is often because gfar_probe lacks a __init
annotation or the annotation of gfar_init_addr_hash_table is wrong.
Signed-off-by: Xiubo Li <Li.Xiubo@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

898157ed

Merge branch 'xen-netback-netfront-multiqueue' · 9ab89acc

David S. Miller authored Jun 04, 2014

Wei Liu says:

====================
This is rebased version of Andrew's V8 patch series. The original cover letter:

--------------------
xen-net{back, front}: Multiple transmit and receive queues

This patch series implements multiple transmit and receive queues (i.e.
multiple shared rings) for the xen virtual network interfaces.

The series is split up as follows:
- Patch 1 brings the 'grant_copy_op' array back into struct xenvif, in
preparation for multi-queue support. See the patch itself for more details.
- Patches 2 and 4 factor out the queue-specific data for netback and
netfront respectively, and modify the rest of the code to use these
as appropriate.
- Patches 3 and 5 introduce new XenStore keys to negotiate and use
multiple shared rings and event channels, and code to connect these
as appropriate.
- Patch 6 documents the XenStore keys required for the new feature
in include/xen/interface/io/netif.h

All other transmit and receive processing remains unchanged, i.e. there
is a kthread per queue and a NAPI context per queue.

The performance of these patches has been analysed in detail, with
results available at:

http://wiki.xenproject.org/wiki/Xen-netback_and_xen-netfront_multi-queue_performance_testing

To summarise:
* Using multiple queues allows a VM to transmit at line rate on a 10
Gbit/s NIC, compared with a maximum aggregate throughput of 6 Gbit/s
with a single queue.
* For intra-host VM--VM traffic, eight queues provide 171% of the
throughput of a single queue; almost 12 Gbit/s instead of 6 Gbit/s.
* There is a corresponding increase in total CPU usage, i.e. this is a
scaling out over available resources, not an efficiency improvement.
* Results depend on the availability of sufficient CPUs, as well as the
distribution of interrupts and the distribution of TCP streams across
the queues.

Queue selection is currently achieved via an L4 hash on the packet (i.e.
TCP src/dst port, IP src/dst address) and is not negotiated between the
frontend and backend, since only one option exists. Future patches to
support other frontends (particularly Windows) will need to add some
capability to negotiate not only the hash algorithm selection, but also
allow the frontend to specify some parameters to this.

Note that queue selection is a decision by the transmitting system about
which queue to use for a particular packet. In general, the algorithm
may differ between the frontend and the backend with no adverse effects.

Queue-specific XenStore entries for ring references and event channels
are stored hierarchically, i.e. under .../queue-N/... where N varies
from 0 to one less than the requested number of queues (inclusive). If
only one queue is requested, it falls back to the flat structure where
the ring references and event channels are written at the same level as
other vif information.

V8:
- Squash the queue error handling code into patch 3.
- Update the documentation (patch 6) according to comments on the
equivalent patch to Xen.

V7:
- Rebase on latest net-next, which includes the netback grant mapping
patch series from Zoltan Kiss
- Reduce QUEUE_NAME_SIZE by 1 to avoid double-counting the trailing '\0'
- Simplify the queue hashing by using (hash % num_queues) instead of
multiply & shift.
- Add ratelimited warning for invalid queue selection.
- Fix error handling to correctly tear down already setup queues.
- Use dev->real_num_tx_queues instead of separately maintaining a
count of the number of queues.

V6:
- Use 'max_queues' as the module param. name for both netback and netfront.

V5:
- Fix bug in xenvif_free() that could lead to an attempt to transmit an
skb after the queue structures had been freed.
- Improve the XenStore protocol documentation in netif.h.
- Fix IRQ_NAME_SIZE double-accounting for null terminator.
- Move rx_gso_checksum_fixup stat into struct xenvif_stats (per-queue).
- Don't initialise a local variable that is set in both branches (xspath).

V4:
- Add MODULE_PARM_DESC() for the multi-queue parameters for netback
and netfront modules.
- Move del_timer_sync() in netfront to after unregister_netdev, which
restores the order in which these functions were called before applying
these patches.

V3:
- Further indentation and style fixups.

V2:
- Rebase onto net-next.
- Change queue->number to queue->id.
- Add atomic operations around the small number of stats variables that
are not queue-specific or per-cpu.
- Fixup formatting and style issues.
- XenStore protocol changes documented in netif.h.
- Default max. number of queues to num_online_cpus().
- Check requested number of queues does not exceed maximum.
--------------------

I rebased this on top of net-next. No functional change is introduced. The
patch that needed some extra care was "xen-netback: Factor queue-specific data
into queue struct" because it clashed with a fix introduced in net. A simple
test of creating guest, iperf, then shutting down guest worked as expected.

The last patch fixes a minor problem that queue name is not initialised in
xen-netfront, resulting in names like "-tx" "-rx" in /proc/interrupt.

Changes since v9 (no functional change introduced):
* include commit summary in the commit message of first patch
* fold David Vrabel's Reviewed-by into last patch
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

9ab89acc

xen-netfront: initialise queue name in xennet_init_queue · 8b715010

Wei Liu authored Jun 04, 2014

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8b715010

xen-net{back, front}: Document multi-queue feature in netif.h · a2deb8b1

Andrew J. Bennieston authored Jun 04, 2014

Document the multi-queue feature in terms of XenStore keys to be written
by the backend and by the frontend.
Signed-off-by: Andrew J. Bennieston <andrew.bennieston@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2deb8b1