Commits · 1dd173b01f52c7a197cab20563d5f4967472a852 · Kirill Smelkov / linux

22 Jun, 2013 9 commits

IB/qib: Add qp_stats debug file · 1dd173b0

Mike Marciniszyn authored Jun 15, 2013

This adds a seq_file iterator for reporting the QP hash table when the
qp_stats file is read.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

1dd173b0

IB/qib: Add per-context stats interface · 17db3a92

Mike Marciniszyn authored Jun 15, 2013

This patch adds a debugfs stats interface for per kernel contexts
packet counts.

The code uses the opcode stats count and eliminates the counter in the
context.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

17db3a92

IB/qib: Convert opcode counters to per-context · ddb88765

Mike Marciniszyn authored Jun 15, 2013

This fix changes the opcode relative counters for receive to per
context.

Profiling has shown that when mulitple contexts are being used there
is a lot of cache activity associated with these counters.

The code formerly kept these counters per port, but only provided the
interface to read per HCA.  This patch converts the read of counters
to per HCA and adds the debugfs hooks to be able to read the file as a
sequence of opcodes.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

ddb88765

IB/qib: Optimize CQ callbacks · 85caafe3

Mike Marciniszyn authored Jun 04, 2013

The current workqueue implemention has the following performance
deficiencies on QDR HCAs:

- The CQ call backs tend to run on the CPUs processing the
  receive queues
- The single thread queue isn't optimal for multiple HCAs

This patch adds a dedicated per HCA bound thread to process CQ callbacks.
Reviewed-by: Ramkrishna Vepa <ramkrishna.vepa@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

85caafe3

IB/qib: Add dual-rail NUMA awareness for PSM processes · c804f072

Ramkrishna Vepa authored Jun 02, 2013

The driver currently selects a HCA based on the algorithm that PSM
chooses, contexts within a HCA or across. The HCA can also be chosen
by the user. Either way, this patch assigns a CPU on the NUMA node
local to the selected HCA. This patch also tries to select the HCA
closest to the NUMA node of the CPU assigned via taskset to PSM
process. If this HCA is unusable then another unit is selected based
on the algorithm that is currently enforced or selected by PSM - round
robin context selection 'within' or 'across' HCA's.

Fixed a bug wherein contexts are setup on the NUMA node on which the
processes are opened (setup_ctxt()) and not on the NUMA node that the
driver recommends the CPU on.
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
Signed-off-by: Ramkrishna Vepa <ramkrishna.vepa@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

c804f072

IB/qib: Add optional NUMA affinity · e0f30bac

Ramkrishna Vepa authored May 28, 2013

This patch adds context relative numa affinity conditioned on the
module parameter numa_aware. The qib_ctxtdata has an additional
node_id member and qib_create_ctxtdata() has an addition node_id
parameter.

The allocations within the hdr queue and eager queue setup routines
now take this additional member and adjust allocations as necesary.
PSM will pass the either current numa node or the node closest to the
HCA depending on numa_aware. Verbs will always use the node closest to
the HCA.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Ramkrishna Vepa <ramkrishna.vepa@intel.com>
Signed-off-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

e0f30bac

IB/qib: Update minor version number · ab4a13d6

Vinit Agnihotri authored Jun 15, 2013

External PSM repositories have advanced the minor number for a variety
of reasons. The driver needs to increase to avoid warnings.
Signed-off-by: Vinit Agnihotri <vinit.abhay.agnihotri@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

ab4a13d6

IB/qib: Remove atomic_inc_not_zero() from QP RCU · f7cf9a61

Mike Marciniszyn authored Jun 15, 2013

Follow Documentation/RCU/rcuref.txt guidance in removing
atomic_inc_not_zero() from QP RCU implementation.

This patch also removes an unneeded synchronize_rcu() in the add path.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>

f7cf9a61

IB/qib: Add DCA support · 8469ba39

Mike Marciniszyn authored May 30, 2013

This patch adds DCA cache warming for systems that support DCA.

The code uses cpu affinity notification to react to an affinity change
from a user mode program like irqbalance and (re-)program the chip
accordingly. This notification avoids reading the current cpu on every
interrupt.
Reviewed-by: Dean Luick <dean.luick@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>

[ Add Kconfig dependency on SMP && GENERIC_HARDIRQS to avoid failure to
  build due to undefined struct irq_affinity_notify.  - Roland ]
Signed-off-by: Roland Dreier <roland@purestorage.com>

8469ba39

20 Jun, 2013 31 commits

ndisc: Convert use of typedef ctl_table to struct ctl_table · fedaf4ff

Joe Perches authored Jun 13, 2013

This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fedaf4ff

ipv6: Convert use of typedef ctl_table to struct ctl_table · 9e8cda3b

Joe Perches authored Jun 13, 2013

This typedef is unnecessary and should just be removed.
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e8cda3b

inet: frag , remove an empty ifdef. · af92e542

Rami Rosen authored Jun 15, 2013

This patch removes an empty ifdef from inet_frag_intern()
in net/ipv4/inet_fragment.c.

commit b67bfe0d
(hlist: drop the node parameter from iterators) removed hlist from
net/ipv4/inet_fragment.c, but did not remove the enclosing ifdef command,
which is now empty.
Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

af92e542

htb: refactor struct htb_sched fields for performance · c9364636

Eric Dumazet authored Jun 15, 2013

htb_sched structures are big, and source of false sharing on SMP.

Every time a packet is queued or dequeue, many cache lines must be
touched because structures are not lay out properly.

By carefully splitting htb_sched in two parts, and define sub structures
to increase data locality, we can improve performance dramatically on
SMP.

New htb_prio structure can also be used in htb_class to increase data
locality.

I got 26 % performance increase on a 24 threads machine, with 200
concurrent netperf in TCP_RR mode, using a HTB hierarchy of 4 classes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c9364636

tcp: introduce a per-route knob for quick ack · bcefe17c

Cong Wang authored Jun 15, 2013

In previous discussions, I tried to find some reasonable heuristics
for delayed ACK, however this seems not possible, according to Eric:

	"ACKS might also be delayed because of bidirectional
	traffic, and is more controlled by the application
	response time. TCP stack can not easily estimate it."

	"ACK can be incredibly useful to recover from losses in
	a short time.

	The vast majority of TCP sessions are small lived, and we
	send one ACK per received segment anyway at beginning or
	retransmits to let the sender smoothly increase its cwnd,
	so an auto-tuning facility wont help them that much."

and according to David:

	"ACKs are the only information we have to detect loss.

	And, for the same reasons that TCP VEGAS is fundamentally
	broken, we cannot measure the pipe or some other
	receiver-side-visible piece of information to determine
	when it's "safe" to stretch ACK.

	And even if it's "safe", we should not do it so that losses are
	accurately detected and we don't spuriously retransmit.

	The only way to know when the bandwidth increases is to
	"test" it, by sending more and more packets until drops happen.
	That's why all successful congestion control algorithms must
	operate on explicited tested pieces of information.

	Similarly, it's not really possible to universally know if
	it's safe to stretch ACK or not."

It still makes sense to enable or disable quick ack mode like
what TCP_QUICK_ACK does.

Similar to TCP_QUICK_ACK option, but for people who can't
modify the source code and still wants to control
TCP delayed ACK behavior. As David suggested, this should belong
to per-path scope, since different pathes may want different
behaviors.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Rick Jones <rick.jones2@hp.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Graf <tgraf@suug.ch>
CC: David Laight <David.Laight@ACULAB.COM>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

bcefe17c

sctp: Convert __list_for_each use to list_for_each · 2c0740e4

Dave Jones authored Jun 17, 2013

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c0740e4

bnx2: use pdev->pm_cap instead of pci_find_capability(.., PCI_CAP_ID_PM) · 85768271

Yijing Wang authored Jun 18, 2013

Pci core has been saved pm cap register offset by pdev->pm_cap in pci_pm_init()
in init path. So we can use pdev->pm_cap instead of using
pci_find_capability(pdev, PCI_CAP_ID_PM) for better performance and simplified code.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>

85768271

amd8111e: use pdev->pm_cap instead of pci_find_capability(.., PCI_CAP_ID_PM) · f9c7da5e

Yijing Wang authored Jun 18, 2013

Pci core has been saved pm cap register offset by pdev->pm_cap in pci_pm_init()
in init path. So we can use pdev->pm_cap instead of using
pci_find_capability(pdev, PCI_CAP_ID_PM) for better performance and simplified code.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Bill Pemberton <wfp5p@virginia.edu>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
Signed-off-by: David S. Miller <davem@davemloft.net>

f9c7da5e

Bnx2x: remove redundant D0 power state set · b8a39dd2

Yijing Wang authored Jun 18, 2013

Pci_enable_device() will set device power state to D0,
so it's no need to do it again in bnx2x_init_dev().
Also remove redundant PM Cap find code, because pci core
has been saved the pci device pm cap value.
Signed-off-by: Yijing Wang <wangyijing@huawei.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Yuval Mintz <yuvalmin@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b8a39dd2

net: Add missing dependencies on NETDEVICES · 2206209e

Ben Hutchings authored Jun 18, 2013

ETRAX_ETHERNET selects ETHERNET and MII, which depend on NETDEVICES.
I don't think anything should select NETDEVICES, so make it a
dependency.  It also doesn't need to select or depend on ETHERNET,
which has nothing to do with the Ethernet library functions.

BPCTL selects MII, which depends on NETDEVICES.  But everything in the
drivers/staging/silicom directory is related to net devices, so make
NET_VENDOR_SILICOM depend on NETDEVICES and remove the now-redundant
dependencies on NET.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>

2206209e

at91_ether: Do not select NET_CORE · d6cf7a86

Ben Hutchings authored Jun 18, 2013

This has no dependency on any of the drivers under NET_CORE.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d6cf7a86

net: Move MII out from under NET_CORE and hide it · a1606c7d

Ben Hutchings authored Jun 18, 2013

All drivers that select MII also need to select NET_CORE because MII
depends on it.  This is a bit ridiculous because NET_CORE is just a
menu option that doesn't enable any code by itself.

There is also no need for it to be a visible option, since its users
all select it.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a1606c7d

tcp:typo unset should be unsent · 9ef71e0c

Weiping Pan authored Jun 18, 2013

Signed-off-by: Weiping Pan <wpan@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9ef71e0c

bonding: trivial: make alb use bond_slave_has_mac() · b88ec38d

Veaceslav Falico authored Jun 18, 2013

Also, cleanup bond_alb_handle_active_change() from 2 identical ifs.
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b88ec38d

be2net: use pci_vfs_assigned()/pci_num_vf() instead of be_find_vfs() · 257a3feb

Sathya Perla authored Jun 14, 2013

be_find_vfs() is no longer needed as the common PCI calls provide the same
functionality.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

257a3feb

sit: fix an oops when IFLA_IPTUN_PROTO is not set · c2ff682a

Nicolas Dichtel authored Jun 19, 2013

The use of this attribute has been added in 32b8a8e5 (sit: add IPv4 over
IPv4 support). It is optional, by default proto is IPPROTO_IPV6.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c2ff682a

net: sock: adapt SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF · eea86af6

Daniel Borkmann authored Jun 19, 2013

The current situation is that SOCK_MIN_RCVBUF is 2048 + sizeof(struct sk_buff))
while SOCK_MIN_SNDBUF is 2048. Since in both cases, skb->truesize is used for
sk_{r,w}mem_alloc accounting, we should have both sizes adjusted via defining a
TCP_SKB_MIN_TRUESIZE.

Further, as Eric Dumazet points out, the minimal skb truesize in transmit path is
SKB_TRUESIZE(2048) after commit f07d960d ("tcp: avoid frag allocation for
small frames"), and tcp_sendmsg() tries to limit skb size to half the congestion
window, meaning we try to build two skbs at minimum. Thus, having SOCK_MIN_SNDBUF
as 2048 can hit a small regression for some applications setting to low
SO_SNDBUF / SO_RCVBUF. Note that we define a TCP_SKB_MIN_TRUESIZE, because
SKB_TRUESIZE(2048) adds SKB_DATA_ALIGN(sizeof(struct skb_shared_info)), but in
case of TCP skbs, the skb_shared_info is part of the 2048 bytes allocation for
skb->head.

The minor adaption in sk_stream_moderate_sndbuf() is to silence a warning by
using a typed max macro, as similarly done in SOCK_MIN_RCVBUF occurences, that
would appear otherwise.
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

eea86af6

neigh: disallow un-init_net to change thresh of neigh · dc25c676

Gao feng authored Jun 20, 2013

thresh and interval are global resources,
only init net can change them.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dc25c676

neigh: only allow init_net to change the default neigh_parms · 170d6f99

Gao feng authored Jun 20, 2013

Though we don't export the /proc/sys/net/ipv[4,6]/neigh/default/
directory to the un-init_net, but we can still use cmd such as
"ip ntable change name arp_cache locktime 129" to change the locktime
of default neigh_parms.

This patch disallows the un-init_net to find out the neigh_table.parms.
So the un-init_net will failed to influence the init_net.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

170d6f99

neigh: no need to call lookup_neigh_parms in neigh_parms_alloc · cf89d6b2

Gao feng authored Jun 20, 2013

neigh_table.parms always exist and is initialized,kmemdup
can use it to create new neigh_parms, actually lookup_neigh_parms
here will return neigh_table.parms too.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf89d6b2

bnx2x: replace mechanism to check for next available packet · 75b29459

Dmitry Kravkov authored Jun 19, 2013

Check next packet availability by validating that HW has finished CQE
placement. This saves latency of another dma transaction performed to update
SB indexes.
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

75b29459

bnx2x: add support for ndo_ll_poll · 8f20aa57

Dmitry Kravkov authored Jun 19, 2013

Adds ndo_ll_poll method and locking for FPs between LL and the napi.

When receiving a packet we use skb_mark_ll to record the napi it came from.
Add each napi to the napi_hash right after netif_napi_add().
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8f20aa57

net/mlx4_en: Low Latency recv statistics · 8501841a

Amir Vadai authored Jun 18, 2013

Signed-off-by: Amir Vadai <amirv@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

8501841a

net/mlx4_en: Add Low Latency Socket (LLS) support · 9e77a2b8

Amir Vadai authored Jun 18, 2013

Add basic support for LLS.
Signed-off-by: Amir Vadai <amirv@mellanox.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9e77a2b8

openvswitch: gre tunneling support. · dc3d807d

David S. Miller authored Jun 19, 2013

Pravin B Shelar says:

====================
Following patch series adds support for gre tunneling.
First six patches extend kernel gre and ip_tunnel modules
api so that there is more code sharing between gre modules
and ovs. Rest of patches adds ovs tunneling infrastructre
and gre protocol vport.

V2 fixes two patches according to comments from Jesse.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

dc3d807d

openvswitch: Add gre tunnel support. · aa310701

Pravin B Shelar authored Jun 17, 2013

Add gre vport implementation.  Most of gre protocol processing
is pushed to gre module. It make use of gre demultiplexer
therefore it can co-exist with linux device based gre tunnels.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

aa310701

openvswitch: Optimize flow key match for non tunnel flows. · a3e82996

Pravin B Shelar authored Jun 17, 2013

Following patch adds start offset for sw_flow-key, so that we can
skip tunneling information in key for non-tunnel flows.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3e82996

openvswitch: Expand action buffer size. · ffe3f432

Pravin B Shelar authored Jun 17, 2013

MAX_ACTIONS_BUFSIZE limits action list size, set tunnel action
needs extra space on action list, for now increase max actions list limit.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ffe3f432

openvswitch: Add tunneling interface. · 7d5437c7

Pravin B Shelar authored Jun 17, 2013

Add ovs tunnel interface for set tunnel action for userspace.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7d5437c7

openvswitch: Copy individual actions. · 74f84a57

Pravin B Shelar authored Jun 17, 2013

Rather than validating actions and then copying all actiaons
in one block, following patch does same operation in single pass.
This validate and copy action one by one. This is required for
ovs tunneling patch.

This patch does not change any functionality.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

74f84a57

ip_tunnel: Add dont fragment flag. · 9a628224

Pravin B Shelar authored Jun 17, 2013

This flag will be used by ovs tunneling.
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

9a628224