Commits · 029ac211464f9cf87fa7aa51a6f01e41642d76c3 · nexedi / linux

19 Sep, 2016 34 commits

Merge branch 'net-sched-singly-linked-list' · 029ac211

David S. Miller authored Sep 19, 2016

Florian Westphal says:

====================
sched: convert queues to single-linked list

During Netfilter Workshop 2016 Eric Dumazet pointed out that qdisc
schedulers use doubly-linked lists, even though single-linked list
would be enough.

The double-linked skb lists incur one extra write on enqueue/dequeue
operations (to change ->prev pointer of next list elem).

This series converts qdiscs to single-linked version, listhead
maintains pointers to first (for dequeue) and last skb (for enqueue).

Most qdiscs don't queue at all and instead use a leaf qdisc (typically
pfifo_fast) so only a few schedulers needed changes.

I briefly tested netem and htb and they seemed fine.

UDP_STREAM netperf with 64 byte packets via veth+pfifo_fast shows
a small (~2%) improvement.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

029ac211

sched: add and use qdisc_skb_head helpers · 48da34b7

Florian Westphal authored Sep 18, 2016

This change replaces sk_buff_head struct in Qdiscs with new qdisc_skb_head.

Its similar to the skb_buff_head api, but does not use skb->prev pointers.

Qdiscs will commonly enqueue at the tail of a list and dequeue at head.
While skb_buff_head works fine for this, enqueue/dequeue needs to also
adjust the prev pointer of next element.

The ->prev pointer is not required for qdiscs so we can just leave
it undefined and avoid one cacheline write access for en/dequeue.
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

48da34b7

sched: replace __skb_dequeue with __qdisc_dequeue_head · ed760cb8

Florian Westphal authored Sep 18, 2016

After previous patch these functions are identical.
Replace __skb_dequeue in qdiscs with __qdisc_dequeue_head.

Next patch will then make __qdisc_dequeue_head handle
single-linked list instead of strcut sk_buff_head argument.

Doesn't change generated code.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

ed760cb8

sched: remove qdisc arg from __qdisc_dequeue_head · ec323368

Florian Westphal authored Sep 18, 2016

Moves qdisc stat accouting to qdisc_dequeue_head.

The only direct caller of the __qdisc_dequeue_head version open-codes
this now.

This allows us to later use __qdisc_dequeue_head as a replacement
of __skb_dequeue() (which operates on sk_buff_head list).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

ec323368

sched: don't use skb queue helpers · 97d0678f

Florian Westphal authored Sep 18, 2016

A followup change will replace the sk_buff_head in the qdisc
struct with a slightly different list.

Use of the sk_buff_head helpers will thus cause compiler
warnings.

Open-code these accesses in an extra change to ease review.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

97d0678f

pie: use qdisc_dequeue_head wrapper · 1486587b

Florian Westphal authored Sep 18, 2016

Doesn't change generated code.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

1486587b

cxgb4: Fix return value check in cfg_queues_uld() · 106323b9

Wei Yongjun authored Sep 17, 2016

Fix the retrn value check which testing the wrong variable
in cfg_queues_uld().

Fixes: 94cdb8bb ("cxgb4: Add support for dynamic allocation of
resources for ULD")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

106323b9

Merge branch 'mediatek-hw-lro' · 4646651e

David S. Miller authored Sep 19, 2016

Nelson Chang says:

====================
net: ethernet: mediatek: add HW LRO functions

The series add the large receive offload (LRO) functions by hardware and
the ethtool functions to configure RX flows of HW LRO.

changes since v3:
- Respin the patch by the newer driver
- Move the dts description of hwlro to optional properties

changes since v2:
- Add ndo_fix_features to prevent NETIF_F_LRO off while RX flow is programmed
- Rephrase the dts property is a capability if the hardware supports LRO

changes since v1:
- Add HW LRO support
- Add ethtool hooks to set LRO RX flows
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

4646651e

net: ethernet: mediatek: add the dts property to set if the HW supports LRO · 004e6cc6

Nelson Chang authored Sep 17, 2016

Add the dts property for the capability if the hardware supports LRO.
Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

004e6cc6

net: ethernet: mediatek: add ethtool functions to configure RX flows of HW LRO · 7aab747e

Nelson Chang authored Sep 17, 2016

The codes add ethtool functions to set RX flows for HW LRO. Because the
HW LRO hardware can only recognize the destination IP of TCP/IP RX flows,
the ethtool command to add HW LRO flow is as below:
ethtool -N [devname] flow-type tcp4 dst-ip [ip_addr] loc [0~1]

Otherwise, cause the hardware can set total four destination IPs, each
GMAC (GMAC1/GMAC2) can set two IPs separately at most.
Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7aab747e

net: ethernet: mediatek: add HW LRO functions of PDMA RX rings · ee406810

Nelson Chang authored Sep 17, 2016

The codes add the large receive offload (LRO) functions by hardware as below:
1) PDMA has total four RX rings that one is the normal ring, and others can
   be configured as LRO rings.
2) Only TCP/IP RX flows can be offloaded. The hardware can set four IP
   addresses at most, if the destination IP of the RX flow matches one of
   them, it has the chance to be offloaded.
3) There three RX flows can be offloaded at most, and one flow is mapped to
   one RX ring.
4) If there are more than three candidate RX flows, the hardware can
   choose three of them by throughput comparison results.
Signed-off-by: Nelson Chang <nelson.chang@mediatek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee406810

chcr/cxgb4i/cxgbit/RDMA/cxgb4: Allocate resources dynamically for all cxgb4 ULD's · 0fbc81b3

Hariprasad Shenai authored Sep 17, 2016

Allocate resources dynamically to cxgb4's Upper layer driver's(ULD) like
cxgbit, iw_cxgb4 and cxgb4i. Allocate resources when they register with
cxgb4 driver and free them while unregistering. All the queues and the
interrupts for them will be allocated during ULD probe only and freed
during remove.
Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0fbc81b3

sctp: Remove some redundant code · e8bc8f9a

Christophe Jaillet authored Sep 16, 2016

In commit 311b2177 ("sctp: simplify sk_receive_queue locking"), a call
to 'skb_queue_splice_tail_init()' has been made explicit. Previously it was
hidden in 'sctp_skb_list_tail()'

Now, the code around it looks redundant. The '_init()' part of
'skb_queue_splice_tail_init()' should already do the same.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e8bc8f9a

mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full · 95357907

Jesper Dangaard Brouer authored Sep 16, 2016

The XDP_TX action can fail transmitting the frame in case the TX ring
is full or port is down. In case of TX failure it should drop the
frame, and not as now call 'break' which is the same as XDP_PASS.

Fixes: 9ecc2d86 ("net/mlx4_en: add xdp forwarding and data write support")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

95357907

Merge branch 'ipvlan-l3' · 8ddda653

David S. Miller authored Sep 19, 2016

Mahesh Bandewar says:

====================
IPvlan introduce l3s mode

Same old problem with new approach especially from suggestions from
earlier patch-series.

First thing is that this is introduced as a new mode rather than
modifying the old (L3) mode. So the behavior of the existing modes is
preserved as it is and the new L3s mode obeys iptables so that intended
conn-tracking can work.

To do this, the code uses newly added l3mdev_rcv() handler and an
Iptables hook. l3mdev_rcv() to perform an inbound route lookup with the
correct (IPvlan slave) interface and then IPtable-hook at LOCAL_INPUT
to change the input device from master to the slave to complete the
formality.

Supporting stack changes are trivial changes to export symbol to get
IPv4 equivalent code exported for IPv6 and to allow netfilter hook
registration code to allow caller to hold RTNL. Please look into
individual patches for details.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8ddda653

ipvlan: Introduce l3s mode · 4fbae7d8

Mahesh Bandewar authored Sep 16, 2016

In a typical IPvlan L3 setup where master is in default-ns and
each slave is into different (slave) ns. In this setup egress
packet processing for traffic originating from slave-ns will
hit all NF_HOOKs in slave-ns as well as default-ns. However same
is not true for ingress processing. All these NF_HOOKs are
hit only in the slave-ns skipping them in the default-ns.
IPvlan in L3 mode is restrictive and if admins want to deploy
iptables rules in default-ns, this asymmetric data path makes it
impossible to do so.

This patch makes use of the l3_rcv() (added as part of l3mdev
enhancements) to perform input route lookup on RX packets without
changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
to change the skb->dev just before handing over skb to L4.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: David Ahern <dsa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4fbae7d8

net: Add _nf_(un)register_hooks symbols · e8bffe0c

Mahesh Bandewar authored Sep 16, 2016

Add _nf_register_hooks() and _nf_unregister_hooks() calls which allow
caller to hold RTNL mutex.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

e8bffe0c

ipv6: Export p6_route_input_lookup symbol · d409b847

Mahesh Bandewar authored Sep 16, 2016

Make ip6_route_input_lookup available outside of ipv6 the module
similar to ip_route_input_noref in the IPv4 world.
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d409b847

Merge branch 'net-offloaded-stats' · a5ea31f5

David S. Miller authored Sep 18, 2016

Jiri Pirko says:

====================
net: return offloaded stats as default and expose original sw stats

The problem we try to handle is about offloaded forwarded packets
which are not seen by kernel. Let me try to draw it:

    port1                       port2 (HW stats are counted here)
      \                          /
       \                        /
        \                      /
         --(A)---- ASIC --(B)--
                    |
                   (C)
                    |
                   CPU (SW stats are counted here)

Now we have couple of flows for TX and RX (direction does not matter here):

1) port1->A->ASIC->C->CPU

   For this flow, HW and SW stats are equal.

2) port1->A->ASIC->C->CPU->C->ASIC->B->port2

   For this flow, HW and SW stats are equal.

3) port1->A->ASIC->B->port2

   For this flow, SW stats are 0.

The purpose of this patchset is to provide facility for user to
find out the difference between flows 1+2 and 3. In other words, user
will be able to see the statistics for the slow-path (through kernel).

Also note that HW stats are what someone calls "accumulated" stats.
Every packet counted by SW is also counted by HW. Not the other way around.

As a default the accumulated stats (HW) will be exposed to user
so the userspace apps can react properly.

This patchset add the SW stats (flows 1+2) under offload related stats, so
in the future we can expose other offload related stat in a similar way.

---
v9->v10:
- patch 2/3
 - removed unnecessary ()s as pointed out by Nik
v8->v9:
- patch 2/3
 - add using of idxattr and prividx
v7->v8:
- patch 2/3
 - move helping const from uapi to rtnetlink
 - cancel driver xstat nesting if it is empty
v6->v7:
- patch 1/3:
 - ndo interface changed to get the wanted stats type as an input.
 - change commit message.
- patch 2/3:
 - create a nesting for offloaded stat and put SW stats under it.
 - change the ndo call to indicate which offload stats we wants.
 - change commit message.
- patch 3/3:
 - change ndo implementation to match the changes in the previous patches.
 - change commit message.
v5->v6:
- patch 2/4 was dropped as requested by Roopa
- patch 1/3:
 - comment changed to indicate that default stats are combined stats
 - commit massage changed
- patch 2/3: (previously 3/4)
 - SW stats return nothing if there is no SW stats ndo
v4->v5:
- updated cover letter
- patch3/4:
  - using memcpy directly to copy stats as requested by DaveM
v3->v4:
- patch1/4:
  - fixed "return ()" pointed out by EricD
- patch2/4:
  - fixed if_nlmsg_size as pointed out by EricD
v2->v3:
- patch1/4:
  - added dev_have_sw_stats helper
- patch2/4:
  - avoided memcpy as requested by DaveM
- patch3/4:
  - use new dev_have_sw_stats helper
v1->v2:
- patch3/4:
  - fixed NULL initialization
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

a5ea31f5

mlxsw: spectrum: Implement offload stats ndo and expose HW stats by default · fc1bbb0f

Nogah Frankel authored Sep 16, 2016

Change the default statistics ndo to return HW statistics
(like the one returned by ethtool_ops).
The HW stats are collected to a cache by delayed work every 1 sec.
Implement the offload stat ndo.
Add a function to get SW statistics, to be called from this function.
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc1bbb0f

net: core: Add offload stats to if_stats_msg · 69ae6ad2

Nogah Frankel authored Sep 16, 2016

Add a nested attribute of offload stats to if_stats_msg
named IFLA_STATS_LINK_OFFLOAD_XSTATS.
Under it, add SW stats, meaning stats only per packets that went via
slowpath to the cpu, named IFLA_OFFLOAD_XSTATS_CPU_HIT.
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

69ae6ad2

netdevice: Add offload statistics ndo · 2c9d85d4

Nogah Frankel authored Sep 16, 2016

Add a new ndo to return statistics for offloaded operation.
Since there can be many different offloaded operation with many
stats types, the ndo gets an attribute id by which it knows which
stats are wanted. The ndo also gets a void pointer to be cast according
to the attribute id.
Signed-off-by: Nogah Frankel <nogahf@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2c9d85d4

Merge tag 'mac80211-next-for-davem-2016-09-16' of... · c13ed534

David S. Miller authored Sep 18, 2016

Merge tag 'mac80211-next-for-davem-2016-09-16' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This time we have various things - all across the board:
 * MU-MIMO sniffer support in mac80211
 * a create_singlethread_workqueue() cleanup
 * interface dump filtering that was documented but not implemented
 * support for the new radiotap timestamp field
 * send delBA in two unexpected conditions (as required by the spec)
 * connect keys cleanups - allow only WEP with index 0-3
 * per-station aggregation limit to work around broken APs
 * debugfs improvement for the integrated codel algorithm
and various other small improvements and cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

c13ed534

net: r6040: add in missing white space in error message text · 22da7349

Colin Ian King authored Sep 16, 2016

A couple of dev_err messages span two lines and the literal
string is missing a white space between words. Add the white
space and join the two lines into one.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: FLorian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22da7349

pkt_sched: fq: use proper locking in fq_dump_stats() · 695b4ec0

Eric Dumazet authored Sep 15, 2016

When fq is used on 32bit kernels, we need to lock the qdisc before
copying 64bit fields.

Otherwise "tc -s qdisc ..." might report bogus values.

Fixes: afe4fd06 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

695b4ec0

openvswitch: use percpu flow stats · db74a333

Thadeu Lima de Souza Cascardo authored Sep 15, 2016

Instead of using flow stats per NUMA node, use it per CPU. When using
megaflows, the stats lock can be a bottleneck in scalability.

On a E5-2690 12-core system, usual throughput went from ~4Mpps to
~15Mpps when forwarding between two 40GbE ports with a single flow
configured on the datapath.

This has been tested on a system with possible CPUs 0-7,16-23. After
module removal, there were no corruption on the slab cache.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Cc: pravin shelar <pshelar@ovn.org>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

db74a333

openvswitch: fix flow stats accounting when node 0 is not possible · 40773966

Thadeu Lima de Souza Cascardo authored Sep 15, 2016

On a system with only node 1 as possible, all statistics is going to be
accounted on node 0 as it will have a single writer.

However, when getting and clearing the statistics, node 0 is not going
to be considered, as it's not a possible node.

Tested that statistics are not zero on a system with only node 1
possible. Also compile-tested with CONFIG_NUMA off.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

40773966

Merge branch 'sctp-transmit-errs' · 829ff348

David S. Miller authored Sep 18, 2016

Xin Long says:

====================
sctp: fix the transmit err process

This patchset is to improve the transmit err process and also fix some
issues.

After this patchset, once the chunks are enqueued successfully, even
if the chunks fail to send out, no matter because of nodst or nomem,
no err retruns back to users any more. Instead, they are taken care
of by retransmit.

v1->v2:
  - add more details to the changelog in patch 1/6
  - add Fixes: tag in patch 2/6, 3/6
  - also revert 69b5777f in patch 3/6
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

829ff348