Commits · d41a69f1d390fa3f2546498103cdcd78b30676ff · nexedi / linux

02 May, 2016 33 commits

tcp: make tcp_sendmsg() aware of socket backlog · d41a69f1

Eric Dumazet authored Apr 29, 2016

Large sendmsg()/write() hold socket lock for the duration of the call,
unless sk->sk_sndbuf limit is hit. This is bad because incoming packets
are parked into socket backlog for a long time.
Critical decisions like fast retransmit might be delayed.
Receivers have to maintain a big out of order queue with additional cpu
overhead, and also possible stalls in TX once windows are full.

Bidirectional flows are particularly hurt since the backlog can become
quite big if the copy from user space triggers IO (page faults)

Some applications learnt to use sendmsg() (or sendmmsg()) with small
chunks to avoid this issue.

Kernel should know better, right ?

Add a generic sk_flush_backlog() helper and use it right
before a new skb is allocated. Typically we put 64KB of payload
per skb (unless MSG_EOR is requested) and checking socket backlog
every 64KB gives good results.

As a matter of fact, tests with TSO/GSO disabled give very nice
results, as we manage to keep a small write queue and smaller
perceived rtt.

Note that sk_flush_backlog() maintains socket ownership,
so is not equivalent to a {release_sock(sk); lock_sock(sk);},
to ensure implicit atomicity rules that sendmsg() was
giving to (possibly buggy) applications.

In this simple implementation, I chose to not call tcp_release_cb(),
but we might consider this later.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@fb.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

d41a69f1

net: do not block BH while processing socket backlog · 5413d1ba

Eric Dumazet authored Apr 29, 2016

Socket backlog processing is a major latency source.

With current TCP socket sk_rcvbuf limits, I have sampled __release_sock()
holding cpu for more than 5 ms, and packets being dropped by the NIC
once ring buffer is filled.

All users are now ready to be called from process context,
we can unblock BH and let interrupts be serviced faster.

cond_resched_softirq() could be removed, as it has no more user.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

5413d1ba

sctp: prepare for socket backlog behavior change · 860fbbc3

Eric Dumazet authored Apr 29, 2016

sctp_inq_push() will soon be called without BH being blocked
when generic socket code flushes the socket backlog.

It is very possible SCTP can be converted to not rely on BH,
but this needs to be done by SCTP experts.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

860fbbc3

udp: prepare for non BH masking at backlog processing · e61da9e2

Eric Dumazet authored Apr 29, 2016

UDP uses the generic socket backlog code, and this will soon
be changed to not disable BH when protocol is called back.

We need to use appropriate SNMP accessors.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e61da9e2

dccp: do not assume DCCP code is non preemptible · 7309f882

Eric Dumazet authored Apr 29, 2016

DCCP uses the generic backlog code, and this will soon
be changed to not disable BH when protocol is called back.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7309f882

tcp: do not block bh during prequeue processing · fb3477c0

Eric Dumazet authored Apr 29, 2016

AFAIK, nothing in current TCP stack absolutely wants BH
being disabled once socket is owned by a thread running in
process context.

As mentioned in my prior patch ("tcp: give prequeue mode some care"),
processing a batch of packets might take time, better not block BH
at all.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

fb3477c0

tcp: do not assume TCP code is non preemptible · c10d9310

Eric Dumazet authored Apr 29, 2016

We want to to make TCP stack preemptible, as draining prequeue
and backlog queues can take lot of time.

Many SNMP updates were assuming that BH (and preemption) was disabled.

Need to convert some __NET_INC_STATS() calls to NET_INC_STATS()
and some __TCP_INC_STATS() to TCP_INC_STATS()

Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset()
and tcp_v4_send_ack(), we add an explicit preempt disabled section.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c10d9310

Merge branch 'xgene-channel-number' · 5e59c83f

David S. Miller authored May 02, 2016

Iyappan Subramanian says:

====================
drivers: net: xgene: fix: Get channel number from device binding

This patch set adds 'channel' property to get ethernet to CPU channel number,
thus decoupling the Linux driver from static resource selection.

v2: Address review comments from v1
- removed irq reference from Linux driver
- added 'channel' property to get ethernet to CPU channel number

v1:
- Initial version
====================
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5e59c83f

dtb: xgene: Add channel property · 6619ac5a

Iyappan Subramanian authored Apr 29, 2016

Added 'channel' property, describing ethernet to CPU channel number.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6619ac5a

Documentation: dtb: xgene: Add channel property · 4cac949f

Iyappan Subramanian authored Apr 29, 2016

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

4cac949f

drivers: net: xgene: Get channel number from device binding · 2a37daa6

Iyappan Subramanian authored Apr 29, 2016

This patch gets ethernet to CPU channel (prefetch buffer number) from
the newly added 'channel' property, thus decoupling Linux driver from
resource management.
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

2a37daa6

Merge branch 'qed-selftests' · 582a1db9

David S. Miller authored May 02, 2016

Sudarsana Reddy Kalluru says:

====================
qed/qede: ethtool selftests support.

This series adds the driver support for following selftests:
1. Register test
2. Memory test
3. Clock test
4. Interrupt test
5. Internal loopback test
Patch (1) adds the qed driver infrastructure for selftests. Patches (2) and
(3) add qede driver support for ethtool selftests.

Please consider applying this series to "net-next".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

582a1db9

qede: add implementation for internal loopback test. · 16f46bf0

Sudarsana Reddy Kalluru authored Apr 28, 2016

This patch adds the qede implementation for internal loopback test.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: Manish Chopra <manish.chopra@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

16f46bf0

qede: add support for selftests. · 3044a02e

Sudarsana Reddy Kalluru authored Apr 28, 2016

This patch adds the qede ethtool support for the following tests:
- interrupt test
- memory test
- register test
- clock test
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

3044a02e

qed: add infrastructure for device self tests. · 03dc76ca

Sudarsana Reddy Kalluru authored Apr 28, 2016

This patch adds the functionality and APIs needed for selftests.
It adds the ability to configure the link-mode which is required for the
implementation of loopback tests. It adds the APIs for clock test,
register test, interrupt test and memory test.
Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

03dc76ca

net: dsa: mv88e6xxx: replace ds with ps where possible · 158bc065

Andrew Lunn authored Apr 28, 2016

The dsa_switch structure ds is actually needed in very few places,
mostly during setup of the switch. The private structure ps is however
needed nearly everywhere. Pass ps, not ds internally.

[vd: rebased Andrew's patch.]
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

158bc065

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue · 8cd14ccb

David S. Miller authored May 01, 2016

Jeff Kirsher says:

====================
40GbE Intel Wired LAN Driver Updates 2016-05-01

This series contains updates to i40e and i40evf.

The theme of this series is code reduction, with several code cleanups in
this series.  Starting with Neerav's removal of the code that implemented
the HMC AQ APIs and calls, since they are now obsolete and not supported
by firmware.

Anjali changes the default of VFs to make sure they are not trusted or
privileged until its explicitly set for trust through the new NDO op
interface.  Also limited the number of MAC and VLAN addresses a VF can
add if it is untrusted/privileged.

Carolyn syncs the VF code for the changes made to the PF for the RSS
hash tuple settings, which ends up cleaning up much of the existing code.

Jesse cleans up compiler warnings which were found with gcc's W=2 option.
Then removed duplicate code, especially since only one copy was actually
being used.

Jacob addresses an issue which was found when testing GCC 6's which
happens to produce new warnings when you left shift a signed value
beyond the storage sizeof the type.  The converts i40e & i40evf to use
the BIT() macro more consistently.

Alex actually bucks the trend of code removal by adding support for
both drivers to use GSO_PARTIAL so that segmentation of frames with
checksums enabled in outer headers is supported.  Fortunately it does
not take much to add this support!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

8cd14ccb

sctp: signal sk_data_ready earlier on data chunks reception · 0970f5b3

Marcelo Ricardo Leitner authored Apr 29, 2016

Dave Miller pointed out that fb586f25 ("sctp: delay calls to
sk_data_ready() as much as possible") may insert latency specially if
the receiving application is running on another CPU and that it would be
better if we signalled as early as possible.

This patch thus basically inverts the logic on fb586f25 and signals
it as early as possible, similar to what we had before.

Fixes: fb586f25 ("sctp: delay calls to sk_data_ready() as much as possible")
Reported-by: Dave Miller <davem@davemloft.net>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0970f5b3

mdio_bus: Fix MDIO bus scanning in __mdiobus_register() · 70e927b9

Marek Vasut authored May 02, 2016

Since commit b74766a0 ("phylib: don't return NULL
from get_phy_device()") in linux-next, phy_get_device() will return
ERR_PTR(-ENODEV) instead of NULL if the PHY device ID is all ones.

This causes problem with stmmac driver and likely some other drivers
which call mdiobus_register(). I triggered this bug on SoCFPGA MCVEVK
board with linux-next 20160427 and 20160428. In case of the stmmac, if
there is no PHY node specified in the DT for the stmmac block, the stmmac
driver ( drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c function
stmmac_mdio_register() ) will call mdiobus_register() , which will
register the MDIO bus and probe for the PHY.

The mdiobus_register() resp. __mdiobus_register() iterates over all of
the addresses on the MDIO bus and calls mdiobus_scan() for each of them,
which invokes get_phy_device(). Before the aforementioned patch, the
mdiobus_scan() would return NULL if no PHY was found on a given address
and mdiobus_register() would continue and try the next PHY address. Now,
mdiobus_scan() returns ERR_PTR(-ENODEV), which is caught by the
'if (IS_ERR(phydev))' condition and the loop exits immediately if the
PHY address does not contain PHY.

Repair this by explicitly checking for the ERR_PTR(-ENODEV) and if this
error comes around, continue with the next PHY address.
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dinh Nguyen <dinguyen@opensource.altera.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

70e927b9

i40e/i40evf: Add support for GSO partial with UDP_TUNNEL_CSUM and GRE_CSUM · 1c7b4a23

Alexander Duyck authored Apr 14, 2016

This patch makes it so that i40e and i40evf can use GSO_PARTIAL to support
segmentation for frames with checksums enabled in outer headers.  As a
result we can now send data over these types of tunnels at over 20Gb/s
versus the 12Gb/s that was previously possible on my system.

The advantage with the i40e parts is that this offload is mostly
transparent as the hardware still deals with the inner and/or outer IPv4
headers so the IP ID is still incrementing for both when this offload is
performed.
Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

1c7b4a23

i40evf: make use of BIT() macro to avoid signed left shift · ae63bff0

Jacob Keller authored Apr 13, 2016

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ae63bff0

i40e: make use of BIT() macro to prevent left shift of signed values · 2101bac2

Jacob Keller authored Apr 13, 2016

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

2101bac2

i40e/i40evf: fix I40E_MASK signed shift overflow warnings · dcb57456

Jacob Keller authored Apr 13, 2016

GCC 6 has a new warning which will display when you attempt to left
shift a signed value beyond the storage size of the type. I40E_MASK
generates a mask value for 32bit registers. Properly typecast the mask
value and place the values in parenthesis to prevent macro expansion
issues.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

dcb57456

i40e/i40evf : Bump driver version from 1.5.5 to 1.5.10 · 5a6fc256

Harshitha Ramamurthy authored Apr 13, 2016

Signed-off-by: Harshitha Ramamurthy <harshitha.ramamurthy@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

5a6fc256

i40e: Update device ids for X722 · a3aa5036

Catherine Sullivan authored Apr 13, 2016

Add a device ID for X722.

Change-Id: I574f2345ab341de98a6a1c212d0603af853e48b0
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

a3aa5036

i40e: Drop extra copy of function · de38fef6

Jesse Brandeburg authored Apr 13, 2016

i40e_release_rx_desc was in two files, but was only used
and needed in txrx.c.  Get rid of the extra copy.

Change-Id: I86e18239aa03531fc198b6c052847475084a9200
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

de38fef6

i40e: Use consistent type for vf_id · a1b5a24f

Jesse Brandeburg authored Apr 13, 2016

The driver was all over the place using signed or unsigned types
for vf_id, when it should always be signed.

This fixes warnings of type unsafe comparisons from gcc with W=2.

Change-Id: I2cb681f83d0f68ca124d2e4131e4ac0d9f8a6b22
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

a1b5a24f

i40e: PTP - avoid aggregate return warnings · cdc3d932

Jesse Brandeburg authored Apr 13, 2016

Aggregate return warnings are when struct types are returned
and must be copied to the lvalue with a struct copy by the compiler.

This fixes warnings of type aggregate-return from gcc with W=2.

Change-Id: I896b1bf514544bf0faeb458869d79914b9f1b168
Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

cdc3d932

i40e: Fix uninitialized variable · 3ed439c5

Catherine Sullivan authored Apr 13, 2016

We have an uninitialized variable warning for valid_len for one case in
validate_vf_mesg. To fix this, just initialize it to 0 at the top of the
function and remove all of the now redundant assignments to 0 in the
individual cases.

Change-Id: Iacbd97f4c521ed8d662eef803a598d8707708cfd
Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

3ed439c5

i40evf: RSS Hash Option parameters · b29699b3

Carolyn Wyborny authored Apr 13, 2016

This patch syncs the VF code for the changes made to the PF for the RSS
hash tuple settings. Since the VF still cannot change the RSS hash
settings, change the code to make this clear to the user. Previously,
the default settings were returned in this function. However, the
default can be changed by the PF so this does not make sense anymore.

Change-Id: I085eaf005fc7978b440d2a1bf2b2dd7cadaff39b
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

b29699b3

i40e: Remove HMC AQ API implementation · 2b79c58d

Neerav Parikh authored May 01, 2016

Remove the code that implements the HMC AQ APIs and call these APIs.
This is done because these are obsolete APIs and are not supported
by firmware.

Change-ID: I5d771d8f37c3e16e7b0a972ff9b27e75aa2d05d4
Signed-off-by: Neerav Parikh <neerav.parikh@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

2b79c58d

i40e: Prevent falling to promiscuous if the VF is not trusted · a856b5cb

Anjali Singhai Jain authored Apr 13, 2016

With this change a non trusted VF can never fall to promiscuous
mode when there is no room for a MAC/VLAN filter.

Change-Id: I8a155aa25c0bcdc6093414920c9ade4ee0bd20e8
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

a856b5cb

i40e: Limit the number of MAC and VLAN addresses that can be added for VFs · 5f527ba9

Anjali Singhai Jain authored Apr 13, 2016

If the VF is privileged/trusted it can do as it may please including
but not limited to hogging resources and playing unfair.
But if the VF is not privileged/trusted it still can add some number
(8) of MAC and VLAN addresses.
Other restrictions with respect to Port VLAN and normal VLAN still apply
to not privileged/trusted VF.

Change-Id: I3a9529201b184c8873e1ad2e300aff468c9e6296
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

5f527ba9

01 May, 2016 3 commits

tipc: set 'active' state correctly for first established link · def22c47

Jon Paul Maloy authored Apr 28, 2016

When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although it in reality
is ACTIVE.

This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

def22c47

of: of_mdio: Check if MDIO bus controller is available · a77f4f70

Florian Fainelli authored Apr 28, 2016

Add a check whether the 'struct device_node' pointer passed to
of_mdiobus_register() is an available (aka enabled) node in the Device
Tree.

Rationale for doing this are cases where an Ethernet MAC provides a MDIO
bus controller and node, and an additional Ethernet MAC might be
connecting its PHY/switches to that first MDIO bus controller, while
still embedding one internally which is therefore marked as "disabled".

Instead of sprinkling checks like these in callers of
of_mdiobus_register(), do this in a central location.
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a77f4f70

i40e: Change the default for VFs to be not privileged · 692fb0a7

Anjali Singhai Jain authored Apr 13, 2016

Make sure a VF is not trusted/privileged until its explicitly
set for trust through the new NDO op interface.

Change-Id: I476385c290d2b4901d8fceb29de43546accdc499
Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

692fb0a7

29 Apr, 2016 4 commits

Merge branch 'mlx5-aRFS' · 3cfef195

David S. Miller authored Apr 29, 2016

Saeed Mahameed says:

====================
Mellanox 100G mlx5 ethernet aRFS support

This series adds accelerated RFS support for the mlx5e driver.
I have added one patch non-related to aRFS that fixes the rtnl_lock
warning mlx5 driver been getting since b7aade15 ('vxlan: break dependency with netdev drivers')

aRFS support in details:

A direct TIR per RQ is now required in order to have the essential building blocks
for aRFS.  Today the driver has one direct TIR that forwards traffic to RQ[0] (core 0),
and one indirect TIR for RSS indirection table.  For that we've added one direct TIR
per RQ, e.g.: TIR[i] -> RQ[i] (core i).

Publicize Modify flow rule destination and reveal it in flow steering API, to have the
ability to dynamically modify the destination TIR(core) for aRFS rules from the
ethernet driver.

Initializing CPU reverse mapping to notify upper layer on internal receive queue cpu
mappings.

Some design refactoring for mlx5e ethernet driver flow tables and flow steering API.
Now the caller of create_flow_table can choose the level of the flow table, this way
we will create the mlx5e flow tables in a reversed order and connect them as we go,
we create flow table[i+1] before flow table[i] to be able to set flow table[i + 1] as
a destination of flow table[i] once flow table[i] is created.
also we have split the main flow table in the following manner:
    - From before: RX packet had to visit two flow tables until it is delivered to its receive queue:
        RX packet -> vlan filter flow table -> main flow table.
        > vlan filter will check the packet vlan field is allowed.
        > main flow will check if the dest mac is allowed and will check the l3/l4 headers to
        retrieve the RSS hash for steering the packet into its final receive queue.

    - Now main flow table is split into l2 dst mac steering table and ttc (traffic type classifier) table:
        RX packet -> vlan filter -> l2 table -> ttc table
        > vlan filter - same as before
        > L2 filter - filter packets according their destination mac address
        > ttc table - classify packet headers for RSS steering
            - L3/L4 classification rules to steer the packet according to thier headers hash
            - in case of none of the rules applies the packet is steered to RQ[0]

After the above refactoring all left to-do is to create aRFS flow table which will manage
aRFS steering rules to forward traffic to the desired RQ (core) and just connect the ttc
table rules destinations to aRFS flow table.

aRFS flow table in case of a miss will deliver the traffic to the core where the original
ttc hash would have chosen.

TTC table is not initialized and enabled until the user explicitly asks to, i.e. setting the NETIF_F_NTUPLE
to ON.  This way there is no need for ttc table to forward traffic to aRFS table unless required.
When setting back to OFF aRFS flow table is disabled and disconnected.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>

3cfef195

net/mlx5e: Enabling aRFS mechanism · 45bf454a

Maor Gottlieb authored Apr 29, 2016

Accelerated RFS requires that ntuple filtering is enabled via
ethtool and driver supports ndo_rx_flow_steer.
When the ntuple filtering is enabled, we modify the l3_l4 ttc
rules to point on the aRFS flow tables and when the filtering
is disabled, we modify the l3_l4 ttc rules to point on the RSS
TIRs.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

45bf454a

net/mlx5e: Add accelerated RFS support · 18c908e4

Maor Gottlieb authored Apr 29, 2016

Implement ndo_rx_flow_steer ndo.
A new flow steering rule will be composed from the
skb 4-tuple and added to the hardware aRFS flow table.

Each rule is stored in an internal hash table, if such
skb 4-tuple rule already exists we update the corresponding
hardware steering rule with the new destination.

For garbage collection rps_may_expire_flow will be
invoked for a limited amount of old rules upon any
ndo_rx_flow_steer invocation.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

18c908e4

net/mlx5e: Create aRFS flow tables · 1cabe6b0

Maor Gottlieb authored Apr 29, 2016

Create the following four flow tables for aRFS usage:
1. IPv4 TCP - filtering 4-tuple of IPv4 TCP packets.
2. IPv6 TCP - filtering 4-tuple of IPv6 TCP packets.
3. IPv4 UDP - filtering 4-tuple of IPv4 UDP packets.
4. IPv6 UDP - filtering 4-tuple of IPv6 UDP packets.

Each flow table has two flow groups: one for the 4-tuple
filtering (full match)  and the other contains * rule for miss rule.

Full match rule means a hit for aRFS and packet will be forwarded
to the dedicated RQ/Core, miss rule packets will be forwarded to
default RSS hashing.
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1cabe6b0