- 09 Aug, 2019 40 commits
-
-
Vlad Buslov authored
List of flows attached to mod header entry is used as implicit reference counter (mod header entry is deallocated when list becomes free) and as a mechanism to obtain mod header entry that flow is attached to (through list head). This is not safe when concurrent modification of list of flows attached to mod header entry is possible. Proper atomic reference counter is required to support concurrent access. As a preparation for extending mod header with reference counting, extract code that looks up and deletes mod header entry into standalone put/get helpers. In order to remove this dependency on external locking, extend mod header entry with reference counter to manage its lifetime and extend flow structure with direct pointer to mod header entry that flow is attached to. To remove code duplication between legacy and switchdev mode implementations that both support mod_hdr functionality, store mod_hdr table in dedicated structure used by both fdb and kernel namespaces. New table structure is extended with table lock by one of the following patches in this series. Implement helper function to get correct mod_hdr table depending on flow namespace. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Vlad Buslov authored
Hairpin entries creation is fully synchronized by hairpin_tbl_lock. In order to allow concurrent initialization of mlx5e_hairpin structure instances and provisioning of hairpin entries to hardware, extend mlx5e_hairpin_entry with 'res_ready' completion. Move call to mlx5e_hairpin_create() out of hairpin_tbl_lock critical section. Modify code that attaches new flows to existing hpe to wait for 'res_ready' completion before using the hpe. Insert hpe to hairpin table before provisioning it to hardware and modify all users of hairpin table to verify that hpe was fully initialized by checking hpe->hp pointer (and to wait for 'res_ready' completion, if necessary). Modify dead peer update event handling function to save hpe's to temporary list with their reference counter incremented. Wait for completion of hpe's in temporary list and update their 'peer_gone' flag outside of hairpin_tbl_lock critical section. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Vlad Buslov authored
To remove dependency on rtnl lock, protect hairpin hash table from concurrent modifications with new "hairpin_tbl_lock" mutex. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Vlad Buslov authored
To remove dependency on rtnl lock, extend hairpin entry with spinlock and use it to protect list of flows attached to hairpin entry from concurrent modifications. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
-
Vlad Buslov authored
List of flows attached to hairpin entry is used as implicit reference counter (hairpin entry is deallocated when list becomes free) and as a mechanism to obtain hairpin entry that flow is attached to (through list head). This is not safe when concurrent modification of list of flows attached to hairpin entry is possible. Proper atomic reference counter is required to support concurrent access. As a preparation for extending hairpin with reference counting, extract code that deletes hairpin entry into standalone function. In order to remove this dependency on external locking, extend hairpin entry with reference counter to manage its lifetime and extend flow structure with direct pointer to hairpin entry that flow is attached to. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Jianbo Liu <jianbol@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
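The reference-counting scheme described above can be sketched in userspace C. This is only an illustration under stated assumptions: `struct entry`, `struct flow`, and the helper names are hypothetical stand-ins, not the mlx5e structures, and C11 atomics stand in for the kernel's refcount primitives. A real table would also need protection against a lookup racing with the final put, which this sketch leaves out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a hairpin entry; not the mlx5e structure. */
struct entry {
    atomic_int refcnt; /* explicit counter replaces "list is empty" */
    int key;
};

/* Hypothetical stand-in for a flow: it keeps a direct pointer to its
 * entry instead of recovering the entry through a list head. */
struct flow {
    struct entry *e;
};

/* Allocate an entry holding one reference owned by the caller. */
static struct entry *entry_create(int key)
{
    struct entry *e = malloc(sizeof(*e));

    if (!e)
        return NULL;
    atomic_init(&e->refcnt, 1);
    e->key = key;
    return e;
}

static void entry_get(struct entry *e)
{
    atomic_fetch_add(&e->refcnt, 1);
}

/* Drop one reference; returns true if this call freed the entry. */
static bool entry_put(struct entry *e)
{
    int old = atomic_fetch_sub(&e->refcnt, 1);

    assert(old > 0);
    if (old == 1) {
        free(e);
        return true;
    }
    return false;
}

/* Attaching a flow takes a reference and records the direct pointer. */
static void flow_attach(struct flow *f, struct entry *e)
{
    entry_get(e);
    f->e = e;
}

/* Detaching clears the pointer and drops the flow's reference. */
static bool flow_detach(struct flow *f)
{
    struct entry *e = f->e;

    f->e = NULL;
    return entry_put(e);
}
```

With this shape, the entry's lifetime no longer depends on who holds the lock around the flow list: whichever caller drops the last reference frees the entry.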
-
YueHaibing authored
net/sched/sch_taprio.c:680:32: warning: entry_list_policy defined but not used [-Wunused-const-variable=] One of the points of commit a3d43c0d ("taprio: Add support adding an admin schedule") is that it removes support (it now returns "not supported") for schedules using the TCA_TAPRIO_ATTR_SCHED_SINGLE_ENTRY attribute (which were never used), the parsing of those types of schedules was the only user of this policy. So removing this policy should be fine. Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Holger Hoffstätte authored
Disabling TSO but leaving SG active results in a significant performance drop. Therefore also disable SG on RTL8168evl. This restores the original performance. Fixes: 93681cd7 ("r8169: enable HW csum and TSO") Signed-off-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Josh Hunt authored
TCP_BASE_MSS is used as the default initial MSS value when MTU probing is enabled. Update the comment to reflect this. Suggested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Josh Hunt authored
The current implementation of TCP MTU probing can considerably underestimate the MTU on lossy connections, allowing the MSS to get down to 48. We have found that in almost all of these cases on our networks these paths can handle much larger MTUs, meaning the connections are being artificially limited. Even though TCP MTU probing can raise the MSS back up, we have seen this not to be the case, causing connections to be "stuck" with an MSS of 48 when heavy loss is present. Prior to pushing out this change we could not keep TCP MTU probing enabled because of the above reasons. Now with a reasonable floor set we've had it enabled for the past 6 months. The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives administrators the ability to control the floor of MSS probing. Signed-off-by: Josh Hunt <johunt@akamai.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
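To make the effect of a probing floor concrete, here is a simplified model in C. It assumes that each probing failure halves the current MSS, caps it at a base value, and then clamps it to the floor; the function name `probe_fail_mss` and its parameters are hypothetical and this is not the kernel's actual code path.

```c
#include <assert.h>

/*
 * Simplified model of one MTU-probing failure step (an assumption, not
 * the kernel implementation): halve the current MSS, cap it at the base
 * MSS, then never let it drop below the configured floor.
 */
static int probe_fail_mss(int cur_mss, int base_mss, int floor_mss)
{
    int mss;

    assert(floor_mss > 0 && base_mss >= floor_mss);

    mss = cur_mss / 2;
    if (mss > base_mss)
        mss = base_mss;
    if (mss < floor_mss)
        mss = floor_mss;
    return mss;
}
```

Under this model, with the default floor of 48 a lossy path walks 1460, 730, 365, 182, 91, 48 and stays there, whereas a floor of 512 stops the descent after the second failure.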
-
Jiri Pirko authored
The size of the snapshot has to be the same as the size of the region, therefore no need to pass it again during snapshot creation. Remove the arg and use region->size instead. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Starting from commit d41a69f1 ("tcp: make tcp_sendmsg() aware of socket backlog") loopback flows got hurt, because for each skb sent, the socket receives an immediate ACK and sk_flush_backlog() causes extra work. Intent was to not let the backlog grow too much, but we went a bit too far. We can check the backlog every 16 skbs (about 1MB chunks) to increase TCP over loopback performance by about 15%. Note that the call to sk_flush_backlog() handles a single ACK, thanks to coalescing done on backlog, but cleans the 16 skbs found in rtx rb-tree. Reported-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
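The "check every 16 skbs" idea amounts to a simple counter in the send loop. The sketch below is a userspace model only: `struct sender` and `sender_queue_skb()` are hypothetical names, and incrementing `flushes` stands in for the call to sk_flush_backlog().

```c
#include <assert.h>
#include <stddef.h>

#define FLUSH_INTERVAL 16 /* from the commit text: check every 16 skbs */

/* Hypothetical model of the per-call send state. */
struct sender {
    int since_flush; /* skbs queued since the last backlog check */
    int flushes;     /* how many backlog checks have run */
};

/* Called once per queued skb; returns 1 when a backlog check ran. */
static int sender_queue_skb(struct sender *s)
{
    assert(s != NULL);

    if (++s->since_flush < FLUSH_INTERVAL)
        return 0;

    s->since_flush = 0;
    s->flushes++; /* stand-in for sk_flush_backlog() */
    return 1;
}
```

Queuing 64 skbs through this model triggers four backlog checks instead of one per skb, which is the reduction in extra work the commit message describes.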
-
Denis Efremov authored
octeon_mbox_process_cmd() directly writes the PCI_EXP_DEVCTL_BCR_FLR bit, which bypasses timing requirements imposed by the PCIe spec. This patch fixes the function to use the pcie_flr() interface instead. Signed-off-by: Denis Efremov <efremov@linux.com> Reviewed-by: Andrew Murray <andrew.murray@arm.com> Reviewed-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Heiner Kallweit authored
We allocate 16kb per rx buffer, so we can avoid some overhead by using alloc_pages_node directly instead of bothering kmalloc_node. Due to this change buffers are page-aligned now, therefore the alignment check can be removed. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Acked-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YueHaibing authored
Fixes gcc '-Wunused-but-set-variable' warning: net/sched/sch_fq_codel.c: In function fq_codel_dequeue: net/sched/sch_fq_codel.c:288:23: warning: variable prev_ecn_mark set but not used [-Wunused-but-set-variable] net/sched/sch_fq_codel.c:288:6: warning: variable prev_drop_count set but not used [-Wunused-but-set-variable] They are not used since commit 77ddaff2 ("fq_codel: Kill useless per-flow dropped statistic") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Extend existing driver for Spectrum and Spectrum-2 ASICs to support Spectrum-3 ASIC as well. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jose Abreu says: ==================== net: stmmac: Improvements for -next [ This is just a rebase of v2 into latest -next in order to avoid a merge conflict ] Couple of improvements for -next tree. More info in commit logs. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Add a selftest for the Flexible RX Parser feature. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
XGMAC cores also support the Flexible RX Parser feature. Add the support for it in the XGMAC core. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
XGMAC also supports Safety Features. This patch implements the configuration and handling of this feature in XGMAC core. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Add a selftest for VLAN and Double VLAN Filtering in stmmac. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Implement the VLAN Hash Filtering feature in XGMAC core. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Add a test for RSS in the stmmac selftests. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Implement the RSS functionality and add the corresponding callbacks in XGMAC core. Changes from v1: - Do not use magic constants (Jakub) - Use ethtool_rxfh_indir_default() (Jakub) Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Implement the TX Queue Priority callback in XGMAC core. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Implement the TX Queue Weight callback. In order for this to be active we also need to set ETS algorithm when configuring Queue. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jose Abreu authored
Implement the MMC counters feature in XGMAC core. Signed-off-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
John Rutherford authored
Since node internal messages are passed directly to the socket, it is not possible to observe those messages via tcpdump or wireshark. We now remedy this by making it possible to clone such messages and send the clones to the loopback interface. The clones are dropped at reception and have no functional role except making the traffic visible. The feature is enabled if network taps are active for the loopback device. pcap filtering restrictions require the messages to be presented to the receiving side of the loopback device. v3 - Function dev_nit_active used to check for network taps. - Procedure netif_rx_ni used to send cloned messages to loopback device. Signed-off-by: John Rutherford <john.rutherford@dektech.com.au> Acked-by: Jon Maloy <jon.maloy@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
wenxu says: ==================== flow_offload: add indr-block in nf_table_offload This patch series makes nftables offload support vlan and tunnel device offload through the indr-block architecture. The first four patches move the tc indr block to flow offload and rename it to flow-indr-block. Because the new flow-indr-block can't get the tcf_block directly, the fifth patch provides a callback list to get the flow_block of each subsystem immediately when the device registers and contains a block. The last patch makes nf_tables_offload support flow-indr-block. This version adds a mutex lock for add/del of flow_indr_block_ing_cb. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
wenxu authored
Make nftables support the indr-block call, so that nftables can offload vlan and tunnel devices.

nft add table netdev firewall
nft add chain netdev firewall aclout { type filter hook ingress offload device mlx_pf0vf0 priority - 300 \; }
nft add rule netdev firewall aclout ip daddr 10.0.0.1 fwd to vlan0
nft add chain netdev firewall aclin { type filter hook ingress device vlan0 priority - 300 \; }
nft add rule netdev firewall aclin ip daddr 10.0.0.7 fwd to mlx_pf0vf0

Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
wenxu authored
Provide a callback list to find the blocks of the tc and nft subsystems. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
wenxu authored
Move the tc indirect block to flow_offload and rename it to flow indirect block, so that nf_tables can use the indr block architecture. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
wenxu authored
This patch stops indr_block_call from accessing struct tc_indr_block_cb and tc_indr_block_dev directly. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
wenxu authored
Remove the tcf_block in the tc_indr_block_dev for multi-subsystem support. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
wenxu authored
This patch stops tc_indr_block_ing_cmd from accessing struct tc_indr_block_dev and tc_indr_block_cb. Signed-off-by: wenxu <wenxu@ucloud.cn> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Edward Cree says:

====================
net: batched receive in GRO path

This series listifies part of GRO processing, in a manner which allows those packets which are not GROed (i.e. for which dev_gro_receive returns GRO_NORMAL) to be passed on to the listified regular receive path. dev_gro_receive() itself is not listified, nor the per-protocol GRO callback, since GRO's need to hold packets on lists under napi->gro_hash makes keeping the packets on other lists awkward, and since the GRO control block state of held skbs can refer only to one 'new' skb at a time.

Instead, when napi_frags_finish() handles a GRO_NORMAL result, stash the skb onto a list in the napi struct, which is received at the end of the napi poll or when its length exceeds the (new) sysctl net.core.gro_normal_batch.

Performance figures with this series, collected on a back-to-back pair of Solarflare sfn8522-r2 NICs with 120-second NetPerf tests. In the stats, sample size n for old and new code is 6 runs each; p is from a Welch t-test. Tests were run both with GRO enabled and disabled, the latter simulating uncoalesceable packets (e.g. due to IP or TCP options). The receive side (which was the device under test) had the NetPerf process pinned to one CPU, and the device interrupts pinned to a second CPU. CPU utilisation figures (used in cases of line-rate performance) are summed across all CPUs. net.core.gro_normal_batch was left at its default value of 8.

TCP 4 streams, GRO on: all results line rate (9.415Gbps)
net-next: 210.3% cpu
after #1: 181.5% cpu (-13.7%, p=0.031 vs net-next)
after #3: 196.7% cpu (- 8.4%, p=0.136 vs net-next)

TCP 4 streams, GRO off:
net-next: 8.017 Gbps
after #1: 7.785 Gbps (- 2.9%, p=0.385 vs net-next)
after #3: 7.604 Gbps (- 5.1%, p=0.282 vs net-next. But note *)

TCP 1 stream, GRO off:
net-next: 6.553 Gbps
after #1: 6.444 Gbps (- 1.7%, p=0.302 vs net-next)
after #3: 6.790 Gbps (+ 3.6%, p=0.169 vs net-next)

TCP 1 stream, GRO on, busy_read = 50: all results line rate
net-next: 156.0% cpu
after #1: 174.5% cpu (+11.9%, p=0.015 vs net-next)
after #3: 165.0% cpu (+ 5.8%, p=0.147 vs net-next)

TCP 1 stream, GRO off, busy_read = 50:
net-next: 6.488 Gbps
after #1: 6.625 Gbps (+ 2.1%, p=0.059 vs net-next)
after #3: 7.351 Gbps (+13.3%, p=0.026 vs net-next)

TCP_RR 100 streams, GRO off, 8000 byte payload
net-next: 995.083 us
after #1: 969.167 us (- 2.6%, p=0.204 vs net-next)
after #3: 976.433 us (- 1.9%, p=0.254 vs net-next)

TCP_RR 100 streams, GRO off, 8000 byte payload, busy_read = 50:
net-next: 2.851 ms
after #1: 2.871 ms (+ 0.7%, p=0.134 vs net-next)
after #3: 2.937 ms (+ 3.0%, p<0.001 vs net-next)

TCP_RR 100 streams, GRO off, 1 byte payload, busy_read = 50:
net-next: 867.317 us
after #1: 865.717 us (- 0.2%, p=0.334 vs net-next)
after #3: 868.517 us (+ 0.1%, p=0.414 vs net-next)

(*) These tests produced a mixture of line-rate and below-line-rate results, meaning that statistically speaking the results were 'censored' by the upper bound, and were thus not normally distributed, making a Welch t-test mathematically invalid. I therefore also calculated estimators according to [1], which gave the following:
net-next: 8.133 Gbps
after #1: 8.130 Gbps (- 0.0%, p=0.499 vs net-next)
after #3: 7.680 Gbps (- 5.6%, p=0.285 vs net-next)
(though my procedure for determining ν wasn't mathematically well-founded either, so take that p-value with a grain of salt).

A further check came from dividing the bandwidth figure by the CPU usage for each test run, giving:
net-next: 3.461
after #1: 3.198 (- 7.6%, p=0.145 vs net-next)
after #3: 3.641 (+ 5.2%, p=0.280 vs net-next)

The above results are fairly mixed, and in most cases not statistically significant. But I think we can roughly conclude that the series marginally improves non-GROable throughput, without hurting latency (except in the large-payload busy-polling case, which in any case yields horrid performance even on net-next (almost triple the latency without busy-poll)). Also, drivers which, unlike sfc, pass UDP traffic to GRO would expect to see a benefit from gaining access to batching.

Changed in v3:
* gro_normal_batch sysctl now uses SYSCTL_ONE instead of &one
* removed RFC tags (no comments after a week means no-one objects, right?)

Changed in v2:
* During busy poll, call gro_normal_list() to receive batched packets after each cycle of the napi busy loop. See comments in Patch #3 for complications of doing the same in busy_poll_stop().

[1]: Cohen 1959, doi: 10.1080/00401706.1959.10489859
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
-
Edward Cree authored
When GRO decides not to coalesce a packet, in napi_frags_finish(), instead of passing it to the stack immediately, place it on a list in the napi struct. Then, at flush time (napi_complete_done(), napi_poll(), or napi_busy_loop()), call netif_receive_skb_list_internal() on the list. We'd like to do that in napi_gro_flush(), but it's not called if !napi->gro_bitmask, so we have to do it in the callers instead. (There are a handful of drivers that call napi_gro_flush() themselves, but it's not clear why, or whether this will affect them.) Because a full 64 packets is an inefficiently large batch, also consume the list whenever it exceeds gro_normal_batch, a new net/core sysctl that defaults to 8. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
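The batching behaviour described above can be modelled in a few lines of userspace C. Everything here is a hypothetical sketch: `struct napi_model`, `normal_one()`, and `normal_list_flush()` are illustrative names, packets are plain ints, and bumping `deliveries` stands in for netif_receive_skb_list_internal(). The sketch flushes once the list length reaches the limit; whether the real threshold is "reaches" or "exceeds" is an assumption here.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BATCH 64 /* upper bound on one poll's worth of packets */

/* Hypothetical model of the per-napi list of non-coalesced packets. */
struct napi_model {
    int rx_list[MAX_BATCH];
    size_t rx_count;    /* packets currently stashed */
    size_t batch_limit; /* models gro_normal_batch (default 8) */
    size_t deliveries;  /* number of list hand-offs to the stack */
    size_t delivered;   /* total packets handed off */
};

/* Hand the whole stashed list to the stack in one call. */
static void normal_list_flush(struct napi_model *n)
{
    if (n->rx_count == 0)
        return;
    n->deliveries++; /* stand-in for the listified receive call */
    n->delivered += n->rx_count;
    n->rx_count = 0;
}

/* Stash one non-coalesced packet; flush when the batch is full. */
static void normal_one(struct napi_model *n, int pkt)
{
    assert(n->batch_limit >= 1 && n->batch_limit <= MAX_BATCH);

    n->rx_list[n->rx_count++] = pkt;
    if (n->rx_count >= n->batch_limit)
        normal_list_flush(n);
}
```

A poll that sees 20 such packets with a limit of 8 hands off two full batches during the poll and the remaining four when the poll ends and calls `normal_list_flush()`, so the stack is entered three times instead of twenty.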
-
Edward Cree authored
Same rationale as for sfc, except that this wasn't performance-tested. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Edward Cree authored
We already scored points when handling the RX event, no-one else does this, and looking at the history it appears this was originally meant to only score on merges, not on GRO_NORMAL. Moreover, it gets in the way of changing GRO to not immediately pass GRO_NORMAL skbs to the stack. Performance testing with four TCP streams received on a single CPU (where throughput was line rate of 9.4Gbps in all tests) showed a 13.7% reduction in RX CPU usage (n=6, p=0.03). Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Rahul Verma authored
Supported ports in ethtool <eth1> are displayed based on media type. For media type fibre and twinaxial, port type is "FIBRE". Media type Base-T is "TP" and media KR is "Backplane". V1->V2: Corrected the subject. Signed-off-by: Rahul Verma <rahulv@marvell.com> Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Chuhong Yuan authored
All refcount operations are now protected by spinlocks, so the atomic counter can be replaced by a normal int. This patch depends on PATCH 1/2. Signed-off-by: Chuhong Yuan <hslester96@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-