- 13 May, 2015 40 commits
-
-
David S. Miller authored
Willem de Bruijn says: ==================== refine packet socket rollover: 1. mitigate a case of lock contention 2. avoid exporting resource exhaustion to other sockets, by migrating only to a victim socket that has ample room 3. avoid reordering of most flows on the socket, by migrating first the flow responsible for load imbalance 4. help processes detect load imbalance, by exporting rollover counters Context: rollover implements flow migration in packet socket fanout groups in case of extreme load imbalance. It is a specific implementation of migration that minimizes reordering by selecting the same victim socket when possible (and by selecting subsequent victims in a round robin fashion, from which its name derives). Changes: v2 -> v3: - statistics: replace unsigned long with __aligned_u64 v1 -> v2: - huge flow detection: run lockless - huge flow detection: replace stored index with random - contention avoidance: test in packet_poll while lock held - contention avoidance: clear pressure sooner packet_poll and packet_recvmsg would clear only if the sock is empty to avoid taking the necessary lock. But, * packet_poll already holds this lock, so a lockless variant __packet_rcv_has_room is cheap. * packet_recvmsg is usually called only for non-ring sockets, which also runs lockless. - preparation: drop "single return" patch packet_rcv_has_room is now a locked wrapper around __packet_rcv_has_room, achieving the same (single footer). The benchmark mentioned in the patches is at https://github.com/wdebruij/kerneltools/blob/master/tests/bench_rollover.c ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Rollover indicates exceptional conditions. Export a counter to inform socket owners of this state. If no socket with sufficient room is found, rollover fails. Also count these events. Finally, also count when flows are rolled over early thanks to huge flow detection, to validate its correctness. Tested: Read counters in bench_rollover on all other tests in the patchset Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Migrate flows from a socket to another socket in the fanout group not only when the socket is full. Start migrating huge flows early, to divert possible 4-tuple attacks without affecting normal traffic. Introduce fanout_flow_is_huge(). This detects huge flows, which are defined as taking up more than half the load. It does so cheaply, by storing the rxhashes of the N most recent packets. If over half of these are the same rxhash as the current packet, then drop it. This only protects against 4-tuple attacks. N is chosen to fit all data in a single cache line. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. lpbb5:/export/hda3/willemb# ./bench_rollover -l 1000 -r -s cpu rx rx.k drop.k rollover r.huge r.failed 0 14 14 0 0 0 0 1 20 20 0 0 0 0 2 16 16 0 0 0 0 3 6168824 6168824 0 4867721 4867721 0 4 4867741 4867741 0 0 0 0 5 12 12 0 0 0 0 6 15 15 0 0 0 0 7 17 17 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Rollover has to call packet_rcv_has_room on sockets in the fanout group to find a socket to migrate to. This operation is expensive especially if the packet sockets use rings, when a lock has to be acquired. Avoid pounding on the lock by all sockets by temporarily marking a socket as "under memory pressure" when such pressure is detected. While set, only the socket owner may call packet_rcv_has_room on the socket. Once it detects normal conditions, it clears the flag. The socket is not used as a victim by any other socket in the meantime. Under reasonably balanced load, each socket writer frequently calls packet_rcv_has_room and clears its own pressure field. As a backup for when the socket is rarely written to, also clear the flag on reading (packet_recvmsg, packet_poll) if this can be done cheaply (i.e., without calling packet_rcv_has_room). This is only for edge cases. Tested: Ran bench_rollover: a process with 8 sockets in a single fanout group, each pinned to a single cpu that receives one nic recv interrupt. RPS and RFS are disabled. The benchmark uses packet rx_ring, which has to take a lock when determining whether a socket has room. Sent 3.5 Mpps of UDP traffic with sufficient entropy to spread uniformly across the packet sockets (and inserted an iptables rule to drop in PREROUTING to avoid protocol stack processing). Without this patch, all sockets try to migrate traffic to neighbors, causing lock contention when searching for a non- empty neighbor. The lock is the top 9 entries. perf record -a -g sleep 5 - 17.82% bench_rollover [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock - 99.00% spin_lock + 81.77% packet_rcv_has_room.isra.41 + 18.23% tpacket_rcv + 0.84% packet_rcv_has_room.isra.41 + 5.20% ksoftirqd/6 [kernel.kallsyms] [k] _raw_spin_lock + 5.15% ksoftirqd/1 [kernel.kallsyms] [k] _raw_spin_lock + 5.14% ksoftirqd/2 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/7 [kernel.kallsyms] [k] _raw_spin_lock + 5.12% ksoftirqd/5 [kernel.kallsyms] [k] _raw_spin_lock + 5.10% ksoftirqd/4 [kernel.kallsyms] [k] _raw_spin_lock + 4.66% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock + 4.45% ksoftirqd/3 [kernel.kallsyms] [k] _raw_spin_lock + 1.55% bench_rollover [kernel.kallsyms] [k] packet_rcv_has_room.isra.41 On net-next with this patch, this lock contention is no longer a top entry. Most time is spent in the actual read function. Next up are other locks: + 15.52% bench_rollover bench_rollover [.] reader + 4.68% swapper [kernel.kallsyms] [k] memcpy_erms + 2.77% swapper [kernel.kallsyms] [k] packet_lookup_frame.isra.51 + 2.56% ksoftirqd/1 [kernel.kallsyms] [k] memcpy_erms + 2.16% swapper [kernel.kallsyms] [k] tpacket_rcv + 1.93% swapper [kernel.kallsyms] [k] mlx4_en_process_rx_cq Looking closer at the remaining _raw_spin_lock, the cost of probing in rollover is now comparable to the cost of taking the lock later in tpacket_rcv. - 1.51% swapper [kernel.kallsyms] [k] _raw_spin_lock - _raw_spin_lock + 33.41% packet_rcv_has_room + 28.15% tpacket_rcv + 19.54% enqueue_to_backlog + 6.45% __free_pages_ok + 2.78% packet_rcv_fanout + 2.13% fanout_demux_rollover + 2.01% netif_receive_skb_internal Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Only migrate flows to sockets that have sufficient headroom, where sufficient is defined as having at least 25% empty space. The kernel has three different buffer types: a regular socket, a ring with frames (TPACKET_V[12]) or a ring with blocks (TPACKET_V3). The latter two do not expose a read pointer to the kernel, so headroom is not computed easily. All three needs a different implementation to estimate free space. Tested: Ran bench_rollover for 10 sec with 1.5 Mpps of single flow input. bench_rollover has as many sockets as there are NIC receive queues in the system. Each socket is owned by a process that is pinned to one of the receive cpus. RFS is disabled. RPS is enabled with an identity mapping (cpu x -> cpu x), to count drops with softnettop. lpbb5:/export/hda3/willemb# ./bench_rollover -r -l 1000 -s Press [Enter] to exit cpu rx rx.k drop.k rollover r.huge r.failed 0 16 16 0 0 0 0 1 21 21 0 0 0 0 2 5227502 5227502 0 0 0 0 3 18 18 0 0 0 0 4 6083289 6083289 0 5227496 0 0 5 22 22 0 0 0 0 6 21 21 0 0 0 0 7 9 9 0 0 0 0 Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
Replace rollover state per fanout group with state per socket. Future patches will add fields to the new structure. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Willem de Bruijn authored
packet_rcv_fanout calls fanout_demux_rollover twice. Move all rollover logic into the callee to simplify these callsites, especially with upcoming changes. The main differences between the two callsites is that the FLAG variant tests whether the socket previously selected by another mode (RR, RND, HASH, ..) has room before migrating flows, whereas the rollover mode has no original socket to test. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
__ip_local_out_sk() is only used from net/ipv4/ip_output.c net/ipv4/ip_output.c:94:5: warning: symbol '__ip_local_out_sk' was not declared. Should it be static? Fixes: 7026b1dd ("netfilter: Pass socket pointer down through okfn().") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
tw_timer_handler() is only used from net/ipv4/inet_timewait_sock.c Fixes: 789f558c ("tcp/dccp: get rid of central timewait timer") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jiri Pirko says: ==================== introduce programable flow dissector and cls_flower Per Davem's request, I prepared this patchset which introduces programmable flow dissector. For current users of flow_keys, there is a wrapper skb_flow_dissect_flow_keys which maintains the previous behaviour. For purposes of cls_flower, couple of new dissection keys were introduced. Note that this dissector can be also eventually used by openvswitch code. Also, as a next step, I plan to get rid of *skb_flow_get_ports(export) and *__skb_get_poff as their functionality can be now implemented by skb_flow_dissect as well. v2->v3: - remove TCA_FLOWER_POLICE attr suggested by Jamal v1->v2: - move __skb_tx_hash rather to dev.c as suggested by Alex ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
This patch introduces a flow-based filter. So far, the very essential packet fields are supported. This patch is only the first step. There is a lot of potential performance improvements possible to implement. Also a lot of features are missing now. They will be addressed in follow-up patches. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
So far, only hashes made out of ipv6 addresses could be dissected. This patch introduces support for dissection of full ipv6 addresses. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Introduce dissector infrastructure which allows user to specify which parts of skb he wants to dissect. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
next to its user. No relation to flow_dissector so it makes no sense to have it in flow_dissector.c Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
__skb_tx_hash function has no relation to flow_dissect so just move it to dev.c Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Since the definition of the function is in flow_dissector.c, it makes sense to have the declaration in flow_dissector.h Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
Since these functions are defined in flow_dissector.c, move header declarations from skbuff.h into flow_dissector.h Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
commit 56193d1b ("net: Add function for parsing the header length out of linear ethernet frames") added this function declaration but it is defined nowhere. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Jiri Pirko authored
add couple of empty lines on the way. Signed-off-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Edward Cree says: ==================== sfc: Bowdlerise PTP MCDI errors When the NIC doesn't support PTP, probe-time MCDI commands fail in predictable ways. Instead of logging cryptic MCDI errors, just log that PTP isn't supported. v2: Hopefully stop Thunderbird mangling the patches. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Edward Cree authored
Also, remove a needless netif_err() from efx_ptp_update_stats() - if the MCDI fails it'll print its own error message, we don't need another that adds no information. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Edward Cree authored
Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Westphal authored
Seems all we want here is to avoid endless 'goto reclassify' loop. tc_classify_compat even resets this counter when something other than TC_ACT_RECLASSIFY is returned, so this skb-counter doesn't break hypothetical loops induced by something other than perpetual TC_ACT_RECLASSIFY return values. skb_act_clone is now identical to skb_clone, so just use that. Tested with following (bogus) filter: tc filter add dev eth0 parent ffff: \ protocol ip u32 match u32 0 0 police rate 10Kbit burst \ 64000 mtu 1500 action reclassify Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller authored
Four minor merge conflicts: 1) qca_spi.c renamed the local variable used for the SPI device from spi_device to spi, meanwhile the spi_set_drvdata() call got moved further up in the probe function. 2) Two changes were both adding new members to codel params structure, and thus we had overlapping changes to the initializer function. 3) 'net' was making a fix to sk_release_kernel() which is completely removed in 'net-next'. 4) In net_namespace.c, the rtnl_net_fill() call for GET operations had the command value fixed, meanwhile 'net-next' adjusted the argument signature a bit. This also matches example merge resolutions posted by Stephen Rothwell over the past two days. Signed-off-by: David S. Miller <davem@davemloft.net>
-
Justin Cormack authored
Fix missing copy_from_user in macvtap SIOCSIFHWADDR ioctl. Signed-off-by: Justin Cormack <justin@netbsd.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Feldman authored
Older gcc versions (e.g. gcc version 4.4.6) don't like anonymous unions which was causing build issues on the newly added switchdev attr/obj structs. Fix this by using named union on structs. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Reported-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Scott Feldman says: ==================== switchdev: more (minor) cleanups Fix some sparse warnings and include some documentation review comments that didn't get picked up in the switchdev Spring Cleanup series. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Feldman authored
There were a few review comments on the switchdev.txt documentation that didn't get included with the Spring Cleanup series, so include them now. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Feldman authored
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Feldman authored
And let driver convert it to host-byte order as needed. Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Scott Feldman authored
Signed-off-by: Scott Feldman <sfeldma@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds authored
Pull networking fixes from David Miller: 1) Handle max TX power properly wrt VIFs and the MAC in iwlwifi, from Avri Altman. 2) Use the correct FW API for scan completions in iwlwifi, from Avraham Stern. 3) FW monitor in iwlwifi accidently uses unmapped memory, fix from Liad Kaufman. 4) rhashtable conversion of mac80211 station table was buggy, the virtual interface was not taken into account. Fix from Johannes Berg. 5) Fix deadlock in rtlwifi by not using a zero timeout for usb_control_msg(), from Larry Finger. 6) Update reordering state before calculating loss detection, from Yuchung Cheng. 7) Fix off by one in bluetooth firmward parsing, from Dan Carpenter. 8) Fix extended frame handling in xiling_can driver, from Jeppe Ledet-Pedersen. 9) Fix CODEL packet scheduler behavior in the presence of TSO packets, from Eric Dumazet. 10) Fix NAPI budget testing in fm10k driver, from Alexander Duyck. 11) macvlan needs to propagate promisc settings down the the lower device, from Vlad Yasevich. 12) igb driver can oops when changing number of rings, from Toshiaki Makita. 13) Source specific default routes not handled properly in ipv6, from Markus Stenberg. 14) Use after free in tc_ctl_tfilter(), from WANG Cong. 15) Use softirq spinlocking in netxen driver, from Tony Camuso. 16) Two ARM bpf JIT fixes from Nicolas Schichan. 17) Handle MSG_DONTWAIT properly in ring based AF_PACKET sends, from Mathias Kretschmer. 18) Fix x86 bpf JIT implementation of FROM_{BE16,LE16,LE32}, from Alexei Starovoitov. 19) ll_temac driver DMA maps TX packet header with incorrect length, fix from Michal Simek. 20) We removed pm_qos bits from netdevice.h, but some indirect references remained. Kill them. From David Ahern. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (90 commits) net: Remove remaining remnants of pm_qos from netdevice.h e1000e: Add pm_qos header net: phy: micrel: Fix regression in kszphy_probe net: ll_temac: Fix DMA map size bug x86: bpf_jit: fix FROM_BE16 and FROM_LE16/32 instructions netns: return RTM_NEWNSID instead of RTM_GETNSID on a get Update be2net maintainers' email addresses net_sched: gred: use correct backlog value in WRED mode pppoe: drop pppoe device in pppoe_unbind_sock_work net: qca_spi: Fix possible race during probe net: mdio-gpio: Allow for unspecified bus id af_packet / TX_RING not fully non-blocking (w/ MSG_DONTWAIT). bnx2x: limit fw delay in kdump to 5s after boot ARM: net: delegate filter to kernel interpreter when imm_offset() return value can't fit into 12bits. ARM: net fix emit_udiv() for BPF_ALU | BPF_DIV | BPF_K intruction. mpls: Change reserved label names to be consistent with netbsd usbnet: avoid integer overflow in start_xmit netxen_nic: use spin_[un]lock_bh around tx_clean_lock (2) net: xgene_enet: Set hardware dependency net: amd-xgbe: Add hardware dependency ...
-
David Ahern authored
Commit e2c65448 removed pm_qos from struct net_device but left the comment and header file. Remove those. Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Ahern authored
Commit e2c65448 moved pm_qos_req to e1000_adapter. Add the header file that defines the struct. Signed-off-by: David Ahern <dsahern@gmail.com> Cc: Thomas Graf <tgraf@suug.ch> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-