- 23 Oct, 2017 19 commits
-
-
David S. Miller authored
Jiri Pirko says: ==================== mlxsw: Add support for non-equal-cost multi-path Ido says: In the device, nexthops are stored as adjacency entries in an array called the KVD linear (KVDL). When a multi-path route is hit the packet's headers are hashed and then converted to an index into KVDL based on the adjacency group's size and base index. Up until now the driver ignored the `weight` parameter for multi-path routes and allocated only one adjacency entry for each nexthop with a limit of 32 nexthops in a group. This set makes the driver take the `weight` parameter into account when allocating adjacency entries. First patch teaches dpipe to show the size of the adjacency group, so that users will be able to determine the actual weight of each nexthop. The second patch refactors the KVDL allocator, making it more receptive towards the addition of another partition later in the set. Patches 3-5 introduce small changes towards the actual change in the sixth patch that populates the adjacency entries according to their relative weight. Last two patches finally add another partition to the KVDL, which allows us to allocate more than 32 entries per-group and thus support more nexthops and also provide higher accuracy with regards to the requested weights. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The KVD linear is currently partitioned into two partitions. One for single entries and another for groups of 32 entries. Add another partition consisting of groups of 512 entries which will allow us to more accurately represent the nexthop weights in non-equal cost multi-path routing. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The memory region where adjacency entries (nexthops) are stored is called the KVD linear and is configured during initialization with a size of 64K. Extend this area with 32K more entries, that will be partitioned into 64 groups of 0.5K entries, thereby allowing us to support weighted nexthops with high accuracy. Change the ratio between both types of hash entries, so as to prevent reduction in the number of double hash entries, which are used for IPv6 neighbours and routes with a prefix length greater than 64. Note that the user will be able to control all these sizes once the devlink resource manager is introduced. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
Up until now the driver assumed all the nexthops have an equal weight and wrote each to a single adjacency entry. This patch takes the `weight` parameter into account and populates the adjacency group according to the relative weight of each nexthop. Specifically, the weights of all the nexthops that should be offloaded are first normalized and then used to calculate the upper adjacency index of each nexthop. This is done according to the hash-threshold algorithm used by the kernel for IPv4 multi-path routing. Adjacency groups are currently limited to 32 entries which limits the weights that can be used, but follow-up patches will introduce groups of 512 entries. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The device has certain restrictions regarding the size of an adjacency group. Have the router determine the size of the adjacency group according to available KVDL allocation sizes and these restrictions. This was not needed until now since only allocations of up 32 entries were supported and these are all valid sizes for an adjacency group. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
As the first step towards non-equal-cost multi-path support, store each nexthop's weight. For IPv6 nexthops always set the weight to 1, as it only supports ECMP. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The current KVDL allocation API allows the user to specify the requested number of entries, but the user has no way of knowing how many entries were actually allocated. This works because existing users (e.g., router) request the exact number they end up using. With the introduction of large adjacency groups, this will change, as the router will have the ability to choose from several allocation sizes, where larger allocations provide higher accuracy with respect to requested weights and better resilience against nexthop failures. One option is to have the router try several allocations of descending size until one succeeds, but a better way is to simply allow it to query the actual allocation size and then size its request accordingly. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The KVD linear (KVDL) allocator currently consists of a very large bitmap that reflects the KVDL's usage. The boundaries of each partition as well as their allocation size are represented using defines. This representation requires us to patch all the functions that act on a partition whenever the partitioning scheme is changed. In addition, it does not enable the dynamic configuration of the KVDL using the up-coming resource manager. Add objects to represent these partitions as well as the accompanying code that acts on them to perform allocations and de-allocations. In the following patches, this will allow us to easily add another partition as well as new operations to act on these partitions. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ido Schimmel authored
The adjacency group size is part of the match on the adjacency group and should therefore be exposed using dpipe. When non-equal-cost multi-path support will be introduced, the group's size will help users understand the exact number of adjacency entries each nexthop occupies, as a nexthop will no longer correspond to a single entry. The output for a multi-path route with two nexthops, one with weight 255 and the second 1 will be: Example: $ devlink dpipe table dump pci/0000:01:00.0 name mlxsw_adj pci/0000:01:00.0: index 0 match_value: type field_exact header mlxsw_meta field adj_index value 65536 type field_exact header mlxsw_meta field adj_size value 512 type field_exact header mlxsw_meta field adj_hash_index value 0 action_value: type field_modify header ethernet field destination mac value e4:1d:2d:a5:f3:64 type field_modify header mlxsw_meta field erif_port mapping ifindex mapping_value 3 value 1 index 1 match_value: type field_exact header mlxsw_meta field adj_index value 65536 type field_exact header mlxsw_meta field adj_size value 512 type field_exact header mlxsw_meta field adj_hash_index value 510 action_value: type field_modify header ethernet field destination mac value e4:1d:2d:a5:f3:65 type field_modify header mlxsw_meta field erif_port mapping ifindex mapping_value 4 value 2 Thus, the first nexthop occupies 510 adjacency entries and the second 2, which leads to a ratio of 255 to 1. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Florian Fainelli says: ==================== net: dsa: bcm_sf2: Add support for IPv6 CFP rules This patch series adds support for matching IPv6 addresses to the existing CFP support code. Because IPv6 addresses are four times bigger than IPv4, we can fit them anymore in a single slice, so we need to chain two in order to have a complete match. This makes us require a second bitmap tracking unique rules so we don't over populate the TCAM. Finally, because the code had to be re-organized, it became a lot easier to support arbitrary prefix/mask lengths, so the last two patches do just that. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
There is no reason why we should limit ourselves to matching only full IPv4 addresses (/32), the same logic applies between the DATA and MASK ports, so just make it more configurable to accept both. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
There is no reason why we should limit ourselves to matching only full IPv4 addresses (/32), the same logic applies between the DATA and MASK ports, so just make it more configurable to accept both. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
Inserting IPv6 CFP rules complicates the code a little bit in that we need to insert two rules side by side and chain them to match a full IPv6 tuple (src, dst IPv6 + port + protocol). Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
There is no need to do a HW search of the TCAMs which is something slow and expensive. Since we already maintain a bitmask of active CFP rules, just iterate over those, starting from bit 1 (after the reserved entry) to get a count and index position to store the rule later on. As a result we can remove the code in bcm_sf2_cfp_rule_get() which acted on the "search" argument, and remove that argument. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
In preparation for introducing IPv6 rules support, make the cfp_udf_layout more flexible and match more accurately how the HW is designed: we have 3 + 1 slices per protocol, but we may not be using all of them and we are relative to a particular base offset (slice A for IPv4 for instance). Also populate the slice number that should be used (slice 1 for IPv4) based on the lookup function. Finally, we introduce two helper functions: udf_upper_bits() and udf_lower_bits() to help setting the UDF_n_* valid bits based on the number of UDFs valid within a slice. Update the IPv4 rule setting to make use of it to be more robust wrt. change in number of User Defined Fields being programmed. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
Move the processing of IPv4 rules into specific functions, allowing us to clearly identify which parts are generic and which ones are not. Also create a specific function to insert a rule into the action and policer RAMs as those tend to be fairly generic. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Florian Fainelli authored
Instead of open coding the shift for the IP protocol, IP fragment bit etc. define and/or use existing constants to that end. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Kees Cook authored
While the work callback uses the urb to find cardstate from bas_cardstate, this may not be valid for timer callbacks. Instead, introduce a direct pointer back to the cardstate from bas_cardstate for use in timer callbacks. Reported-by: Paul Bolle <pebolle@tiscali.nl> Fixes: 4cfea08e ("isdn/gigaset: Convert timers to use timer_setup()") Cc: Paul Bolle <pebolle@tiscali.nl> Cc: Karsten Keil <isdn@linux-pingi.de> Cc: "David S. Miller" <davem@davemloft.net> Cc: Johan Hovold <johan@kernel.org> Cc: gigaset307x-common@lists.sourceforge.net Cc: netdev@vger.kernel.org Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Alexei Starovoitov authored
fix multiple build errors and warnings 1. test_maps.c: In function ‘test_map_rdonly’: test_maps.c:1051:30: error: ‘BPF_F_RDONLY’ undeclared (first use in this function) MAP_SIZE, map_flags | BPF_F_RDONLY); 2. test_maps.c:1048:6: warning: unused variable ‘i’ [-Wunused-variable] int i, fd, key = 0, value = 0; 3. test_maps.c:1087:2: error: called object is not a function or function pointer assert(bpf_map_lookup_elem(fd, &key, &value) == -1 && errno == EPERM); 4. ./bpf_helpers.h:72:11: error: use of undeclared identifier 'BPF_FUNC_getsockopt' (void *) BPF_FUNC_getsockopt; Fixes: e043325b ("bpf: Add tests for eBPF file mode") Fixes: 6e71b04a ("bpf: Add file mode configuration into bpf maps") Fixes: cd86d1fd ("bpf: Adding helper function bpf_getsockops") Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 22 Oct, 2017 21 commits
-
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller authored
There were quite a few overlapping sets of changes here. Daniel's bug fix for off-by-ones in the new BPF branch instructions, along with the added allowances for "data_end > ptr + x" forms collided with the metadata additions. Along with those three changes came veritifer test cases, which in their final form I tried to group together properly. If I had just trimmed GIT's conflict tags as-is, this would have split up the meta tests unnecessarily. In the socketmap code, a set of preemption disabling changes overlapped with the rename of bpf_compute_data_end() to bpf_compute_data_pointers(). Changes were made to the mv88e6060.c driver set addr method which got removed in net-next. The hyperv transport socket layer had a locking change in 'net' which overlapped with a change of socket state macro usage in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
-
git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds authored
Pull networking fixes from David Miller: "A little more than usual this time around. Been travelling, so that is part of it. Anyways, here are the highlights: 1) Deal with memcontrol races wrt. listener dismantle, from Eric Dumazet. 2) Handle page allocation failures properly in nfp driver, from Jaku Kicinski. 3) Fix memory leaks in macsec, from Sabrina Dubroca. 4) Fix crashes in pppol2tp_session_ioctl(), from Guillaume Nault. 5) Several fixes in bnxt_en driver, including preventing potential NVRAM parameter corruption from Michael Chan. 6) Fix for KRACK attacks in wireless, from Johannes Berg. 7) rtnetlink event generation fixes from Xin Long. 8) Deadlock in mlxsw driver, from Ido Schimmel. 9) Disallow arithmetic operations on context pointers in bpf, from Jakub Kicinski. 10) Missing sock_owned_by_user() check in sctp_icmp_redirect(), from Xin Long. 11) Only TCP is supported for sockmap, make that explicit with a check, from John Fastabend. 12) Fix IP options state races in DCCP and TCP, from Eric Dumazet. 13) Fix panic in packet_getsockopt(), also from Eric Dumazet. 14) Add missing locked in hv_sock layer, from Dexuan Cui. 15) Various aquantia bug fixes, including several statistics handling cures. From Igor Russkikh et al. 16) Fix arithmetic overflow in devmap code, from John Fastabend. 17) Fix busted socket memory accounting when we get a fault in the tcp zero copy paths. From Willem de Bruijn. 18) Don't leave opt->tot_len uninitialized in ipv6, from Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) stmmac: Don't access tx_q->dirty_tx before netif_tx_lock ipv6: flowlabel: do not leave opt->tot_len with garbage of_mdio: Fix broken PHY IRQ in case of probe deferral textsearch: fix typos in library helpers rxrpc: Don't release call mutex on error pointer net: stmmac: Prevent infinite loop in get_rx_timestamp_status() net: stmmac: Fix stmmac_get_rx_hwtstamp() net: stmmac: Add missing call to dev_kfree_skb() mlxsw: spectrum_router: Configure TIGCR on init mlxsw: reg: Add Tunneling IPinIP General Configuration Register net: ethtool: remove error check for legacy setting transceiver type soreuseport: fix initialization race net: bridge: fix returning of vlan range op errors sock: correct sk_wmem_queued accounting on efault in tcp zerocopy bpf: add test cases to bpf selftests to cover all access tests bpf: fix pattern matches for direct packet access bpf: fix off by one for range markings with L{T, E} patterns bpf: devmap fix arithmetic overflow in bitmap_size calculation net: aquantia: Bad udp rate on default interrupt coalescing net: aquantia: Enable coalescing management via ethtool interface ...
-
Bernd Edlinger authored
This is the possible reason for different hard to reproduce problems on my ARMv7-SMP test system. The symptoms are in recent kernels imprecise external aborts, and in older kernels various kinds of network stalls and unexpected page allocation failures. My testing indicates that the trouble started between v4.5 and v4.6 and prevails up to v4.14. Using the dirty_tx before acquiring the spin lock is clearly wrong and was first introduced with v4.6. Fixes: e3ad57c9 ("stmmac: review RX/TX ring management") Signed-off-by: Bernd Edlinger <bernd.edlinger@hotmail.de> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
When syzkaller team brought us a C repro for the crash [1] that had been reported many times in the past, I finally could find the root cause. If FlowLabel info is merged by fl6_merge_options(), we leave part of the opt_space storage provided by udp/raw/l2tp with random value in opt_space.tot_len, unless a control message was provided at sendmsg() time. Then ip6_setup_cork() would use this random value to perform a kzalloc() call. Undefined behavior and crashes. Fix is to properly set tot_len in fl6_merge_options() At the same time, we can also avoid consuming memory and cpu cycles to clear it, if every option is copied via a kmemdup(). This is the change in ip6_setup_cork(). [1] kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 6613 Comm: syz-executor0 Not tainted 4.14.0-rc4+ #127 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: ffff8801cb64a100 task.stack: ffff8801cc350000 RIP: 0010:ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168 RSP: 0018:ffff8801cc357550 EFLAGS: 00010203 RAX: dffffc0000000000 RBX: ffff8801cc357748 RCX: 0000000000000010 RDX: 0000000000000002 RSI: ffffffff842bd1d9 RDI: 0000000000000014 RBP: ffff8801cc357620 R08: ffff8801cb17f380 R09: ffff8801cc357b10 R10: ffff8801cb64a100 R11: 0000000000000000 R12: ffff8801cc357ab0 R13: ffff8801cc357b10 R14: 0000000000000000 R15: ffff8801c3bbf0c0 FS: 00007f9c5c459700(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020324000 CR3: 00000001d1cf2000 CR4: 00000000001406f0 DR0: 0000000020001010 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 Call Trace: ip6_make_skb+0x282/0x530 net/ipv6/ip6_output.c:1729 udpv6_sendmsg+0x2769/0x3380 net/ipv6/udp.c:1340 inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 SYSC_sendto+0x358/0x5a0 net/socket.c:1750 SyS_sendto+0x40/0x50 net/socket.c:1718 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x4520a9 RSP: 002b:00007f9c5c458c08 EFLAGS: 00000216 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000718000 RCX: 00000000004520a9 RDX: 0000000000000001 RSI: 0000000020fd1000 RDI: 0000000000000016 RBP: 0000000000000086 R08: 0000000020e0afe4 R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000216 R12: 00000000004bb1ee R13: 00000000ffffffff R14: 0000000000000016 R15: 0000000000000029 Code: e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 ea 0f 00 00 48 8d 79 04 48 b8 00 00 00 00 00 fc ff df 45 8b 74 24 04 48 89 fa 48 c1 ea 03 <0f> b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 RIP: ip6_setup_cork+0x274/0x15c0 net/ipv6/ip6_output.c:1168 RSP: ffff8801cc357550 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Geert Uytterhoeven authored
If an Ethernet PHY is initialized before the interrupt controller it is connected to, a message like the following is printed: irq: no irq domain found for /interrupt-controller@e61c0000 ! However, the actual error is ignored, leading to a non-functional (POLL) PHY interrupt later: Micrel KSZ8041RNLI ee700000.ethernet-ffffffff:01: attached PHY driver [Micrel KSZ8041RNLI] (mii_bus:phy_addr=ee700000.ethernet-ffffffff:01, irq=POLL) Depending on whether the PHY driver will fall back to polling, Ethernet may or may not work. To fix this: 1. Switch of_mdiobus_register_phy() from irq_of_parse_and_map() to of_irq_get(). Unlike the former, the latter returns -EPROBE_DEFER if the interrupt controller is not yet available, so this condition can be detected. Other errors are handled the same as before, i.e. use the passed mdio->irq[addr] as interrupt. 2. Propagate and handle errors from of_mdiobus_register_phy() and of_mdiobus_register_device(). Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Randy Dunlap authored
Fix spellos (typos) in textsearch library helpers. Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Eric Dumazet says: ==================== tun: timer cleanups While working on a syzkaller issue that might have been fixed already by Cong Wang in commit 0ad646c8 ("tun: call dev_get_valid_name() before register_netdevice()") I made three small changes related to flow_gc_timer. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Timer is properly armed on demand from tun_flow_update(), so there is no need to arm it at tun init. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
If tun_flow_cleanup() deleted all flows, no need to arm the timer again. It will be armed next time tun_flow_update() is called. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
tun_flow_cleanup() being a timer callback, it is already running in BH context. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Lawrence Brakmo says: ==================== bpf: add support for BASE_RTT This patch set adds the following functionality to socket_ops BPF programs. 1) Add bpf helper function bpf_getsocketops. Currently only supports TCP_CONGESTION 2) Add BPF_SOCKET_OPS_BASE_RTT op to get the base RTT of the connection. In general, the base RTT indicates the threshold such that RTTs above it indicate congestion. More details in the relevant patches. Consists of the following patches: [PATCH net-next 1/5] bpf: add support for BPF_SOCK_OPS_BASE_RTT [PATCH net-next 2/5] bpf: Adding helper function bpf_getsockops [PATCH net-next 3/5] bpf: Add BPF_SOCKET_OPS_BASE_RTT support to [PATCH net-next 4/5] bpf: sample BPF_SOCKET_OPS_BASE_RTT program [PATCH net-next 5/5] bpf: create samples/bpf/tcp_bpf.readme ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lawrence Brakmo authored
Readme file explaining how to create a cgroupv2 and attach one of the tcp_*_kern.o socket_ops BPF program. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked_by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lawrence Brakmo authored
Sample socket_ops BPF program to test the BPF helper function bpf_getsocketops and the new socket_ops op BPF_SOCKET_OPS_BASE_RTT. The program provides a base RTT of 80us when the calling flow is within a DC (as determined by the IPV6 prefix) and the congestion algorithm is "nv". Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked_by: Alexei Starovoitov <ast@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lawrence Brakmo authored
TCP_NV will try to get the base RTT from a socket_ops BPF program if one is loaded. NV will then use the base RTT to bound its min RTT (its notion of the base RTT). It uses the base RTT as an upper bound and 80% of the base RTT as its lower bound. In other words, NV will consider filtered RTTs larger than base RTT as a sign of congestion. As a result, there is no minRTT inflation when there is a lot of congestion. For example, in a DC where the RTTs are less than 40us when there is no congestion, a base RTT value of 80us improves the performance of NV. The difference between the uncongested RTT and the base RTT provided represents how much queueing we are willing to have (in practice it can be higher). NV has been tunned to reduce congestion when there are many flows at the cost of one flow not achieving full bandwith utilization. When a reasonable base RTT is provided, one NV flow can now fully utilize the full bandwidth. In addition, the performance is also improved when there are many flows. In the following examples the NV results are using a kernel with this patch set (i.e. both NV results are using the new nv_loss_dec_factor). With one host sending to another host and only one flow the goodputs are: Cubic: 9.3 Gbps, NV: 5.5 Gbps, NV (baseRTT=80us): 9.2 Gbps With 2 hosts sending to one host (1 flow per host, the goodput per flow is: Cubic: 4.6 Gbps, NV: 4.5 Gbps, NV (baseRTT=80us)L 4.6 Gbps But the RTTs seen by a ping process in the sender is: Cubic: 3.3ms NV: 97us, NV (baseRTT=80us): 146us With a lot of flows things look even better for NV with baseRTT. Here we have 3 hosts sending to one host. Each sending host has 6 flows: 1 stream, 4x1MB RPC, 1x10KB RPC. Cubic, NV and NV with baseRTT all fully utilize the full available bandwidth. However, the distribution of bandwidth among the flows is very different. For the 10KB RPC flow: Cubic: 27Mbps, NV: 111Mbps, NV (baseRTT=80us): 222Mbps The 99% latencies for the 10KB flows are: Cubic: 26ms, NV: 1ms, NV (baseRTT=80us): 500us The RTT seen by a ping process at the senders: Cubic: 3.2ms NV: 720us, NV (baseRTT=80us): 330us Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lawrence Brakmo authored
Adding support for helper function bpf_getsockops to socket_ops BPF programs. This patch only supports TCP_CONGESTION. Signed-off-by: Vlad Vysotsky <vlad@cs.ucla.edu> Acked-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Lawrence Brakmo authored
A congestion control algorithm can make a call to the BPF socket_ops program to request the base RTT. The base RTT can be congestion control dependent and is meant to represent a congestion threshold such that RTTs above it indicate congestion. This is especially useful for flows within a DC where the base RTT is easy to obtain. Being provided a base RTT solves a basic problem in RTT based congestion avoidance algorithms (such as Vegas, NV and BBR). Although it is easy to get the base RTT when the network is not congested, it is very diffcult to do when it is very congested. Newer connections get an inflated value of the base RTT leading to unfariness (newer flows with a larger base RTT get more bandwidth). As a result, RTT based congestion avoidance algorithms tend to update their base RTTs to improve fairness. In very congested networks this can lead to base RTT inflation, reducing the ability of these RTT based congestion control algorithms to prevent congestion. Note that in my experiments with TCP-NV, the base RTT provided can be much larger than the actual hardware RTT. For example, experimenting with hosts within a rack where the hardware RTT is 16-20us, I've used base RTTs up to 150us. The effect of using a larger base RTT is that the congestion avoidance algorithm will allow more queueing. When there are only a few flows the main effect is larger measured RTTs and RPC latencies due to the increased queueing. When there are a lot of flows, a larger base RTT can lead to more congestion and more packet drops. For this case, where the hardware RTT is 20us, a base RTT of 80us produces good results. This patch only introduces BPF_SOCK_OPS_BASE_RTT, a later patch in this set adds support for using it in TCP-NV. Further study and testing is needed before support can be added to other delay based congestion avoidance algorithms. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Alexei Starovoitov <ast@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Pieter Jansen van Vuuren authored
Use direct access struct fields rather than PREP_FIELD() macros to manipulate the jump ID and length, both of which are exactly 8-bits wide. This simplifies the code somewhat. Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Gustavo A. R. Silva authored
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Gustavo A. R. Silva authored
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David Howells authored
Don't release call mutex at the end of rxrpc_kernel_begin_call() if the call pointer actually holds an error value. Fixes: 540b1c48 ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
Jose Abreu says: ==================== net: stmmac: Fix HW timestamping Three fixes for HW timestamping feature, all of them for RX side. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
-