- 28 Jan, 2008 40 commits
-
-
Eric Dumazet authored
ip_rt_acct needs 4096 bytes per cpu to perform some accounting. It is actually allocated as a single huge array [4096*NR_CPUS] (rounded up to a power of two) Converting it to a per cpu variable is wanted to : - Save space on machines were num_possible_cpus() < NR_CPUS - Better NUMA placement (each cpu gets memory on its node) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
Key points of this patch are: - In case new SACK information is advance only type, no skb processing below previously discovered highest point is done - Optimize cases below highest point too since there's no need to always go up to highest point (which is very likely still present in that SACK), this is not entirely true though because I'm dropping the fastpath_skb_hint which could previously optimize those cases even better. Whether that's significant, I'm not too sure. Currently it will provide skipping by walking. Combined with RB-tree, all skipping would become fast too regardless of window size (can be done incrementally later). Previously a number of cases in TCP SACK processing fails to take advantage of costly stored information in sack_recv_cache, most importantly, expected events such as cumulative ACK and new hole ACKs. Processing on such ACKs result in rather long walks building up latencies (which easily gets nasty when window is huge). Those latencies are often completely unnecessary compared with the amount of _new_ information received, usually for cumulative ACK there's no new information at all, yet TCP walks whole queue unnecessary potentially taking a number of costly cache misses on the way, etc.! Since the inclusion of highest_sack, there's a lot information that is very likely redundant (SACK fastpath hint stuff, fackets_out, highest_sack), though there's no ultimate guarantee that they'll remain the same whole the time (in all unearthly scenarios). Take advantage of this knowledge here and drop fastpath hint and use direct access to highest SACKed skb as a replacement. Effectively "special cased" fastpath is dropped. This change adds some complexity to introduce better coveraged "fastpath", though the added complexity should make TCP behave more cache friendly. The current ACK's SACK blocks are compared against each cached block individially and only ranges that are new are then scanned by the high constant walk. For other parts of write queue, even when in previously known part of the SACK blocks, a faster skip function is used (if necessary at all). In addition, whenever possible, TCP fast-forwards to highest_sack skb that was made available by an earlier patch. In typical case, no other things but this fast-forward and mandatory markings after that occur making the access pattern quite similar to the former fastpath "special case". DSACKs are special case that must always be walked. The local to recv_sack_cache copying could be more intelligent w.r.t DSACKs which are likely to be there only once but that is left to a separate patch. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
Worker function that implements the main logic of the inner-most loop of tcp_sacktag_write_queue(). Idea was originally presented by David S. Miller. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
Highest_sack_end_seq is no longer calculated in the loop, thus it can be pushed to the worker function altogether making that function independent of the sacktag. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
It is going to replace the sack fastpath hint quite soon... :-) Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
Many assumptions that are true when no reordering or other strange events happen are not a part of the RFC3517. FACK implementation is based on such assumptions. Previously (before the rewrite) the non-FACK SACK was basically doing fast rexmit and then it times out all skbs when first cumulative ACK arrives, which cannot really be called SACK based recovery :-). RFC3517 SACK disables these things: - Per SKB timeouts & head timeout entry to recovery - Marking at least one skb while in recovery (RFC3517 does this only for the fast retransmission but not for the other skbs when cumulative ACKs arrive in the recovery) - Sacktag's loss detection flavors B and C (see comment before tcp_sacktag_write_queue) This does not implement the "last resort" rule 3 of NextSeg, which allows retransmissions also when not enough SACK blocks have yet arrived above a segment for IsLost to return true [RFC3517]. The implementation differs from RFC3517 in these points: - Rate-halving is used instead of FlightSize / 2 - Instead of using dupACKs to trigger the recovery, the number of SACK blocks is used as FACK does with SACK blocks+holes (which provides more accurate number). It seems that the difference can affect negatively only if the receiver does not generate SACK blocks at all even though it claimed to be SACK-capable. - Dupthresh is not a constant one. Dynamical adjustments include both holes and sacked segments (equal to what FACK has) due to complexity involved in determining the number sacked blocks between highest_sack and the reordered segment. Thus it's will be an over-estimate. Implementation note: tcp_clean_rtx_queue doesn't need a lost_cnt tweak because head skb at that point cannot be SACKED_ACKED (nor would such situation last for long enough to cause problems). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
This implements more accurately what is stated in sacktag's overall comment: "Both of these heuristics are not used in Loss state, when we cannot account for retransmits accurately." When CA_Loss state is entered, the state changer ensures that undo_marker is only set if no TCPCB_RETRANS skbs were found, thus having non-zero undo_marker in CA_Loss basically tells that the R-bits still accurately reflect the current state of TCP. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Ilpo Järvinen authored
All intermediate conditions include it already, make them simpler as well. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Pavel Emelyanov authored
The sock_valbool_flag() helper is used in setsockopt to set or reset some flag on the sock. This helper is required in the net/socket.c only, so move it there. Besides, patch two places in sys_setsockopt() that repeat this helper functionality manually. Since this is not a bugfix, but a trivial cleanup, I prepared this patch against net-2.6.25, but it also applies (with a single offset) to the latest net-2.6. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Pavel Emelyanov authored
Some time ago a sockfd_lookup_light was introduced and most of the socket.c file was patched to use it. However two routines were left - sys_sendto and sys_recvfrom. Patch them as well, since this helper does exactly what these two need. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Rainer Jochem authored
From : Rainer Jochem <rainer.jochem@mpi-sb.mpg.de> Acked-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Eric Dumazet authored
Qdisc_class_ops are const, and Qdisc_ops are mostly read. Using "const" and "__read_mostly" qualifiers helps to reduce false sharing. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YOSHIFUJI Hideaki authored
Policy table is implemented as an RCU linear list since we do not expect large list nor frequent updates. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YOSHIFUJI Hideaki authored
This patch allows ifindex to be a key for address selection policy table. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
YOSHIFUJI Hideaki authored
This patch renames ipv6_saddr_label() to ipv6_addr_label() because address label is used for both of source address and destination address. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org> Signed-off-by: David S. Miller <davem@davemloft.net>
-
David S. Miller authored
After changeset: [NETFILTER]: Introduce NF_INET_ hook values It always evaluates to NF_INET_POST_ROUTING. Signed-off-by: David S. Miller <davem@davemloft.net>
-
Patrick McHardy authored
The IPv4 and IPv6 hook values are identical, yet some code tries to figure out the "correct" value by looking at the address family. Introduce NF_INET_* values for both IPv4 and IPv6. The old values are kept in a #ifndef __KERNEL__ section for userspace compatibility. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
This patch adds support for async resumptions on input. To do so, the transform would return -EINPROGRESS and subsequently invoke the function xfrm_input_resume to resume processing. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
The nhoff field isn't actually necessary in xfrm_input. For tunnel mode transforms we now throw away the output IP header so it makes no sense to fill in the nexthdr field. For transport mode we can now let the function transport_finish do the setting and it knows where the nexthdr field is. The only other thing that needs the nexthdr field to be set is the header extraction code. However, we can simply move the protocol extraction out of the generic header extraction. We want to minimise the amount of info we have to carry around between transforms as this simplifies the resumption process for async crypto. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
Currently x->lastused is u64 which means that it cannot be read/written atomically on all architectures. David Miller observed that the value stored in it is only an unsigned long which is always atomic. So based on his suggestion this patch changes the internal representation from u64 to unsigned long while the user-interface still refers to it as u64. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
This patch releases the lock on the state before calling x->type->input. It also adds the lock to the spots where they're currently needed. Most of those places (all except mip6) are expected to disappear with async crypto. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
Similar to the moving out of the replay processing on the output, this patch moves the integrity stat collectin from x->type->input into xfrm_input. This would eventually allow transforms such as AH/ESP to be lockless. The error value EBADMSG (currently unused in the crypto layer) is used to indicate a failed integrity check. In future this error can be directly returned by the crypto layer once we switch to aead algorithms. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
As it is xfrm_input first collects a list of xfrm states on the stack before storing them in the packet's security path just before it returns. For async crypto, this construction presents an obstacle since we may need to leave the loop after each transform. In fact, it's much easier to just skip the stack completely and always store to the security path. This is proven by the fact that this patch actually shrinks the code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
As part of the work on asynchronous cryptographic operations, we need to be able to resume from the spot where they occur. As such, it helps if we isolate them to one spot. This patch moves most of the remaining family-specific processing into the common input code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
This patch adds support for async resumptions on output. To do so, the transform would return -EINPROGRESS and subsequently invoke the function xfrm_output_resume to resume processing. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
As part of the work on asynchrnous cryptographic operations, we need to be able to resume from the spot where they occur. As such, it helps if we isolate them to one spot. This patch moves most of the remaining family-specific processing into the common output code. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
Most callers of the LOCAL_OUT chain will set the IP packet length before doing so. They also share the same output function dst_output. This patch creates a new function called ip6_local_out which does all of that and converts the appropriate users over to it. Apart from removing duplicate code, it will also help in merging the IPsec output path. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
Most callers of the LOCAL_OUT chain will set the IP packet length and header checksum before doing so. They also share the same output function dst_output. This patch creates a new function called ip_local_out which does all of that and converts the appropriate users over to it. Apart from removing duplicate code, it will also help in merging the IPsec output path once the same thing is done for IPv6. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
With inter-family transforms the inner mode differs from the outer mode. Attempting to handle both sides from the same function means that it needs to handle both IPv4 and IPv6 which creates duplication and confusion. This patch separates the two parts on the input path so that each function deals with one family only. In particular, the functions xfrm4_extract_inut/xfrm6_extract_inut moves the pertinent fields from the IPv4/IPv6 IP headers into a neutral format stored in skb->cb. This is then used by the inner mode input functions to modify the inner IP header. In this way the input function no longer has to know about the outer address family. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
With inter-family transforms the inner mode differs from the outer mode. Attempting to handle both sides from the same function means that it needs to handle both IPv4 and IPv6 which creates duplication and confusion. This patch separates the two parts on the output path so that each function deals with one family only. In particular, the functions xfrm4_extract_output/xfrm6_extract_output moves the pertinent fields from the IPv4/IPv6 IP headers into a neutral format stored in skb->cb. This is then used by the outer mode output functions to write the outer IP header. In this way the output function no longer has to know about the inner address family. Since the extract functions are only called by tunnel modes (the only modes that can support inter-family transforms), I've also moved the xfrm*_tunnel_check_size calls into them. This allows the correct ICMP message to be sent as opposed to now where you might call icmp_send with an IPv6 packet and vice versa. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
This patch changes the prototype of ipv4_copy_dscp and ipv6_copy_dscp so that they directly take the outer DSCP rather than the outer IP header. This will help us to unify the code for inter-family tunnels. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
RO mode is the only one that requires a locked output function. So it's easier to move the lock into that function rather than requiring everyone else to run under the lock. In particular, this allows us to move the size check into the output function without causing a potential dead-lock should the ICMP error somehow hit the same SA on transmission. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
While BEET can theoretically work with IPComp the current code can't do that because it tries to construct a BEET mode tunnel type which doesn't (and cannot) exist. In fact as it is it won't even attach a tunnel object at all for BEET which is bogus. To support this fully we'd also need to change the policy checks on input to recognise a plain tunnel as a legal variant of an optional BEET transform. This patch simply fails such constructions for now. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
Half of the code in xfrm4_bundle_create and xfrm6_bundle_create are common. This patch extracts that logic and puts it into xfrm_bundle_create. The rest of it are then accessed through afinfo. As a result this fixes the problem with inter-family transforms where we treat every xfrm dst in the bundle as if it belongs to the top family. This patch also fixes a long-standing error-path bug where we may free the xfrm states twice. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
This patch moves the flow construction from the callers of xfrm_dst_lookup into that function. It also changes xfrm_dst_lookup so that it takes an xfrm state as its argument instead of explicit addresses. This removes any address-specific logic from the callers of xfrm_dst_lookup which is needed to correctly support inter-family transforms. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
The functions local_addr and remote_addr are more than what they're needed for. The same thing can be done easily with flags on the type object. This patch does that and simplifies the wrapper functions in xfrm6_policy accordingly. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
Previously we took the device from the bottom route and idev from the top route. This is bad because idev may well point to a different device. This patch changes it so that we get the idev from the device directly. It also makes it an error if either dev or idev is NULL. This is consistent with the rest of the routing code which also treats these cases as errors. I've removed the err initialisation in xfrm6_policy.c because it achieves no purpose and hid a bug when an initial version of this patch neglected to set err to -ENODEV (fortunately the IPv4 version warned about it). Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-
Herbert Xu authored
The input function should never be invoked on IPsec dst objects. This is because we don't apply IPsec on input until after we've made the routing decision. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
-