1. 09 Aug, 2012 5 commits
    • net: Make ifindex generation per-net namespace · aa79e66e
      Pavel Emelyanov authored
      Strictly speaking this is only _really_ required for checkpoint-restore to
      make loopback device always have the same index.
      
      This change appears to be safe with respect to the "ifindex should be
      unique per-system" concept, as all ifindex usage is either already made
      per net namespace or is explicitly limited to init_net only.
      
      There are two cool side effects of this. First, ifindices of devices in a
      container are always small, regardless of how many containers we've
      started (and restarted) so far. Second, we can speed up loopback ifindex
      access, as shown in the next patch.
      
      v2: Place ifindex right after dev_base_seq to avoid two holes and to use the
          same cache line, dirtied in list_netdevice()/unlist_netdevice()
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
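What follows is a minimal userspace sketch of per-namespace index allocation. It assumes each namespace keeps its own "last index handed out" counter and can check whether an index is already taken. The struct and function names (`sketch_net`, `sketch_new_index`, `sketch_register`) are hypothetical and are not kernel APIs. The fixed-size table stands in for the namespace's device list.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a network namespace: it keeps
 * its own "last index handed out" counter and a small table of indices
 * that are currently registered in this namespace only. */
#define SKETCH_MAX_DEVS 16

struct sketch_net {
    int last_ifindex;
    int used[SKETCH_MAX_DEVS];
    int nr_used;
};

static bool sketch_index_in_use(const struct sketch_net *net, int ifindex)
{
    for (int i = 0; i < net->nr_used; i++)
        if (net->used[i] == ifindex)
            return true;
    return false;
}

/* Pick the next free index within this namespace. Wrap back to 1
 * instead of overflowing, and skip indices that are still in use. */
static int sketch_new_index(struct sketch_net *net)
{
    int ifindex = net->last_ifindex;

    for (;;) {
        ifindex = (ifindex == INT_MAX) ? 1 : ifindex + 1;
        if (!sketch_index_in_use(net, ifindex))
            return net->last_ifindex = ifindex;
    }
}

/* Register a device in @net and return its index, or -1 if the
 * sketch's fixed-size table is full. */
static int sketch_register(struct sketch_net *net)
{
    if (net->nr_used >= SKETCH_MAX_DEVS)
        return -1;
    int ifindex = sketch_new_index(net);
    assert(ifindex > 0);
    net->used[net->nr_used++] = ifindex;
    return ifindex;
}
```

Take two zero-initialized `struct sketch_net` values. The first `sketch_register()` call on each returns 1, because the counters are independent. This is the property that keeps the loopback device's index stable across checkpoint-restore.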
    • veth: Allow to create peer link with given ifindex · e6f8f1a7
      Pavel Emelyanov authored
      The ifinfomsg is already there (thanks to kaber@ for foreseeing this a long
      time ago), so take the given ifindex and register the netdev with it.
      
      Ben noticed that this code path previously ignored ifmp->ifi_index, so
      userland could be passing in garbage. Thus link creation may now fail
      occasionally because the value clashes with an existing interface.
      
      To address this, it's assumed that if the caller specifies the ifindex for
      the veth master device, then it's aware of this possibility and should
      explicitly specify the peer's ifindex as well (or set it to 0 for
      auto-assignment). This preserves compatibility with old tools that don't
      set ifindex.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
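The compatibility rule in this commit can be written as a small decision function. `sketch_veth_peer_ifindex` is a hypothetical helper, not the veth driver's code. It assumes, as the commit message states, that 0 means "auto-assign" for both indices.

```c
#include <assert.h>

/* Hypothetical helper, not the veth driver's code.
 * master_ifindex: ifi_index from the request's own ifinfomsg (0 = unset).
 * peer_ifindex:   ifi_index from the peer's ifinfomsg (0 = auto-assign).
 * Returns the index to request for the peer; 0 means auto-assign. */
static int sketch_veth_peer_ifindex(int master_ifindex, int peer_ifindex)
{
    /* Assumption for this sketch: the master's index comes from a request
     * that was already validated, so it is never negative here. */
    assert(master_ifindex >= 0);

    /* Old tools leave the master's index at 0 and may pass garbage in the
     * peer's ifinfomsg. Honour the peer value only when the caller opted
     * in by choosing the master's index too. */
    if (master_ifindex != 0)
        return peer_ifindex;
    return 0;
}
```

An old tool that sends `(0, <garbage>)` still gets auto-assignment for the peer. A restore tool that sends `(7, 8)` gets exactly index 8.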
    • net: Allow to create links with given ifindex · 9c7dafbf
      Pavel Emelyanov authored
      Currently RTM_NEWLINK fails with -EOPNOTSUPP if ifinfomsg->ifi_index is
      not zero. I propose allowing ifindices to be requested on link creation.
      This is required by checkpoint-restore to correctly restore a net
      namespace (i.e. a container).
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
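Here is a sketch of registration with an optional requested index. It assumes a clash is reported as `-EBUSY`, and `-EINVAL`/`-ENOSPC` are used for the sketch's own limits; these error codes were chosen for illustration. `sketch_register_with_index()` and its table are hypothetical stand-ins for the kernel's per-namespace device lookup.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical per-namespace index table (not a kernel structure). */
#define SKETCH_MAX_DEVS 16

struct sketch_net {
    int last_ifindex;          /* last automatically assigned index */
    int used[SKETCH_MAX_DEVS]; /* indices registered in this namespace */
    int nr_used;
};

static bool sketch_index_in_use(const struct sketch_net *net, int ifindex)
{
    for (int i = 0; i < net->nr_used; i++)
        if (net->used[i] == ifindex)
            return true;
    return false;
}

/* Auto-assignment: the next free index after the last one handed out. */
static int sketch_auto_index(struct sketch_net *net)
{
    int ifindex = net->last_ifindex;

    do
        ifindex = (ifindex == INT_MAX) ? 1 : ifindex + 1;
    while (sketch_index_in_use(net, ifindex));
    return net->last_ifindex = ifindex;
}

/* requested == 0 means "pick one for me" (the only behaviour before this
 * commit). A non-zero value is honoured only if it is still free in this
 * namespace. Returns the index or a negative errno. */
static int sketch_register_with_index(struct sketch_net *net, int requested)
{
    int ifindex;

    if (requested < 0)
        return -EINVAL;
    if (net->nr_used >= SKETCH_MAX_DEVS)
        return -ENOSPC;
    if (requested == 0)
        ifindex = sketch_auto_index(net);
    else if (sketch_index_in_use(net, requested))
        return -EBUSY;
    else
        ifindex = requested;

    assert(ifindex > 0);
    net->used[net->nr_used++] = ifindex;
    return ifindex;
}
```

In a restore, the tool would replay each saved device with its saved index. Later devices created with 0 still get auto-assigned indices that skip whatever was restored.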
    • net: Dont use ifindices in hash fns · b14f243a
      Pavel Emelyanov authored
      Eric noticed that once there are devices with equal indices, some hash
      functions that use them will become less effective than they could be.
      Fix this in advance by mixing the net_device address into the hash value
      instead of the device index.
      
      This applies to the arp and ndisc hash fns. The netlabel, can and llc ones
      are also ifindex-based, but those three are init_net-only and thus will
      not be affected.
      
      Many thanks to David and Eric for the hash32_ptr implementation!
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
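The sketch below illustrates why mixing in the device address helps. It folds a pointer into 32 bits in the spirit of `hash32_ptr`, then compares that against an ifindex-based hash for two devices that share an index. The neighbour-hash shape `(key ^ x) * rnd` and all `sketch_` names are assumptions for illustration; they are not the exact arp or ndisc hash functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device record: only the field this sketch needs. */
struct sketch_dev {
    int ifindex;
};

/* Fold a pointer into 32 bits. On 64-bit builds the high half is XORed
 * into the low half so that it still contributes to the result. */
static uint32_t sketch_hash32_ptr(const void *ptr)
{
    uintptr_t val = (uintptr_t)ptr;
#if UINTPTR_MAX > 0xffffffffu
    val ^= (val >> 32);
#endif
    return (uint32_t)val;
}

/* Before: the device's ifindex is mixed in, so devices with equal
 * indices in different namespaces land in the same buckets. */
static uint32_t sketch_hash_by_ifindex(uint32_t key, const struct sketch_dev *dev,
                                       uint32_t rnd)
{
    return (key ^ (uint32_t)dev->ifindex) * rnd;
}

/* After: the device address is mixed in instead. Distinct devices have
 * distinct addresses even when their indices are equal. */
static uint32_t sketch_hash_by_dev(uint32_t key, const struct sketch_dev *dev,
                                   uint32_t rnd)
{
    assert(rnd & 1u); /* an odd multiplier keeps the multiply a bijection mod 2^32 */
    return (key ^ sketch_hash32_ptr(dev)) * rnd;
}
```

Take two loopback devices, both with index 1. They always hash together under the first function but normally spread apart under the second.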
    • time: jiffies_delta_to_clock_t() helper to the rescue · a399a805
      Eric Dumazet authored
      Various /proc/net files sometimes report crazy timer values, expressed
      in clock_t units.
      
      This happens when an expired timer delta (expires - jiffies) is passed
      to jiffies_to_clock_t().
      
      This function has an overflow in:
      
          return div_u64((u64)x * TICK_NSEC, NSEC_PER_SEC / USER_HZ);
      
      commit cbbc719f (time: Change jiffies_to_clock_t() argument type
      to unsigned long) only worked around the problem.
      
      As we can't output negative values in /proc/net/tcp without breaking
      various tools, I suggest adding a jiffies_delta_to_clock_t() wrapper
      that caps negative deltas to 0.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Maciej Żenczykowski <maze@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: hank <pyu@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
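Below is a simplified sketch of the clamping wrapper. It assumes HZ=1000 and USER_HZ=100, and it replaces the real conversion with a plain division. That means it does not reproduce the `(u64)x * TICK_NSEC` multiplication quoted above. It only shows how a negative delta turns into a huge unsigned value and how clamping to 0 avoids that. All names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Assumed tick rates for this sketch only: real kernels choose HZ at
 * build time, and USER_HZ is the unit /proc reports clock_t values in. */
#define SKETCH_HZ      1000UL
#define SKETCH_USER_HZ 100UL
static_assert(SKETCH_HZ % SKETCH_USER_HZ == 0,
              "sketch assumes HZ is a multiple of USER_HZ");

/* Simplified stand-in for jiffies_to_clock_t(). It takes an unsigned
 * long, so a negative delta passed in silently becomes a huge value. */
static unsigned long sketch_jiffies_to_clock_t(unsigned long x)
{
    return x / (SKETCH_HZ / SKETCH_USER_HZ);
}

/* Wrapper in the spirit of jiffies_delta_to_clock_t(): an already-expired
 * timer (expires - jiffies < 0) is reported as 0 instead. */
static unsigned long sketch_jiffies_delta_to_clock_t(long delta)
{
    return sketch_jiffies_to_clock_t(delta > 0 ? (unsigned long)delta : 0UL);
}
```

A /proc writer would pass `expires - jiffies` straight to the wrapper. A timer that expired 5 jiffies ago then shows 0 rather than a value near `ULONG_MAX / 10`.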
  2. 07 Aug, 2012 4 commits
  3. 06 Aug, 2012 4 commits
    • tcp: ecn: dont delay ACKS after CE · aae06bf5
      Eric Dumazet authored
      While playing with CoDel and ECN marking, I discovered non-optimal
      behavior in the receiver of CE (Congestion Experienced) segments.
      
      In pathological cases, the sender has reduced its cwnd to low values,
      and the receiver delays its ACK (by 40 ms).
      
      While RFC 3168 6.1.3 (The TCP Receiver) doesn't explicitly recommend
      sending immediate ACKs, we believe it's better not to delay ACKs, because
      a CE segment should give the same signal as a dropped segment, and it's
      quite important to reduce RTT so that ECE/CWR signals are exchanged as
      fast as possible.
      
      Note we already call tcp_enter_quickack_mode() from TCP_ECN_check_ce()
      if we receive a retransmit, for the same reason.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
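The receiver-side change can be modelled as follows. The state struct and function are hypothetical; they are not `struct tcp_sock` or `TCP_ECN_check_ce()`. The codepoint values are the RFC 3168 ones carried in the two low bits of the TOS byte. The sketch assumes quickack is entered on the first CE mark, while no CWR is being awaited yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ECN codepoints in the two low bits of the IP TOS byte (RFC 3168). */
enum sketch_ecn {
    SKETCH_NOT_ECT = 0,
    SKETCH_ECT_1   = 1,
    SKETCH_ECT_0   = 2,
    SKETCH_CE      = 3,
};

/* Hypothetical receiver state for this sketch. */
struct sketch_rcv {
    bool demand_cwr; /* keep echoing ECE until the sender answers with CWR */
    bool quickack;   /* send the next ACK immediately, no delayed-ACK timer */
};

static void sketch_on_segment(struct sketch_rcv *rcv, unsigned int tos)
{
    assert(rcv != NULL);

    if ((tos & 3u) == SKETCH_CE) {
        /* Treat CE like a loss signal. The sender may be running with a
         * tiny cwnd, so do not hold the ACK for a delayed-ACK timeout. */
        if (!rcv->demand_cwr)
            rcv->quickack = true;
        rcv->demand_cwr = true;
    }
}
```

An ECT(0) segment (tos 0x2) leaves the state alone. The first CE segment (tos 0x3) both arms ECE echoing and requests an immediate ACK.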
    • net: tcp: GRO should be ECN friendly · a9e050f4
      Eric Dumazet authored
      While doing TCP ECN tests, I discovered that GRO reorders packets if it
      receives one packet with CE set while previous packets of the same flow
      in the same NAPI run have ECT(0):
      
      09:25:25.857620 IP (tos 0x2,ECT(0), ttl 64, id 27893, offset 0, flags [DF], proto TCP (6), length 4396)
          172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq 233801:238145, ack 1, win 115, options [nop,nop,TS val 3397779 ecr 1990627], length 4344

      09:25:25.857626 IP (tos 0x3,CE, ttl 64, id 27892, offset 0, flags [DF], proto TCP (6), length 1500)
          172.30.42.19.54550 > 172.30.42.13.44139: Flags [.], seq 232353:233801, ack 1, win 115, options [nop,nop,TS val 3397779 ecr 1990627], length 1448

      09:25:25.857638 IP (tos 0x0, ttl 64, id 34581, offset 0, flags [DF], proto TCP (6), length 64)
          172.30.42.13.44139 > 172.30.42.19.54550: Flags [.], cksum 0xac8f (incorrect -> 0xca69), ack 232353, win 1271, options [nop,nop,TS val 1990627 ecr 3397779,nop,nop,sack 1 {233801:238145}], length 0
      
      We have two problems here:
      
      1) GRO reorders packets
      
      If the NIC delivers packet1, then packet2, and they happen to be from
      "different flows", GRO feeds the stack packet2, then packet1. I have yet
      to understand how to solve this problem.
      
      2) GRO is not ECN friendly
      
      Delivering packets out of order makes the TCP stack slower than it could
      be.
      
      In this patch I suggest making the tos test part of the 'should flush'
      logic rather than part of the 'same_flow' determination.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
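Here is a sketch of the proposed split between flow identity and flushing. It uses a hypothetical subset of IPv4 header fields. Real GRO also matches TCP ports and other fields. The sketch only shows that a tos difference (ECT(0) vs CE, as in the trace in this commit) now sets `flush` instead of clearing `same_flow`. All `sketch_` names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of IPv4 header fields used for GRO matching. */
struct sketch_iph {
    uint32_t saddr, daddr;
    uint8_t protocol, tos, ttl;
};

struct sketch_gro_verdict {
    bool same_flow; /* may this packet be matched against the held one? */
    bool flush;     /* must the held packet be delivered before merging? */
};

/* tos no longer decides flow identity. A tos mismatch (for example ECT(0)
 * vs CE) only forces a flush, so the packets are still recognised as one
 * flow and reach the stack in order. */
static struct sketch_gro_verdict
sketch_gro_compare(const struct sketch_iph *held, const struct sketch_iph *in)
{
    struct sketch_gro_verdict v = { .same_flow = true, .flush = false };

    assert(held != NULL && in != NULL);
    if (held->protocol != in->protocol || held->saddr != in->saddr ||
        held->daddr != in->daddr) {
        v.same_flow = false;
        return v;
    }
    v.flush = held->tos != in->tos || held->ttl != in->ttl;
    return v;
}
```

Consider the first two packets of the trace: 172.30.42.19 to 172.30.42.13, tos 0x2 then 0x3. Under this rule they compare as the same flow with `flush` set, so the held packet is delivered first rather than being overtaken.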
    • net: reorganize IP MIB values · 14a19680
      Eric Dumazet authored
      Reduce IP latencies by placing hot MIB IP fields in a single cache line.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
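One way to express this layout goal is a counter enum whose hot prefix is checked at compile time to fit in one cache line. The counter names and the 64-byte line size are assumptions for this sketch. They are not the kernel's actual MIB enum.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cache-line size for this sketch; it varies by CPU. */
#define SKETCH_CACHELINE 64

/* Illustrative counter layout with hypothetical names. The counters bumped
 * on every packet come first so they share one cache line; rarely touched
 * error counters follow. */
enum sketch_ipstats {
    /* hot: updated in the receive/transmit fast paths */
    SKETCH_IN_PKTS,
    SKETCH_IN_OCTETS,
    SKETCH_IN_DELIVERS,
    SKETCH_OUT_FORW_DATAGRAMS,
    SKETCH_OUT_PKTS,
    SKETCH_OUT_OCTETS,
    SKETCH_HOT_END,                   /* marker: first non-hot slot */
    /* cold: error and corner-case counters */
    SKETCH_IN_HDR_ERRORS = SKETCH_HOT_END,
    SKETCH_IN_ADDR_ERRORS,
    SKETCH_IN_DISCARDS,
    SKETCH_NR
};

/* The counter array starts on a cache-line boundary, so the hot prefix
 * begins a line. */
struct sketch_ip_mib {
    _Alignas(SKETCH_CACHELINE) uint64_t mibs[SKETCH_NR];
};

static_assert(SKETCH_HOT_END * sizeof(uint64_t) <= SKETCH_CACHELINE,
              "hot IP counters must fit in a single cache line");
```

With six hot 64-bit counters (48 bytes), a packet on the fast path dirties one line of this structure rather than several.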
    • net: avoid reloads in SNMP_UPD_PO_STATS · d25398df
      Eric Dumazet authored
      Avoid two instructions that reload the dev->nd_net->mib.ip_statistics
      pointer by using a temp variable, for example in the ip_rcv() and
      ip_output() paths.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
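Below is a sketch of the temp-variable pattern, using hypothetical structs and counter names. The real per-CPU statistics would use per-CPU operations; this sketch uses plain increments. The point is only that the `mib` argument expression is evaluated once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter indices and containers standing in for the IP MIB. */
enum { SKETCH_INPKTS, SKETCH_INOCTETS, SKETCH_NR };

struct sketch_mib { uint64_t mibs[SKETCH_NR]; };
struct sketch_net { struct sketch_mib *ip_statistics; };
struct sketch_dev { struct sketch_net *nd_net; };

/* Before: the argument expression (e.g. dev->nd_net->ip_statistics) is
 * expanded twice, so the compiler may reload the pointer chain. */
#define SKETCH_UPD_PO_STATS_OLD(mib, base, addend)          \
    do {                                                    \
        (mib)->mibs[base##PKTS]++;                          \
        (mib)->mibs[base##OCTETS] += (addend);              \
    } while (0)

/* After: evaluate the expression once into a local pointer and update
 * both counters through it. */
#define SKETCH_UPD_PO_STATS(mib, base, addend)              \
    do {                                                    \
        uint64_t *ptr_ = (mib)->mibs;                       \
        ptr_[base##PKTS]++;                                 \
        ptr_[base##OCTETS] += (addend);                     \
    } while (0)

/* A receive-path caller in the spirit of ip_rcv(). */
static void sketch_ip_rcv(struct sketch_dev *dev, size_t len)
{
    assert(dev != NULL && dev->nd_net != NULL);
    SKETCH_UPD_PO_STATS(dev->nd_net->ip_statistics, SKETCH_IN, len);
}
```

Both macros produce the same counter values. The second simply guarantees a single load of the `ip_statistics` pointer per update.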
  4. 04 Aug, 2012 13 commits
  5. 03 Aug, 2012 8 commits
  6. 02 Aug, 2012 6 commits