1. 17 Nov, 2008 23 commits
    • Eric Dumazet's avatar
      net: af_unix can make unix_nr_socks visbile in /proc · 248969ae
      Eric Dumazet authored
      Currently, /proc/net/protocols displays socket counts only for TCP/TCPv6
      protocols
      
      We can provide unix_nr_socks for free here, this counter being
      already maintained in af_unix
      
      Before patch :
      
      # grep UNIX /proc/net/protocols
      UNIX       428     -1      -1   NI       0   yes  kernel
      
      After patch :
      
      # grep UNIX /proc/net/protocols
      UNIX       428     98      -1   NI       0   yes  kernel
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      248969ae
    • Wang Chen's avatar
      netdevice chelsio: Convert directly reference of netdev->priv · c3ccc123
      Wang Chen authored
      Several netdev share one adapter here.
      We use netdev->ml_priv of the netdevs point to the first netdev's priv.
      Signed-off-by: default avatarWang Chen <wangchen@cn.fujitsu.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      c3ccc123
    • Alexey Dobriyan's avatar
      ematch: simpler tcf_em_unregister() · 4d24b52a
      Alexey Dobriyan authored
      Simply delete ops from list and let list debugging do the job.
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      4d24b52a
    • Eric Dumazet's avatar
      net: Cleanup of af_unix · 6eba6a37
      Eric Dumazet authored
      This is a pure cleanup of net/unix/af_unix.c to meet current code
      style standards
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6eba6a37
    • Gerrit Renker's avatar
      dccp: Tidy up setsockopt calls · 19102996
      Gerrit Renker authored
      This splits the setsockopt calls into two groups, depending on whether an
      integer argument (val) is required and whether routines being called do
      their own locking.
      
      Some options (such as setting the CCID) use u8 rather than int, so that for
      these the test with regard to integer-sizeof can not be used.
      
      The second switch-case statement now only has those statements which need
      locking and which make use of `val'.
      Signed-off-by: default avatarGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: default avatarIan McDonald <ian.mcdonald@jandi.co.nz>
      Acked-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Reviewed-by: default avatarEugene Teo <eugeneteo@kernel.sg>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      19102996
    • Gerrit Renker's avatar
      dccp: Deprecate Ack Ratio sysctl · dd9c0e36
      Gerrit Renker authored
      This patch deprecates the Ack Ratio sysctl, since
       * Ack Ratio is entirely ignored by CCID-3 and CCID-4,
       * Ack Ratio currently doesn't work in CCID-2 (i.e. is always set to 1);
       * even if it would work in CCID-2, there is no point for a user to change it:
         - Ack Ratio is constrained by cwnd (RFC 4341, 6.1.2),
         - if Ack Ratio > cwnd, the system resorts to spurious RTO timeouts
           (since waiting for Acks which will never arrive in this window),
         - cwnd is not a user-configurable value.
      
      The only reasonable place for Ack Ratio is to print it for debugging. It is
      planned to do this later on, as part of e.g. dccp_probe.
      
      With this patch Ack Ratio is now under full control of feature negotiation:
       * Ack Ratio is resolved as a dependency of the selected CCID;
       * if the chosen CCID supports it (i.e. CCID == CCID-2), Ack Ratio is set to
         the default of 2, following RFC 4340, 11.3 - "New connections start with Ack
         Ratio 2 for both endpoints";
       * what happens then is part of another patch set, since it concerns the
         dynamic update of Ack Ratio while the connection is in full flight.
      
      Thanks to Tomasz Grobelny for discussion leading up to this patch.
      Signed-off-by: default avatarGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: default avatarArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      dd9c0e36
    • Gerrit Renker's avatar
      dccp: Feature negotiation for minimum-checksum-coverage · 29450559
      Gerrit Renker authored
      This provides feature negotiation for server minimum checksum coverage
      which so far has been missing.
      
      Since sender/receiver coverage values range only from 0...15, their
      type has also been reduced in size from u16 to u4.
      
      Feature-negotiation options are now generated for both sender and receiver
      coverage, i.e. when the peer has `forgotten' to enable partial coverage
      then feature negotiation will automatically enable (negotiate) the partial
      coverage value for this connection.
      Signed-off-by: default avatarGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: default avatarIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      29450559
    • Gerrit Renker's avatar
      dccp: Deprecate old setsockopt framework · 49aebc66
      Gerrit Renker authored
      The previous setsockopt interface, which passed socket options via struct
      dccp_so_feat, is complicated/difficult to use. Continuing to support it leads to
      ugly code since the old approach did not distinguish between NN and SP values.
      
      This patch removes the old setsockopt interface and replaces it with two new
      functions to register NN/SP values for feature negotiation. 
      These are essentially wrappers around the internal __feat_register functions,
      with checking added to avoid
      
       * wrong usage (type);
       * changing values while the connection is in progress.
      Signed-off-by: default avatarGerrit Renker <gerrit@erg.abdn.ac.uk>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      49aebc66
    • Gerrit Renker's avatar
      dccp: Mechanism to resolve CCID dependencies · 0c116839
      Gerrit Renker authored
      This adds a hook to resolve features whose value depends on the choice of
      CCID. It is done at the server since it can only be done after the CCID
      values have been negotiated; i.e. the client will add its CCID preference
      list on the Change options sent in the Request, which will be reconciled
      with the local preference list of the server.
      
      The concept is documented on
      http://www.erg.abdn.ac.uk/users/gerrit/dccp/notes/feature_negotiation/\
      				implementation_notes.html#ccid_dependencies
      Signed-off-by: default avatarGerrit Renker <gerrit@erg.abdn.ac.uk>
      Acked-by: default avatarIan McDonald <ian.mcdonald@jandi.co.nz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0c116839
    • Mark McLoughlin's avatar
      virtio_net: VIRTIO_NET_F_MSG_RXBUF (imprive rcv buffer allocation) · 3f2c31d9
      Mark McLoughlin authored
      If segmentation offload is enabled by the host, we currently allocate
      maximum sized packet buffers and pass them to the host. This uses up
      20 ring entries, allowing us to supply only 20 packet buffers to the
      host with a 256 entry ring. This is a huge overhead when receiving
      small packets, and is most keenly felt when receiving MTU sized
      packets from off-host.
      
      The VIRTIO_NET_F_MRG_RXBUF feature flag is set by hosts which support
      using receive buffers which are smaller than the maximum packet size.
      In order to transfer large packets to the guest, the host merges
      together multiple receive buffers to form a larger logical buffer.
      The number of merged buffers is returned to the guest via a field in
      the virtio_net_hdr.
      
      Make use of this support by supplying single page receive buffers to
      the host. On receive, we extract the virtio_net_hdr, copy 128 bytes of
      the payload to the skb's linear data buffer and adjust the fragment
      offset to point to the remaining data. This ensures proper alignment
      and allows us to not use any paged data for small packets. If the
      payload occupies multiple pages, we simply append those pages as
      fragments and free the associated skbs.
      
      This scheme allows us to be efficient in our use of ring entries
      while still supporting large packets. Benchmarking using netperf from
      an external machine to a guest over a 10Gb/s network shows a 100%
      improvement from ~1Gb/s to ~2Gb/s. With a local host->guest benchmark
      with GSO disabled on the host side, throughput was seen to increase
      from 700Mb/s to 1.7Gb/s.
      
      Based on a patch from Herbert Xu.
      Signed-off-by: default avatarMark McLoughlin <markmc@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3f2c31d9
    • Mark McLoughlin's avatar
      virtio_net: hook up the set-tso ethtool op · 0276b497
      Mark McLoughlin authored
      Seems like an oversight that we have set-tx-csum and set-sg hooked
      up, but not set-tso.
      
      Also leads to the strange situation that if you e.g. disable tx-csum,
      then tso doesn't get disabled.
      Signed-off-by: default avatarMark McLoughlin <markmc@redhat.com>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0276b497
    • Mark McLoughlin's avatar
      virtio_net: Recycle some more rx buffer pages · 0a888fd1
      Mark McLoughlin authored
      Each time we re-fill the recv queue with buffers, we allocate
      one too many skbs and free it again when adding fails. We should
      recycle the pages allocated in this case.
      
      A previous version of this patch made trim_pages() trim trailing
      unused pages from skbs with some paged data, but this actually
      caused a barely measurable slowdown.
      Signed-off-by: default avatarMark McLoughlin <markmc@redhat.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (use netdev_priv)
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      0a888fd1
    • Alexey Dobriyan's avatar
      net: use %pF for /proc/net/ptype · 908cd2da
      Alexey Dobriyan authored
      Technically, patch changes format for modules, but I think nobody cares.
      
      	-86dd          :ipv6:ipv6_rcv+0x0
      	+86dd          ipv6_rcv+0x0/0x400 [ipv6]
      Signed-off-by: default avatarAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      908cd2da
    • Eric Dumazet's avatar
      net: make sure struct dst_entry refcount is aligned on 64 bytes · 5635c10d
      Eric Dumazet authored
      As found in the past (commit f1dd9c37
      [NET]: Fix tbench regression in 2.6.25-rc1), it is really
      important that struct dst_entry refcount is aligned on a cache line.
      
      We cannot use __atribute((aligned)), so manually pad the structure
      for 32 and 64 bit arches.
      
      for 32bit : offsetof(truct dst_entry, __refcnt) is 0x80
      for 64bit : offsetof(truct dst_entry, __refcnt) is 0xc0
      
      As it is not possible to guess at compile time cache line size,
      we use a generic value of 64 bytes, that satisfies many current arches.
      (Using 128 bytes alignment on 64bit arches would waste 64 bytes)
      
      Add a BUILD_BUG_ON to catch future updates to "struct dst_entry" dont
      break this alignment.
      
      "tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz
      (2350 MB/s instead of 2250 MB/s)
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      5635c10d
    • Eric Dumazet's avatar
      rcu: documents rculist_nulls · 536533e6
      Eric Dumazet authored
      Adds Documentation/RCU/rculist_nulls.txt file to describe how 'nulls'
      end-of-list can help in some RCU algos.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      536533e6
    • Eric Dumazet's avatar
      net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls · 3ab5aee7
      Eric Dumazet authored
      RCU was added to UDP lookups, using a fast infrastructure :
      - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the
        price of call_rcu() at freeing time.
      - hlist_nulls permits to use few memory barriers.
      
      This patch uses same infrastructure for TCP/DCCP established
      and timewait sockets.
      
      Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications
      using short lived TCP connections. A followup patch, converting
      rwlocks to spinlocks will even speedup this case.
      
      __inet_lookup_established() is pretty fast now we dont have to
      dirty a contended cache line (read_lock/read_unlock)
      
      Only established and timewait hashtable are converted to RCU
      (bind table and listen table are still using traditional locking)
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      3ab5aee7
    • Eric Dumazet's avatar
      udp: Use hlist_nulls in UDP RCU code · 88ab1932
      Eric Dumazet authored
      This is a straightforward patch, using hlist_nulls infrastructure.
      
      RCUification already done on UDP two weeks ago.
      
      Using hlist_nulls permits us to avoid some memory barriers, both
      at lookup time and delete time.
      
      Patch is large because it adds new macros to include/net/sock.h.
      These macros will be used by TCP & DCCP in next patch.
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      88ab1932
    • Eric Dumazet's avatar
      rcu: Introduce hlist_nulls variant of hlist · bbaffaca
      Eric Dumazet authored
      hlist uses NULL value to finish a chain.
      
      hlist_nulls variant use the low order bit set to 1 to signal an end-of-list marker.
      
      This allows to store many different end markers, so that some RCU lockless
      algos (used in TCP/UDP stack for example) can save some memory barriers in
      fast paths.
      
      Two new files are added :
      
      include/linux/list_nulls.h
        - mimics hlist part of include/linux/list.h, derived to hlist_nulls variant
      
      include/linux/rculist_nulls.h
        - mimics hlist part of include/linux/rculist.h, derived to hlist_nulls variant
      
         Only four helpers are declared for the moment :
      
           hlist_nulls_del_init_rcu(), hlist_nulls_del_rcu(),
           hlist_nulls_add_head_rcu() and hlist_nulls_for_each_entry_rcu()
      
      prefetches() were removed, since an end of list is not anymore NULL value.
      prefetches() could trigger useless (and possibly dangerous) memory transactions.
      
      Example of use (extracted from __udp4_lib_lookup())
      
      	struct sock *sk, *result;
              struct hlist_nulls_node *node;
              unsigned short hnum = ntohs(dport);
              unsigned int hash = udp_hashfn(net, hnum);
              struct udp_hslot *hslot = &udptable->hash[hash];
              int score, badness;
      
              rcu_read_lock();
      begin:
              result = NULL;
              badness = -1;
              sk_nulls_for_each_rcu(sk, node, &hslot->head) {
                      score = compute_score(sk, net, saddr, hnum, sport,
                                            daddr, dport, dif);
                      if (score > badness) {
                              result = sk;
                              badness = score;
                      }
              }
              /*
               * if the nulls value we got at the end of this lookup is
               * not the expected one, we must restart lookup.
               * We probably met an item that was moved to another chain.
               */
              if (get_nulls_value(node) != hash)
                      goto begin;
      
              if (result) {
                      if (unlikely(!atomic_inc_not_zero(&result->sk_refcnt)))
                              result = NULL;
                      else if (unlikely(compute_score(result, net, saddr, hnum, sport,
                                        daddr, dport, dif) < badness)) {
                              sock_put(result);
                              goto begin;
                      }
              }
              rcu_read_unlock();
              return result;
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Acked-by: default avatarPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      bbaffaca
    • Balazs Scheidler's avatar
      TPROXY: implemented IP_RECVORIGDSTADDR socket option · e8b2dfe9
      Balazs Scheidler authored
      In case UDP traffic is redirected to a local UDP socket,
      the originally addressed destination address/port
      cannot be recovered with the in-kernel tproxy.
      
      This patch adds an IP_RECVORIGDSTADDR sockopt that enables
      a IP_ORIGDSTADDR ancillary message in recvmsg(). This
      ancillary message contains the original destination address/port
      of the packet being received.
      Signed-off-by: default avatarBalazs Scheidler <bazsi@balabit.hu>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      e8b2dfe9
    • Ben Greear's avatar
      ipv4: Fix ARP behavior with many mac-vlans · 8164f1b7
      Ben Greear authored
      Ben Greear wrote:
      > I have 500 mac-vlans on a system talking to 500 other
      > mac-vlans.  My problem is that the arp-table gets extremely
      > huge because every time an arp-request comes in on all mac-vlans,
      > a stale arp entry is added for each mac-vlan.  I have filtering
      > turned on, but that doesn't help because the neigh_event_ns call
      > below will cause a stale neighbor entry to be created regardless
      > of whether a replay will be sent or not.
      > Maybe the neigh_event code should be below the checks for dont_send,
      > and only create check neigh_event_ns if we are !dont_send?
      
      The attached patch makes it work much better for me.  The patch
      will cause the code to NOT create a stale neighbor entry if we
      are not going to respond to the ARP request.  The old code
      *would* create a stale entry even if we are not going to respond.
      Signed-off-by: default avatarBen Greear <greearb@candelatech.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      8164f1b7
    • Alexander Duyck's avatar
      e1000e: enable ECC correction on 82571 silicon · 6ea7ae1d
      Alexander Duyck authored
      This change enables ECC correction for the packet buffer on all 82571
      silicon.
      Signed-off-by: default avatarAlexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: default avatarJeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      6ea7ae1d
    • Paulius Zaleckas's avatar
      phylib: make mdio-gpio work without OF (v4) · f004f3ea
      Paulius Zaleckas authored
      make mdio-gpio work with non OpenFirmware gpio implementation.
      
      Aditional changes to mdio-gpio:
      - use gpio_request() and gpio_free()
      - place irq[] array in struct mdio_gpio_info
      - add module description, author and license
      - add note about compiling this driver as module
      - rename mdc and mdio function (were ugly names)
      - change MII to MDIO in bus name
      - add __init __exit to module (un)loading functions
      - probe fails if no phys added to the bus
      - kzalloc bitbang with sizeof(*bitbang)
      
      Changes since v3:
      - keep bus naming "%x" to be compatible with existing drivers.
      
      Changes since v2:
      - more #ifdefs reduction
      - platform driver will be registered on OF platforms also
      - unified platform and OF bus_id to phy%i
      
      Changes since v1:
      - removed NO_IRQ
      - reduced #idefs
      
      Laurent, please test this driver under OF.
      Signed-off-by: default avatarPaulius Zaleckas <paulius.zaleckas@teltonika.lt>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f004f3ea
    • Paulius Zaleckas's avatar
  2. 16 Nov, 2008 3 commits
  3. 14 Nov, 2008 4 commits
    • Eric Dumazet's avatar
      net: speedup dst_release() · ef711cf1
      Eric Dumazet authored
      During tbench/oprofile sessions, I found that dst_release() was in third position.
      
      CPU: Core 2, speed 2999.68 MHz (estimated)
      Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
      samples  %        symbol name
      483726    9.0185  __copy_user_zeroing_intel
      191466    3.5697  __copy_user_intel
      185475    3.4580  dst_release
      175114    3.2648  ip_queue_xmit
      153447    2.8608  tcp_sendmsg
      108775    2.0280  tcp_recvmsg
      102659    1.9140  sysenter_past_esp
      101450    1.8914  tcp_current_mss
      95067     1.7724  __copy_from_user_ll
      86531     1.6133  tcp_transmit_skb
      
      Of course, all CPUS fight on the dst_entry associated with 127.0.0.1 
      
      Instead of first checking the refcount value, then decrement it,
      we use atomic_dec_return() to help CPU to make the right memory transaction
      (ie getting the cache line in exclusive mode)
      
      dst_release() is now at the fifth position, and tbench a litle bit faster ;)
      
      CPU: Core 2, speed 3000.1 MHz (estimated)
      Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
      samples  %        symbol name
      647107    8.8072  __copy_user_zeroing_intel
      258840    3.5229  ip_queue_xmit
      258302    3.5155  __copy_user_intel
      209629    2.8531  tcp_sendmsg
      165632    2.2543  dst_release
      149232    2.0311  tcp_current_mss
      147821    2.0119  tcp_recvmsg
      137893    1.8767  sysenter_past_esp
      127473    1.7349  __copy_from_user_ll
      121308    1.6510  ip_finish_output
      118510    1.6129  tcp_transmit_skb
      109295    1.4875  tcp_v4_rcv
      Signed-off-by: default avatarEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      ef711cf1
    • Jarek Poplawski's avatar
      pkt_sched: Remove qdisc->ops->requeue() etc. · f30ab418
      Jarek Poplawski authored
      After implementing qdisc->ops->peek() and changing sch_netem into
      classless qdisc there are no more qdisc->ops->requeue() users. This
      patch removes this method with its wrappers (qdisc_requeue()), and
      also unused qdisc->requeue structure. There are a few minor fixes of
      warnings (htb_enqueue()) and comments btw.
      
      The idea to kill ->requeue() and a similar patch were first developed
      by David S. Miller.
      Signed-off-by: default avatarJarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      f30ab418
    • Petr Tesarik's avatar
      tcp: remove an unnecessary field in struct tcp_skb_cb · 38a7ddff
      Petr Tesarik authored
      The urg_ptr field is not used anywhere and is merely confusing.
      Signed-off-by: default avatarPetr Tesarik <ptesarik@suse.cz>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      38a7ddff
    • Harvey Harrison's avatar
      isdn: use %pI4, remove get_{u8/u16/u32} and put_{u8/u16/u32} inlines · 00bcd522
      Harvey Harrison authored
      They would have been better named as get_be16, put_be16, etc.
      as they were hiding an endian shift inside.
      
      They don't add much over explicitly coding the byteshifting
      and gcc sometimes has a problem with builtin_constant_p inside
      inline functions, so it may do a better job of byteswapping
      at compile time rather than runtime.
      Signed-off-by: default avatarHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      00bcd522
  4. 13 Nov, 2008 10 commits