Commits · 564824b0c52c34692d804bb6ea214451615b0b50 · Kirill Smelkov / linux

16 Oct, 2010 2 commits

net: allocate skbs on local node · 564824b0

Eric Dumazet authored Oct 11, 2010

commit b30973f8 (node-aware skb allocation) spread a wrong habit of
allocating net drivers skbs on a given memory node : The one closest to
the NIC hardware. This is wrong because as soon as we try to scale
network stack, we need to use many cpus to handle traffic and hit
slub/slab management on cross-node allocations/frees when these cpus
have to alloc/free skbs bound to a central node.

skb allocated in RX path are ephemeral, they have a very short
lifetime : Extra cost to maintain NUMA affinity is too expensive. What
appeared as a nice idea four years ago is in fact a bad one.

In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load,
and two 10Gb NIC might deliver more than 28 million packets per second,
needing all the available cpus.

Cost of cross-node handling in network and vm stacks outperforms the
small benefit hardware had when doing its DMA transfert in its 'local'
memory node at RX time. Even trying to differentiate the two allocations
done for one skb (the sk_buff on local node, the data part on NIC
hardware node) is not enough to bring good performance.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

564824b0

r8169: use 50% less ram for RX ring · 6f0333b8

Eric Dumazet authored Oct 11, 2010

Using standard skb allocations in r8169 leads to order-3 allocations (if
PAGE_SIZE=4096), because NIC needs 16383 bytes, and skb overhead makes
this bigger than 16384 -> 32768 bytes per "skb"

Using kmalloc() permits to reduce memory requirements of one r8169 nic
by 4Mbytes. (256 frames * 16Kbytes). This is fine since a hardware bug
requires us to copy incoming frames, so we build real skb when doing
this copy.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6f0333b8

15 Oct, 2010 1 commit

ixgbe: DCB: remove DCB check config · 7662ff46

John Fastabend authored Oct 15, 2010

Remove a DCB check config from DCB configuration we
continue to configure DCB even if it fails so don't
even bother to check.  Plus user space (lldpad) checks
this before programming the hw anyways.

Worse case is we program some values into the hw that
don't make total sense resulting in incorrect bandwidth
allocation.
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7662ff46

14 Oct, 2010 11 commits

igb: add check for fiber/serdes devices to igb_set_spd_dplx; · cd2638a8

Carolyn Wyborny authored Oct 12, 2010

Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Tested-by: Emil Tantilov <emil.s.tantilov@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd2638a8

ixgbe: declare functions as static · 5d5b7c39

Emil Tantilov authored Oct 12, 2010

Following patch fixes warnings reported by `make namespacecheck`

Reported by Stephen Hemminger

CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

5d5b7c39

ixgbe: remove unused functions · f32f837b

Emil Tantilov authored Oct 12, 2010

Remove functions that are declared, but not used in the driver.
This patch fixes warnings reported by `make namespacecheck`

Reported by Stephen Hemminger

CC: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Stephen Ko <stephen.s.ko@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f32f837b

cnic: Add support for 57712 device · ee87a82a

Michael Chan authored Oct 13, 2010

Add new interrupt ack functions and other hardware interface logic to
support the new device.

Update version to 2.2.6.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ee87a82a

cnic: Decouple uio close from cnic shutdown · a3ceeeb8

Michael Chan authored Oct 13, 2010

During cnic shutdown, the original driver code requires userspace to
close the uio device within a few seconds.  This doesn't always happen
as the userapp may be hung or otherwise take a long time to close.  The
system may crash when this happens.

We fix the problem by decoupling the uio structures from the cnic
structures during cnic shutdown.  We do not unregister the uio device
until the cnic driver is unloaded.  This eliminates the unreliable wait
loop for uio to close.

All uio structures are kept in a linked list.  If the device is shutdown
and later brought back up again, the uio strcture will be found in the
linked list and coupled back to the cnic structures.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a3ceeeb8

cnic: Add cnic_uio_dev struct · cd801536

Michael Chan authored Oct 13, 2010

and put all uio related structures and ring buffers in it.  This allows
uio operations to be done independent of the cnic device structures.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cd801536

cnic: Add cnic_free_uio() · c06c0462

Michael Chan authored Oct 13, 2010

to free all UIO related structures.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c06c0462

cnic: Defer iscsi connection cleanup · fdf24086

Michael Chan authored Oct 13, 2010

The bnx2x devices require a 2 second quiet time before sending the last
RAMROD command to destroy a connection. This sleep wait adds up to a
long delay when iscsid is serially destroying maultiple connections.

Create a workqueue to perform the final connection cleanup in the
background to speed up the process. This significantly speeds up the
process as the wait time can be done in parallel for multiple connections.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fdf24086

cnic: Add cnic_bnx2x_destroy_ramrod() · a2c9e769

Michael Chan authored Oct 13, 2010

Refactoring code for the next patch to defer connection clean up.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a2c9e769

cnic: Convert ctx_flags to bit fields · 6e0dda0c

Michael Chan authored Oct 13, 2010

so that we can additional bit definitions without requiring a spinlock.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e0dda0c

cnic: Add common cnic_request_irq() · 6e0dc643

Michael Chan authored Oct 13, 2010

to reduce some duplicate code.  Also, use tasklet_kill() in
cnic_free_irq() to wait for the cnic_irq_task to complete.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

6e0dc643

13 Oct, 2010 4 commits

Documentation: Update Phonet doc for Pipe controller changes · f20ce779

Kumar Sanghvi authored Oct 12, 2010

Updates to Phonet doc for Pipe controller 'connect' socket
implementation and changes related to socket options.
Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f20ce779

Phonet: 'connect' socket implementation for Pipe controller · b3d62553

Kumar Sanghvi authored Oct 12, 2010

Based on suggestion by Rémi Denis-Courmont to implement 'connect'
for Pipe controller logic,  this patch implements 'connect' socket
call for the Pipe controller logic.
The patch does following:-
- Removes setsockopts for PNPIPE_CREATE and PNPIPE_DESTROY
- Adds setsockopt for setting the Pipe handle value
- Implements connect socket call
- Updates the Pipe controller logic

User-space should now follow below sequence with Pipe controller:-
-socket
-bind
-setsockopt for PNPIPE_PIPE_HANDLE
-connect
-setsockopt for PNPIPE_ENCAP_IP
-setsockopt for PNPIPE_ENABLE

GPRS/3G data has been tested working fine with this.
Signed-off-by: Kumar Sanghvi <kumar.sanghvi@stericsson.com>
Acked-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

b3d62553

tipc: clean out all instances of #if 0'd unused code · 7368ddf1

Paul Gortmaker authored Oct 12, 2010

Remove all instances of legacy, or as yet to be implemented code
that is currently living within an #if 0 ... #endif block.
In the rare instance that some of it be needed in the future,
it can still be dragged out of history, but there is no need
for it to sit in mainline.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7368ddf1

s390: ctcm_mpc: Fix build after netdev refcount changes. · 9fbb711e
David S. Miller authored Oct 13, 2010
```
Reported-by: Sachin Sant <sachinp@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
```
9fbb711e

12 Oct, 2010 9 commits

net: percpu net_device refcount · 29b4433d

Eric Dumazet authored Oct 11, 2010

We tried very hard to remove all possible dev_hold()/dev_put() pairs in
network stack, using RCU conversions.

There is still an unavoidable device refcount change for every dst we
create/destroy, and this can slow down some workloads (routers or some
app servers, mmap af_packet)

We can switch to a percpu refcount implementation, now dynamic per_cpu
infrastructure is mature. On a 64 cpus machine, this consumes 256 bytes
per device.

On x86, dev_hold(dev) code :

before
        lock    incl 0x280(%ebx)
after:
        movl    0x260(%ebx),%eax
        incl    fs:(%eax)

Stress bench :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE)

Before:

real    1m1.662s
user    0m14.373s
sys     12m55.960s

After:

real    0m51.179s
user    0m15.329s
sys     10m15.942s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

29b4433d

bnx2x: Fixing a typo: added a missing RSS enablement · f0b9f472

Dmitry Kravkov authored Oct 12, 2010

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com>
Signed-off-by: Vladislav Zolotarov <vladz@broadcom.com>
Tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f0b9f472

Merge branch 'dccp' of git://eden-feed.erg.abdn.ac.uk/net-next-2.6 · 8fa6e3d4
David S. Miller authored Oct 12, 2010

8fa6e3d4

dccp: cosmetics - warning format · 2f34b329

Gerrit Renker authored Oct 11, 2010

This omits the redundant "DCCP:" in warning messages, since DCCP_WARN() already
echoes the function name, avoiding messages like

kernel: [10988.766503] dccp_close: DCCP: ABORT -- 209 bytes unread
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>

2f34b329

dccp: schedule an Ack when receiving timestamps · ecdfbdab

Gerrit Renker authored Oct 11, 2010

This schedules an Ack when receiving a timestamp, exploiting the
existing inet_csk_schedule_ack() function, saving one case in the
`dccp_ack_pending()' function.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>

ecdfbdab

dccp: generalise data-loss condition · d196c9a5

Ivo Calado authored Oct 11, 2010

This patch generalises the task of determining data loss from RFC 4340, 7.7.1.

Let S_A, S_B be sequence numbers such that S_B is "after" S_A, and let
N_B be the NDP count of packet S_B. Then, using modulo-2^48 arithmetic,
 D = S_B - S_A - 1  is an upper bound of the number of lost data packets,
 D - N_B            is an approximation of the number of lost data packets
                    (there are cases where this is not exact).

The patch implements this as
 dccp_loss_count(S_A, S_B, N_B) := max(S_B - S_A - 1 - N_B, 0)
Signed-off-by: Ivo Calado <ivocalado@embedded.ufcg.edu.br>
Signed-off-by: Erivaldo Xavier <desadoc@gmail.com>
Signed-off-by: Leandro Sales <leandroal@gmail.com>
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>

d196c9a5

dccp: remove unused argument in CCID tx function · baf9e782

Gerrit Renker authored Oct 11, 2010

This removes the argument `more' from ccid_hc_tx_packet_sent, since it was
nowhere used in the entire code.

(Btw, this argument was not even used in the original KAME code where the
 function initially came from; compare the variable moreToSend in the
 freebsd61-dccp-kame-28.08.2006.patch kept by Emmanuel Lochin.)
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>

baf9e782

dccp: merge now-reduced connect_init() function · 93344af4

Gerrit Renker authored Oct 11, 2010

After moving the assignment of GAR/ISS from dccp_connect_init() to
dccp_transmit_skb(), the former function becomes very small, so that
a merger with dccp_connect() suggests itself.
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>

93344af4

dccp: fix the adjustments to AWL and SWL · 0b53d460

Gerrit Renker authored Oct 11, 2010

This fixes a problem and a potential loophole with regard to seqno/ackno
validity: currently the initial adjustments to AWL/SWL are only performed
once at the begin of the connection, during the handshake.

Since the Sequence Window feature is always greater than Wmin=32 (7.5.2),
it is however necessary to perform these adjustments at least for the first
W/W' (variables as per 7.5.1) packets in the lifetime of a connection.

This requirement is complicated by the fact that W/W' can change at any time
during the lifetime of a connection.

Therefore it is better to perform that safety check each time SWL/AWL are
updated, as implemented by the patch.

A second problem solved by this patch is that the remote/local Sequence Window
feature values (which set the bounds for AWL/SWL/SWH) are undefined until the
feature negotiation has completed.

During the initial handshake we have more stringent sequence number protection;
the changes added by this patch effect that {A,S}W{L,H} are within the correct
bounds at the instant that feature negotiation completes (since the SeqWin
feature activation handlers call dccp_update_gsr/gss()).
Signed-off-by: Gerrit Renker <gerrit@erg.abdn.ac.uk>

0b53d460

11 Oct, 2010 13 commits

bnx2: Enable AER on PCIE devices only · c239f279

Michael Chan authored Oct 11, 2010

To prevent unnecessary error message.  pci_save_state() is also moved to
the end of ->probe() so that all PCI config, including AER state, will be
saved.

Update version to 2.0.18.
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <mchan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c239f279

bnx2: Update firmware to 6.0.x. · 22fa159d

Michael Chan authored Oct 11, 2010

- Improved flow control and simplified interface
- Use hardware RSS indirection table instead of the slower firmware-
  based table
- Lower latency interrupt on 5709
Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <benli@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

22fa159d

neigh: reorder struct neighbour fields · e37ef961

Eric Dumazet authored Oct 11, 2010

Le mardi 12 octobre 2010 à 00:02 +0200, Eric Dumazet a écrit :
> Here is the followup patch.
>
> Thanks !
>

Oops, this was an old version, the up2date ones also took care of "used"
field.

I guess its time for a sleep, sorry again.

[PATCH net-next V2] neigh: reorder struct neighbour fields

(refcnt) and (ha_lock, ha, used, dev, output, ops, primary_key) should
be placed on a separate cache lines.

refcnt can be often written, while other fields are mostly read.

This gave me good result on stress test :

before:

real    0m45.570s
user    0m15.525s
sys     9m56.669s

After:

real    0m41.841s
user    0m15.261s
sys     8m45.949s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

e37ef961

net dst: use a percpu_counter to track entries · fc66f95c

Eric Dumazet authored Oct 08, 2010

struct dst_ops tracks number of allocated dst in an atomic_t field,
subject to high cache line contention in stress workload.

Switch to a percpu_counter, to reduce number of time we need to dirty a
central location. Place it on a separate cache line to avoid dirtying
read only fields.

Stress test :

(Sending 160.000.000 UDP frames,
IP route cache disabled, dual E5540 @2.53GHz,
32bit kernel, FIB_TRIE, SLUB/NUMA)

Before:

real    0m51.179s
user    0m15.329s
sys     10m15.942s

After:

real	0m45.570s
user	0m15.525s
sys	9m56.669s

With a small reordering of struct neighbour fields, subject of a
following patch, (to separate refcnt from other read mostly fields)

real	0m41.841s
user	0m15.261s
sys	8m45.949s
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

fc66f95c

neigh: Protect neigh->ha[] with a seqlock · 0ed8ddf4

Eric Dumazet authored Oct 07, 2010

Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
dirtying neighbour in stress situation (many different flows / dsts)

Dirtying takes place because of read_lock(&n->lock) and n->used writes.

Switching to a seqlock, and writing n->used only on jiffies changes
permits less dirtying.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0ed8ddf4

Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 · d122179a
David S. Miller authored Oct 11, 2010
```
Conflicts:
	net/core/ethtool.c
```
d122179a

net: clear heap allocations for privileged ethtool actions · b00916b1

Kees Cook authored Oct 11, 2010

Several other ethtool functions leave heap uncleared (potentially) by
drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
are well controlled. In some situations (e.g. unchecked error conditions),
the heap will remain unchanged in areas before copying back to userspace.
Note that these are less of an issue since these all require CAP_NET_ADMIN.

Cc: stable@kernel.org
Signed-off-by: Kees Cook <kees.cook@canonical.com>
Acked-by: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>