1. 04 Nov, 2013 26 commits
    • Jason Wang's avatar
      virtio-net: refill only when device is up during setting queues · 30983d3c
      Jason Wang authored
      [ Upstream commit 35ed159b ]
      
      We used to schedule the refill work unconditionally after changing the
      number of queues. This may lead an issue if the device is not
      up. Since we only try to cancel the work in ndo_stop(), this may cause
      the refill work still work after removing the device. Fix this by only
      schedule the work when device is up.
      
      The bug were introduce by commit 9b9cd802.
      (virtio-net: fix the race between channels setting and refill)
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      30983d3c
    • Jason Wang's avatar
      virtio-net: fix the race between channels setting and refill · 413054e7
      Jason Wang authored
      [ Upstream commit 9b9cd802 ]
      
      Commit 55257d72 (virtio-net: fill only rx queues
      which are being used) tries to refill on demand when changing the number of
      channels by call try_refill_recv() directly, this may race:
      
      - the refill work who may do the refill in the same time
      - the try_refill_recv() called in bh since napi was not disabled
      
      Which may led guest complain during setting channels:
      
      virtio_net virtio0: input.1:id 0 is not a head!
      
      Solve this issue by scheduling a refill work which can guarantee the
      serialization of refill.
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      413054e7
    • Jason Wang's avatar
      virtio-net: don't respond to cpu hotplug notifier if we're not ready · f0272171
      Jason Wang authored
      [ Upstream commit 3ab098df ]
      
      We're trying to re-configure the affinity unconditionally in cpu hotplug
      callback. This may lead the issue during resuming from s3/s4 since
      
      - virt queues haven't been allocated at that time.
      - it's unnecessary since thaw method will re-configure the affinity.
      
      Fix this issue by checking the config_enable and do nothing is we're not ready.
      
      The bug were introduced by commit 8de4b2f3
      (virtio-net: reset virtqueue affinity when doing cpu hotplug).
      Acked-by: default avatarMichael S. Tsirkin <mst@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Wanlong Gao <gaowanlong@cn.fujitsu.com>
      Reviewed-by: default avatarWanlong Gao <gaowanlong@cn.fujitsu.com>
      Signed-off-by: default avatarJason Wang <jasowang@redhat.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f0272171
    • Eric Dumazet's avatar
      bnx2x: record rx queue for LRO packets · 608be703
      Eric Dumazet authored
      [ Upstream commit 60e66fee ]
      
      RPS support is kind of broken on bnx2x, because only non LRO packets
      get proper rx queue information. This triggers reorders, as it seems
      bnx2x like to generate a non LRO packet for segment including TCP PUSH
      flag : (this might be pure coincidence, but all the reorders I've
      seen involve segments with a PUSH)
      
      11:13:34.335847 IP A > B: . 415808:447136(31328) ack 1 win 457 <nop,nop,timestamp 3789336 3985797>
      11:13:34.335992 IP A > B: . 447136:448560(1424) ack 1 win 457 <nop,nop,timestamp 3789336 3985797>
      11:13:34.336391 IP A > B: . 448560:479888(31328) ack 1 win 457 <nop,nop,timestamp 3789337 3985797>
      11:13:34.336425 IP A > B: P 511216:512640(1424) ack 1 win 457 <nop,nop,timestamp 3789337 3985798>
      11:13:34.336423 IP A > B: . 479888:511216(31328) ack 1 win 457 <nop,nop,timestamp 3789337 3985798>
      11:13:34.336924 IP A > B: . 512640:543968(31328) ack 1 win 457 <nop,nop,timestamp 3789337 3985798>
      11:13:34.336963 IP A > B: . 543968:575296(31328) ack 1 win 457 <nop,nop,timestamp 3789337 3985798>
      
      We must call skb_record_rx_queue() to properly give to RPS (and more
      generally for TX queue selection on forward path) the receive queue
      information.
      
      Similar fix is needed for skb_mark_napi_id(), but will be handled
      in a separate patch to ease stable backports.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Eilon Greenstein <eilong@broadcom.com>
      Acked-by: default avatarDmitry Kravkov <dmitry@broadcom.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      608be703
    • Mathias Krause's avatar
      connector: use nlmsg_len() to check message length · 69b9c5ea
      Mathias Krause authored
      [ Upstream commit 162b2bed ]
      
      The current code tests the length of the whole netlink message to be
      at least as long to fit a cn_msg. This is wrong as nlmsg_len includes
      the length of the netlink message header. Use nlmsg_len() instead to
      fix this "off-by-NLMSG_HDRLEN" size check.
      Signed-off-by: default avatarMathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      69b9c5ea
    • Mathias Krause's avatar
      unix_diag: fix info leak · 39283085
      Mathias Krause authored
      [ Upstream commit 6865d1e8 ]
      
      When filling the netlink message we miss to wipe the pad field,
      therefore leak one byte of heap memory to userland. Fix this by
      setting pad to 0.
      Signed-off-by: default avatarMathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      39283085
    • Salva Peiró's avatar
      farsync: fix info leak in ioctl · 3a267360
      Salva Peiró authored
      [ Upstream commit 96b34040 ]
      
      The fst_get_iface() code fails to initialize the two padding bytes of
      struct sync_serial_settings after the ->loopback member. Add an explicit
      memset(0) before filling the structure to avoid the info leak.
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3a267360
    • Eric Dumazet's avatar
      l2tp: must disable bh before calling l2tp_xmit_skb() · a4153677
      Eric Dumazet authored
      [ Upstream commit 455cc32b ]
      
      François Cachereul made a very nice bug report and suspected
      the bh_lock_sock() / bh_unlok_sock() pair used in l2tp_xmit_skb() from
      process context was not good.
      
      This problem was added by commit 6af88da1
      ("l2tp: Fix locking in l2tp_core.c").
      
      l2tp_eth_dev_xmit() runs from BH context, so we must disable BH
      from other l2tp_xmit_skb() users.
      
      [  452.060011] BUG: soft lockup - CPU#1 stuck for 23s! [accel-pppd:6662]
      [  452.061757] Modules linked in: l2tp_ppp l2tp_netlink l2tp_core pppoe pppox
      ppp_generic slhc ipv6 ext3 mbcache jbd virtio_balloon xfs exportfs dm_mod
      virtio_blk ata_generic virtio_net floppy ata_piix libata virtio_pci virtio_ring virtio [last unloaded: scsi_wait_scan]
      [  452.064012] CPU 1
      [  452.080015] BUG: soft lockup - CPU#2 stuck for 23s! [accel-pppd:6643]
      [  452.080015] CPU 2
      [  452.080015]
      [  452.080015] Pid: 6643, comm: accel-pppd Not tainted 3.2.46.mini #1 Bochs Bochs
      [  452.080015] RIP: 0010:[<ffffffff81059f6c>]  [<ffffffff81059f6c>] do_raw_spin_lock+0x17/0x1f
      [  452.080015] RSP: 0018:ffff88007125fc18  EFLAGS: 00000293
      [  452.080015] RAX: 000000000000aba9 RBX: ffffffff811d0703 RCX: 0000000000000000
      [  452.080015] RDX: 00000000000000ab RSI: ffff8800711f6896 RDI: ffff8800745c8110
      [  452.080015] RBP: ffff88007125fc18 R08: 0000000000000020 R09: 0000000000000000
      [  452.080015] R10: 0000000000000000 R11: 0000000000000280 R12: 0000000000000286
      [  452.080015] R13: 0000000000000020 R14: 0000000000000240 R15: 0000000000000000
      [  452.080015] FS:  00007fdc0cc24700(0000) GS:ffff8800b6f00000(0000) knlGS:0000000000000000
      [  452.080015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  452.080015] CR2: 00007fdb054899b8 CR3: 0000000074404000 CR4: 00000000000006a0
      [  452.080015] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  452.080015] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  452.080015] Process accel-pppd (pid: 6643, threadinfo ffff88007125e000, task ffff8800b27e6dd0)
      [  452.080015] Stack:
      [  452.080015]  ffff88007125fc28 ffffffff81256559 ffff88007125fc98 ffffffffa01b2bd1
      [  452.080015]  ffff88007125fc58 000000000000000c 00000000029490d0 0000009c71dbe25e
      [  452.080015]  000000000000005c 000000080000000e 0000000000000000 ffff880071170600
      [  452.080015] Call Trace:
      [  452.080015]  [<ffffffff81256559>] _raw_spin_lock+0xe/0x10
      [  452.080015]  [<ffffffffa01b2bd1>] l2tp_xmit_skb+0x189/0x4ac [l2tp_core]
      [  452.080015]  [<ffffffffa01c2d36>] pppol2tp_sendmsg+0x15e/0x19c [l2tp_ppp]
      [  452.080015]  [<ffffffff811c7872>] __sock_sendmsg_nosec+0x22/0x24
      [  452.080015]  [<ffffffff811c83bd>] sock_sendmsg+0xa1/0xb6
      [  452.080015]  [<ffffffff81254e88>] ? __schedule+0x5c1/0x616
      [  452.080015]  [<ffffffff8103c7c6>] ? __dequeue_signal+0xb7/0x10c
      [  452.080015]  [<ffffffff810bbd21>] ? fget_light+0x75/0x89
      [  452.080015]  [<ffffffff811c8444>] ? sockfd_lookup_light+0x20/0x56
      [  452.080015]  [<ffffffff811c9b34>] sys_sendto+0x10c/0x13b
      [  452.080015]  [<ffffffff8125cac2>] system_call_fastpath+0x16/0x1b
      [  452.080015] Code: 81 48 89 e5 72 0c 31 c0 48 81 ff 45 66 25 81 0f 92 c0 5d c3 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 0f b6 d4 38 d0 74 06 f3 90 <8a> 07 eb f6 5d c3 90 90 55 48 89 e5 9c 58 0f 1f 44 00 00 5d c3
      [  452.080015] Call Trace:
      [  452.080015]  [<ffffffff81256559>] _raw_spin_lock+0xe/0x10
      [  452.080015]  [<ffffffffa01b2bd1>] l2tp_xmit_skb+0x189/0x4ac [l2tp_core]
      [  452.080015]  [<ffffffffa01c2d36>] pppol2tp_sendmsg+0x15e/0x19c [l2tp_ppp]
      [  452.080015]  [<ffffffff811c7872>] __sock_sendmsg_nosec+0x22/0x24
      [  452.080015]  [<ffffffff811c83bd>] sock_sendmsg+0xa1/0xb6
      [  452.080015]  [<ffffffff81254e88>] ? __schedule+0x5c1/0x616
      [  452.080015]  [<ffffffff8103c7c6>] ? __dequeue_signal+0xb7/0x10c
      [  452.080015]  [<ffffffff810bbd21>] ? fget_light+0x75/0x89
      [  452.080015]  [<ffffffff811c8444>] ? sockfd_lookup_light+0x20/0x56
      [  452.080015]  [<ffffffff811c9b34>] sys_sendto+0x10c/0x13b
      [  452.080015]  [<ffffffff8125cac2>] system_call_fastpath+0x16/0x1b
      [  452.064012]
      [  452.064012] Pid: 6662, comm: accel-pppd Not tainted 3.2.46.mini #1 Bochs Bochs
      [  452.064012] RIP: 0010:[<ffffffff81059f6e>]  [<ffffffff81059f6e>] do_raw_spin_lock+0x19/0x1f
      [  452.064012] RSP: 0018:ffff8800b6e83ba0  EFLAGS: 00000297
      [  452.064012] RAX: 000000000000aaa9 RBX: ffff8800b6e83b40 RCX: 0000000000000002
      [  452.064012] RDX: 00000000000000aa RSI: 000000000000000a RDI: ffff8800745c8110
      [  452.064012] RBP: ffff8800b6e83ba0 R08: 000000000000c802 R09: 000000000000001c
      [  452.064012] R10: ffff880071096c4e R11: 0000000000000006 R12: ffff8800b6e83b18
      [  452.064012] R13: ffffffff8125d51e R14: ffff8800b6e83ba0 R15: ffff880072a589c0
      [  452.064012] FS:  00007fdc0b81e700(0000) GS:ffff8800b6e80000(0000) knlGS:0000000000000000
      [  452.064012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  452.064012] CR2: 0000000000625208 CR3: 0000000074404000 CR4: 00000000000006a0
      [  452.064012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  452.064012] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [  452.064012] Process accel-pppd (pid: 6662, threadinfo ffff88007129a000, task ffff8800744f7410)
      [  452.064012] Stack:
      [  452.064012]  ffff8800b6e83bb0 ffffffff81256559 ffff8800b6e83bc0 ffffffff8121c64a
      [  452.064012]  ffff8800b6e83bf0 ffffffff8121ec7a ffff880072a589c0 ffff880071096c62
      [  452.064012]  0000000000000011 ffffffff81430024 ffff8800b6e83c80 ffffffff8121f276
      [  452.064012] Call Trace:
      [  452.064012]  <IRQ>
      [  452.064012]  [<ffffffff81256559>] _raw_spin_lock+0xe/0x10
      [  452.064012]  [<ffffffff8121c64a>] spin_lock+0x9/0xb
      [  452.064012]  [<ffffffff8121ec7a>] udp_queue_rcv_skb+0x186/0x269
      [  452.064012]  [<ffffffff8121f276>] __udp4_lib_rcv+0x297/0x4ae
      [  452.064012]  [<ffffffff8121c178>] ? raw_rcv+0xe9/0xf0
      [  452.064012]  [<ffffffff8121f4a7>] udp_rcv+0x1a/0x1c
      [  452.064012]  [<ffffffff811fe385>] ip_local_deliver_finish+0x12b/0x1a5
      [  452.064012]  [<ffffffff811fe54e>] ip_local_deliver+0x53/0x84
      [  452.064012]  [<ffffffff811fe1d0>] ip_rcv_finish+0x2bc/0x2f3
      [  452.064012]  [<ffffffff811fe78f>] ip_rcv+0x210/0x269
      [  452.064012]  [<ffffffff8101911e>] ? kvm_clock_get_cycles+0x9/0xb
      [  452.064012]  [<ffffffff811d88cd>] __netif_receive_skb+0x3a5/0x3f7
      [  452.064012]  [<ffffffff811d8eba>] netif_receive_skb+0x57/0x5e
      [  452.064012]  [<ffffffff811cf30f>] ? __netdev_alloc_skb+0x1f/0x3b
      [  452.064012]  [<ffffffffa0049126>] virtnet_poll+0x4ba/0x5a4 [virtio_net]
      [  452.064012]  [<ffffffff811d9417>] net_rx_action+0x73/0x184
      [  452.064012]  [<ffffffffa01b2cc2>] ? l2tp_xmit_skb+0x27a/0x4ac [l2tp_core]
      [  452.064012]  [<ffffffff810343b9>] __do_softirq+0xc3/0x1a8
      [  452.064012]  [<ffffffff81013b56>] ? ack_APIC_irq+0x10/0x12
      [  452.064012]  [<ffffffff81256559>] ? _raw_spin_lock+0xe/0x10
      [  452.064012]  [<ffffffff8125e0ac>] call_softirq+0x1c/0x26
      [  452.064012]  [<ffffffff81003587>] do_softirq+0x45/0x82
      [  452.064012]  [<ffffffff81034667>] irq_exit+0x42/0x9c
      [  452.064012]  [<ffffffff8125e146>] do_IRQ+0x8e/0xa5
      [  452.064012]  [<ffffffff8125676e>] common_interrupt+0x6e/0x6e
      [  452.064012]  <EOI>
      [  452.064012]  [<ffffffff810b82a1>] ? kfree+0x8a/0xa3
      [  452.064012]  [<ffffffffa01b2cc2>] ? l2tp_xmit_skb+0x27a/0x4ac [l2tp_core]
      [  452.064012]  [<ffffffffa01b2c25>] ? l2tp_xmit_skb+0x1dd/0x4ac [l2tp_core]
      [  452.064012]  [<ffffffffa01c2d36>] pppol2tp_sendmsg+0x15e/0x19c [l2tp_ppp]
      [  452.064012]  [<ffffffff811c7872>] __sock_sendmsg_nosec+0x22/0x24
      [  452.064012]  [<ffffffff811c83bd>] sock_sendmsg+0xa1/0xb6
      [  452.064012]  [<ffffffff81254e88>] ? __schedule+0x5c1/0x616
      [  452.064012]  [<ffffffff8103c7c6>] ? __dequeue_signal+0xb7/0x10c
      [  452.064012]  [<ffffffff810bbd21>] ? fget_light+0x75/0x89
      [  452.064012]  [<ffffffff811c8444>] ? sockfd_lookup_light+0x20/0x56
      [  452.064012]  [<ffffffff811c9b34>] sys_sendto+0x10c/0x13b
      [  452.064012]  [<ffffffff8125cac2>] system_call_fastpath+0x16/0x1b
      [  452.064012] Code: 89 e5 72 0c 31 c0 48 81 ff 45 66 25 81 0f 92 c0 5d c3 55 b8 00 01 00 00 48 89 e5 f0 66 0f c1 07 0f b6 d4 38 d0 74 06 f3 90 8a 07 <eb> f6 5d c3 90 90 55 48 89 e5 9c 58 0f 1f 44 00 00 5d c3 55 48
      [  452.064012] Call Trace:
      [  452.064012]  <IRQ>  [<ffffffff81256559>] _raw_spin_lock+0xe/0x10
      [  452.064012]  [<ffffffff8121c64a>] spin_lock+0x9/0xb
      [  452.064012]  [<ffffffff8121ec7a>] udp_queue_rcv_skb+0x186/0x269
      [  452.064012]  [<ffffffff8121f276>] __udp4_lib_rcv+0x297/0x4ae
      [  452.064012]  [<ffffffff8121c178>] ? raw_rcv+0xe9/0xf0
      [  452.064012]  [<ffffffff8121f4a7>] udp_rcv+0x1a/0x1c
      [  452.064012]  [<ffffffff811fe385>] ip_local_deliver_finish+0x12b/0x1a5
      [  452.064012]  [<ffffffff811fe54e>] ip_local_deliver+0x53/0x84
      [  452.064012]  [<ffffffff811fe1d0>] ip_rcv_finish+0x2bc/0x2f3
      [  452.064012]  [<ffffffff811fe78f>] ip_rcv+0x210/0x269
      [  452.064012]  [<ffffffff8101911e>] ? kvm_clock_get_cycles+0x9/0xb
      [  452.064012]  [<ffffffff811d88cd>] __netif_receive_skb+0x3a5/0x3f7
      [  452.064012]  [<ffffffff811d8eba>] netif_receive_skb+0x57/0x5e
      [  452.064012]  [<ffffffff811cf30f>] ? __netdev_alloc_skb+0x1f/0x3b
      [  452.064012]  [<ffffffffa0049126>] virtnet_poll+0x4ba/0x5a4 [virtio_net]
      [  452.064012]  [<ffffffff811d9417>] net_rx_action+0x73/0x184
      [  452.064012]  [<ffffffffa01b2cc2>] ? l2tp_xmit_skb+0x27a/0x4ac [l2tp_core]
      [  452.064012]  [<ffffffff810343b9>] __do_softirq+0xc3/0x1a8
      [  452.064012]  [<ffffffff81013b56>] ? ack_APIC_irq+0x10/0x12
      [  452.064012]  [<ffffffff81256559>] ? _raw_spin_lock+0xe/0x10
      [  452.064012]  [<ffffffff8125e0ac>] call_softirq+0x1c/0x26
      [  452.064012]  [<ffffffff81003587>] do_softirq+0x45/0x82
      [  452.064012]  [<ffffffff81034667>] irq_exit+0x42/0x9c
      [  452.064012]  [<ffffffff8125e146>] do_IRQ+0x8e/0xa5
      [  452.064012]  [<ffffffff8125676e>] common_interrupt+0x6e/0x6e
      [  452.064012]  <EOI>  [<ffffffff810b82a1>] ? kfree+0x8a/0xa3
      [  452.064012]  [<ffffffffa01b2cc2>] ? l2tp_xmit_skb+0x27a/0x4ac [l2tp_core]
      [  452.064012]  [<ffffffffa01b2c25>] ? l2tp_xmit_skb+0x1dd/0x4ac [l2tp_core]
      [  452.064012]  [<ffffffffa01c2d36>] pppol2tp_sendmsg+0x15e/0x19c [l2tp_ppp]
      [  452.064012]  [<ffffffff811c7872>] __sock_sendmsg_nosec+0x22/0x24
      [  452.064012]  [<ffffffff811c83bd>] sock_sendmsg+0xa1/0xb6
      [  452.064012]  [<ffffffff81254e88>] ? __schedule+0x5c1/0x616
      [  452.064012]  [<ffffffff8103c7c6>] ? __dequeue_signal+0xb7/0x10c
      [  452.064012]  [<ffffffff810bbd21>] ? fget_light+0x75/0x89
      [  452.064012]  [<ffffffff811c8444>] ? sockfd_lookup_light+0x20/0x56
      [  452.064012]  [<ffffffff811c9b34>] sys_sendto+0x10c/0x13b
      [  452.064012]  [<ffffffff8125cac2>] system_call_fastpath+0x16/0x1b
      Reported-by: default avatarFrançois Cachereul <f.cachereul@alphalink.fr>
      Tested-by: default avatarFrançois Cachereul <f.cachereul@alphalink.fr>
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: James Chapman <jchapman@katalix.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a4153677
    • Christophe Gouault's avatar
      vti: get rid of nf mark rule in prerouting · a77db440
      Christophe Gouault authored
      [ Upstream commit 7263a518 ]
      
      This patch fixes and improves the use of vti interfaces (while
      lightly changing the way of configuring them).
      
      Currently:
      
      - it is necessary to identify and mark inbound IPsec
        packets destined to each vti interface, via netfilter rules in
        the mangle table at prerouting hook.
      
      - the vti module cannot retrieve the right tunnel in input since
        commit b9959fd3: vti tunnels all have an i_key, but the tunnel lookup
        is done with flag TUNNEL_NO_KEY, so there no chance to retrieve them.
      
      - the i_key is used by the outbound processing as a mark to lookup
        for the right SP and SA bundle.
      
      This patch uses the o_key to store the vti mark (instead of i_key) and
      enables:
      
      - to avoid the need for previously marking the inbound skbuffs via a
        netfilter rule.
      - to properly retrieve the right tunnel in input, only based on the IPsec
        packet outer addresses.
      - to properly perform an inbound policy check (using the tunnel o_key
        as a mark).
      - to properly perform an outbound SPD and SAD lookup (using the tunnel
        o_key as a mark).
      - to keep the current mark of the skbuff. The skbuff mark is neither
        used nor changed by the vti interface. Only the vti interface o_key
        is used.
      
      SAs have a wildcard mark.
      SPs have a mark equal to the vti interface o_key.
      
      The vti interface must be created as follows (i_key = 0, o_key = mark):
      
         ip link add vti1 mode vti local 1.1.1.1 remote 2.2.2.2 okey 1
      
      The SPs attached to vti1 must be created as follows (mark = vti1 o_key):
      
         ip xfrm policy add dir out mark 1 tmpl src 1.1.1.1 dst 2.2.2.2 \
            proto esp mode tunnel
         ip xfrm policy add dir in  mark 1 tmpl src 2.2.2.2 dst 1.1.1.1 \
            proto esp mode tunnel
      
      The SAs are created with the default wildcard mark. There is no
      distinction between global vs. vti SAs. Just their addresses will
      possibly link them to a vti interface:
      
         ip xfrm state add src 1.1.1.1 dst 2.2.2.2 proto esp spi 1000 mode tunnel \
                       enc "cbc(aes)" "azertyuiopqsdfgh"
      
         ip xfrm state add src 2.2.2.2 dst 1.1.1.1 proto esp spi 2000 mode tunnel \
                       enc "cbc(aes)" "sqbdhgqsdjqjsdfh"
      
      To avoid matching "global" (not vti) SPs in vti interfaces, global SPs
      should no use the default wildcard mark, but explicitly match mark 0.
      
      To avoid a double SPD lookup in input and output (in global and vti SPDs),
      the NOPOLICY and NOXFRM options should be set on the vti interfaces:
      
         echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_policy
         echo 1 > /proc/sys/net/ipv4/conf/vti1/disable_xfrm
      
      The outgoing traffic is steered to vti1 by a route via the vti interface:
      
         ip route add 192.168.0.0/16 dev vti1
      
      The incoming IPsec traffic is steered to vti1 because its outer addresses
      match the vti1 tunnel configuration.
      Signed-off-by: default avatarChristophe Gouault <christophe.gouault@6wind.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a77db440
    • Marc Kleine-Budde's avatar
      net: vlan: fix nlmsg size calculation in vlan_get_size() · 4c1f32d2
      Marc Kleine-Budde authored
      [ Upstream commit c33a39c5 ]
      
      This patch fixes the calculation of the nlmsg size, by adding the missing
      nla_total_size().
      
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4c1f32d2
    • Paul Durrant's avatar
      xen-netback: Don't destroy the netdev until the vif is shut down · f495ddc4
      Paul Durrant authored
      [ upstream commit id: 279f438e ]
      
      Without this patch, if a frontend cycles through states Closing
      and Closed (which Windows frontends need to do) then the netdev
      will be destroyed and requires re-invocation of hotplug scripts
      to restore state before the frontend can move to Connected. Thus
      when udev is not in use the backend gets stuck in InitWait.
      
      With this patch, the netdev is left alone whilst the backend is
      still online and is only de-registered and freed just prior to
      destroying the vif (which is also nicely symmetrical with the
      netdev allocation and registration being done during probe) so
      no re-invocation of hotplug scripts is required.
      Signed-off-by: default avatarPaul Durrant <paul.durrant@citrix.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Wei Liu <wei.liu2@citrix.com>
      Cc: Ian Campbell <ian.campbell@citrix.com>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f495ddc4
    • Fabio Estevam's avatar
      net: secure_seq: Fix warning when CONFIG_IPV6 and CONFIG_INET are not selected · b9396c4c
      Fabio Estevam authored
      [ Upstream commit cb03db9d ]
      
      net_secret() is only used when CONFIG_IPV6 or CONFIG_INET are selected.
      
      Building a defconfig with both of these symbols unselected (Using the ARM
      at91sam9rl_defconfig, for example) leads to the following build warning:
      
      $ make at91sam9rl_defconfig
      #
      # configuration written to .config
      #
      
      $ make net/core/secure_seq.o
      scripts/kconfig/conf --silentoldconfig Kconfig
        CHK     include/config/kernel.release
        CHK     include/generated/uapi/linux/version.h
        CHK     include/generated/utsrelease.h
      make[1]: `include/generated/mach-types.h' is up to date.
        CALL    scripts/checksyscalls.sh
        CC      net/core/secure_seq.o
      net/core/secure_seq.c:17:13: warning: 'net_secret_init' defined but not used [-Wunused-function]
      
      Fix this warning by protecting the definition of net_secret() with these
      symbols.
      Reported-by: default avatarOlof Johansson <olof@lixom.net>
      Signed-off-by: default avatarFabio Estevam <fabio.estevam@freescale.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b9396c4c
    • Marc Kleine-Budde's avatar
      can: dev: fix nlmsg size calculation in can_get_size() · 42fc155b
      Marc Kleine-Budde authored
      [ Upstream commit fe119a05 ]
      
      This patch fixes the calculation of the nlmsg size, by adding the missing
      nla_total_size().
      Signed-off-by: default avatarMarc Kleine-Budde <mkl@pengutronix.de>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      42fc155b
    • Jiri Benc's avatar
      ipv4: fix ineffective source address selection · b15e22da
      Jiri Benc authored
      [ Upstream commit 0a7e2260 ]
      
      When sending out multicast messages, the source address in inet->mc_addr is
      ignored and rewritten by an autoselected one. This is caused by a typo in
      commit 813b3b5d ("ipv4: Use caller's on-stack flowi as-is in output
      route lookups").
      Signed-off-by: default avatarJiri Benc <jbenc@redhat.com>
      Acked-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b15e22da
    • Mathias Krause's avatar
      proc connector: fix info leaks · df6ae0dc
      Mathias Krause authored
      [ Upstream commit e727ca82 ]
      
      Initialize event_data for all possible message types to prevent leaking
      kernel stack contents to userland (up to 20 bytes). Also set the flags
      member of the connector message to 0 to prevent leaking two more stack
      bytes this way.
      Signed-off-by: default avatarMathias Krause <minipli@googlemail.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      df6ae0dc
    • Dan Carpenter's avatar
      net: heap overflow in __audit_sockaddr() · 2e8d97ab
      Dan Carpenter authored
      [ Upstream commit 1661bf36 ]
      
      We need to cap ->msg_namelen or it leads to a buffer overflow when we
      to the memcpy() in __audit_sockaddr().  It requires CAP_AUDIT_CONTROL to
      exploit this bug.
      
      The call tree is:
      ___sys_recvmsg()
        move_addr_to_user()
          audit_sockaddr()
            __audit_sockaddr()
      Reported-by: default avatarJüri Aedla <juri.aedla@gmail.com>
      Signed-off-by: default avatarDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2e8d97ab
    • Sebastian Hesselbarth's avatar
      net: mv643xx_eth: fix orphaned statistics timer crash · b24b4a82
      Sebastian Hesselbarth authored
      [ Upstream commit f564412c ]
      
      The periodic statistics timer gets started at port _probe() time, but
      is stopped on _stop() only. In a modular environment, this can cause
      the timer to access already deallocated memory, if the module is unloaded
      without starting the eth device. To fix this, we add the timer right
      before the port is started, instead of at _probe() time.
      Signed-off-by: default avatarSebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Acked-by: default avatarJason Cooper <jason@lakedaemon.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b24b4a82
    • Sebastian Hesselbarth's avatar
      net: mv643xx_eth: update statistics timer from timer context only · 85cd0213
      Sebastian Hesselbarth authored
      [ Upstream commit 041b4ddb ]
      
      Each port driver installs a periodic timer to update port statistics
      by calling mib_counters_update. As mib_counters_update is also called
      from non-timer context, we should not reschedule the timer there but
      rather move it to timer-only context.
      Signed-off-by: default avatarSebastian Hesselbarth <sebastian.hesselbarth@gmail.com>
      Acked-by: default avatarJason Cooper <jason@lakedaemon.net>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      85cd0213
    • David S. Miller's avatar
      l2tp: Fix build warning with ipv6 disabled. · d980ed62
      David S. Miller authored
      [ Upstream commit 8d8a51e2 ]
      
      net/l2tp/l2tp_core.c: In function ‘l2tp_verify_udp_checksum’:
      net/l2tp/l2tp_core.c:499:22: warning: unused variable ‘tunnel’ [-Wunused-variable]
      
      Create a helper "l2tp_tunnel()" to facilitate this, and as a side
      effect get rid of a bunch of unnecessary void pointer casts.
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d980ed62
    • François CACHEREUL's avatar
      l2tp: fix kernel panic when using IPv4-mapped IPv6 addresses · bd83cd77
      François CACHEREUL authored
      [ Upstream commit e18503f4 ]
      
      IPv4 mapped addresses cause kernel panic.
      The patch juste check whether the IPv6 address is an IPv4 mapped
      address. If so, use IPv4 API instead of IPv6.
      
      [  940.026915] general protection fault: 0000 [#1]
      [  940.026915] Modules linked in: l2tp_ppp l2tp_netlink l2tp_core pppox ppp_generic slhc loop psmouse
      [  940.026915] CPU: 0 PID: 3184 Comm: memcheck-amd64- Not tainted 3.11.0+ #1
      [  940.026915] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
      [  940.026915] task: ffff880007130e20 ti: ffff88000737e000 task.ti: ffff88000737e000
      [  940.026915] RIP: 0010:[<ffffffff81333780>]  [<ffffffff81333780>] ip6_xmit+0x276/0x326
      [  940.026915] RSP: 0018:ffff88000737fd28  EFLAGS: 00010286
      [  940.026915] RAX: c748521a75ceff48 RBX: ffff880000c30800 RCX: 0000000000000000
      [  940.026915] RDX: ffff88000075cc4e RSI: 0000000000000028 RDI: ffff8800060e5a40
      [  940.026915] RBP: ffff8800060e5a40 R08: 0000000000000000 R09: ffff88000075cc90
      [  940.026915] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88000737fda0
      [  940.026915] R13: 0000000000000000 R14: 0000000000002000 R15: ffff880005d3b580
      [  940.026915] FS:  00007f163dc5e800(0000) GS:ffffffff81623000(0000) knlGS:0000000000000000
      [  940.026915] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  940.026915] CR2: 00000004032dc940 CR3: 0000000005c25000 CR4: 00000000000006f0
      [  940.026915] Stack:
      [  940.026915]  ffff88000075cc4e ffffffff81694e90 ffff880000c30b38 0000000000000020
      [  940.026915]  11000000523c4bac ffff88000737fdb4 0000000000000000 ffff880000c30800
      [  940.026915]  ffff880005d3b580 ffff880000c30b38 ffff8800060e5a40 0000000000000020
      [  940.026915] Call Trace:
      [  940.026915]  [<ffffffff81356cc3>] ? inet6_csk_xmit+0xa4/0xc4
      [  940.026915]  [<ffffffffa0038535>] ? l2tp_xmit_skb+0x503/0x55a [l2tp_core]
      [  940.026915]  [<ffffffff812b8d3b>] ? pskb_expand_head+0x161/0x214
      [  940.026915]  [<ffffffffa003e91d>] ? pppol2tp_xmit+0xf2/0x143 [l2tp_ppp]
      [  940.026915]  [<ffffffffa00292e0>] ? ppp_channel_push+0x36/0x8b [ppp_generic]
      [  940.026915]  [<ffffffffa00293fe>] ? ppp_write+0xaf/0xc5 [ppp_generic]
      [  940.026915]  [<ffffffff8110ead4>] ? vfs_write+0xa2/0x106
      [  940.026915]  [<ffffffff8110edd6>] ? SyS_write+0x56/0x8a
      [  940.026915]  [<ffffffff81378ac0>] ? system_call_fastpath+0x16/0x1b
      [  940.026915] Code: 00 49 8b 8f d8 00 00 00 66 83 7c 11 02 00 74 60 49
      8b 47 58 48 83 e0 fe 48 8b 80 18 01 00 00 48 85 c0 74 13 48 8b 80 78 02
      00 00 <48> ff 40 28 41 8b 57 68 48 01 50 30 48 8b 54 24 08 49 c7 c1 51
      [  940.026915] RIP  [<ffffffff81333780>] ip6_xmit+0x276/0x326
      [  940.026915]  RSP <ffff88000737fd28>
      [  940.057945] ---[ end trace be8aba9a61c8b7f3 ]---
      [  940.058583] Kernel panic - not syncing: Fatal exception in interrupt
      Signed-off-by: default avatarFrançois CACHEREUL <f.cachereul@alphalink.fr>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bd83cd77
    • Eric Dumazet's avatar
      net: do not call sock_put() on TIMEWAIT sockets · 53d384ae
      Eric Dumazet authored
      [ Upstream commit 80ad1d61 ]
      
      commit 3ab5aee7 ("net: Convert TCP & DCCP hash tables to use RCU /
      hlist_nulls") incorrectly used sock_put() on TIMEWAIT sockets.
      
      We should instead use inet_twsk_put()
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53d384ae
    • Yuchung Cheng's avatar
      tcp: fix incorrect ca_state in tail loss probe · d8ac0f17
      Yuchung Cheng authored
      [ Upstream commit 031afe49 ]
      
      On receiving an ACK that covers the loss probe sequence, TLP
      immediately sets the congestion state to Open, even though some packets
      are not recovered and retransmisssion are on the way.  The later ACks
      may trigger a WARN_ON check in step D of tcp_fastretrans_alert(), e.g.,
      https://bugzilla.redhat.com/show_bug.cgi?id=989251
      
      The fix is to follow the similar procedure in recovery by calling
      tcp_try_keep_open(). The sender switches to Open state if no packets
      are retransmissted. Otherwise it goes to Disorder and let subsequent
      ACKs move the state to Recovery or Open.
      Reported-By: default avatarMichael Sterrett <michael@sterretts.net>
      Tested-By: default avatarDormando <dormando@rydia.net>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d8ac0f17
    • Eric Dumazet's avatar
      tcp: do not forget FIN in tcp_shifted_skb() · a50b0399
      Eric Dumazet authored
      [ Upstream commit 5e8a402f ]
      
      Yuchung found following problem :
      
       There are bugs in the SACK processing code, merging part in
       tcp_shift_skb_data(), that incorrectly resets or ignores the sacked
       skbs FIN flag. When a receiver first SACK the FIN sequence, and later
       throw away ofo queue (e.g., sack-reneging), the sender will stop
       retransmitting the FIN flag, and hangs forever.
      
      Following packetdrill test can be used to reproduce the bug.
      
      $ cat sack-merge-bug.pkt
      `sysctl -q net.ipv4.tcp_fack=0`
      
      // Establish a connection and send 10 MSS.
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +.000 bind(3, ..., ...) = 0
      +.000 listen(3, 1) = 0
      
      +.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.001 < . 1:1(0) ack 1 win 1024
      +.000 accept(3, ..., ...) = 4
      
      +.100 write(4, ..., 12000) = 12000
      +.000 shutdown(4, SHUT_WR) = 0
      +.000 > . 1:10001(10000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257
      +.000 > FP. 10001:12001(2000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
      // SACK reneg
      +.050 < . 1:1(0) ack 12001 win 257
      +0 %{ print "unacked: ",tcpi_unacked }%
      +5 %{ print "" }%
      
      First, a typo inverted left/right of one OR operation, then
      code forgot to advance end_seq if the merged skb carried FIN.
      
      Bug was added in 2.6.29 by commit 832d11c5
      ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: default avatarIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a50b0399
    • Eric Dumazet's avatar
      tcp: must unclone packets before mangling them · b81908e1
      Eric Dumazet authored
      [ Upstream commit c52e2421 ]
      
      TCP stack should make sure it owns skbs before mangling them.
      
      We had various crashes using bnx2x, and it turned out gso_size
      was cleared right before bnx2x driver was populating TC descriptor
      of the _previous_ packet send. TCP stack can sometime retransmit
      packets that are still in Qdisc.
      
      Of course we could make bnx2x driver more robust (using
      ACCESS_ONCE(shinfo->gso_size) for example), but the bug is TCP stack.
      
      We have identified two points where skb_unclone() was needed.
      
      This patch adds a WARN_ON_ONCE() to warn us if we missed another
      fix of this kind.
      
      Kudos to Neal for finding the root cause of this bug. Its visible
      using small MSS.
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Signed-off-by: default avatarNeal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b81908e1
    • Eric Dumazet's avatar
      tcp: TSQ can use a dynamic limit · 0ae5f47e
      Eric Dumazet authored
      [ Upstream commit c9eeec26 ]
      
      When TCP Small Queues was added, we used a sysctl to limit amount of
      packets queues on Qdisc/device queues for a given TCP flow.
      
      Problem is this limit is either too big for low rates, or too small
      for high rates.
      
      Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
      auto sizing, it can better control number of packets in Qdisc/device
      queues.
      
      New limit is two packets or at least 1 to 2 ms worth of packets.
      
      Low rates flows benefit from this patch by having even smaller
      number of packets in queues, allowing for faster recovery,
      better RTT estimations.
      
      High rates flows benefit from this patch by allowing more than 2 packets
      in flight as we had reports this was a limiting factor to reach line
      rate. [ In particular if TX completion is delayed because of coalescing
      parameters ]
      
      Example for a single flow on 10Gbp link controlled by FQ/pacing
      
      14 packets in flight instead of 2
      
      $ tc -s -d qd
      qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
      buckets 1024 quantum 3028 initial_quantum 15140
       Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
      requeues 6822476)
       rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
        2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
        2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
      
      Note that sk_pacing_rate is currently set to twice the actual rate, but
      this might be refined in the future when a flow is in congestion
      avoidance.
      
      Additional change : skb->destructor should be set to tcp_wfree().
      
      A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Wei Liu <wei.liu2@citrix.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0ae5f47e
    • Eric Dumazet's avatar
      tcp: TSO packets automatic sizing · 5e25ba50
      Eric Dumazet authored
      [ Upstream commits 6d36824e730f247b602c90e8715a792003e3c5a7,
        02cf4ebd, and parts of
        7eec4174 ]
      
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses an heuristic
      relying on upcoming ACKS instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of last two segments on
      initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs
      Signed-off-by: default avatarEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: default avatarYuchung Cheng <ycheng@google.com>
      Acked-by: default avatarNeal Cardwell <ncardwell@google.com>
      Signed-off-by: default avatarDavid S. Miller <davem@davemloft.net>
      Signed-off-by: default avatarGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5e25ba50
  2. 18 Oct, 2013 14 commits