1. 21 May, 2015 38 commits
    • Eric Dumazet's avatar
      inet_hashinfo: remove bsocket counter · f5af1f57
      Eric Dumazet authored
      We no longer need bsocket atomic counter, as inet_csk_get_port()
      calls bind_conflict() regardless of its value, after commit
      2b05ad33 ("tcp: bind() fix autoselection to share ports").
      
      This patch removes overhead of maintaining this counter and
      double inet_csk_get_port() calls under pressure.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
      Cc: Flavio Leitner <fbl@redhat.com>
      Acked-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f5af1f57
    • Jason Baron's avatar
      tcp: ensure epoll edge trigger wakeup when write queue is empty · ce5ec440
      Jason Baron authored
      We currently rely on the setting of SOCK_NOSPACE in the write()
      path to ensure that we wake up any epoll edge trigger waiters when
      acks return to free space in the write queue. However, if we fail
      to allocate even a single skb in the write queue, we could end up
      waiting indefinitely.
      
      Fix this by explicitly issuing a wakeup when we detect the condition
      of an empty write queue and a return value of -EAGAIN. This allows
      userspace to re-try as we expect this to be a temporary failure.
      
      I've tested this approach by artificially making
      sk_stream_alloc_skb() return NULL periodically. In that case,
      epoll edge trigger waiters will hang indefinitely in epoll_wait()
      without this patch.
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ce5ec440
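The commit above changes the kernel side, but the userspace pattern that depends on it can be sketched as follows. This is a minimal illustration, assuming a non-blocking stream socket already registered with an epoll instance for EPOLLOUT | EPOLLET; the helper names set_nonblocking() and send_all_et() are hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Put fd into non-blocking mode; returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Hypothetical helper: send len bytes on a non-blocking socket that the
 * caller has registered with epfd for EPOLLOUT | EPOLLET.
 *
 * On EAGAIN the sender sleeps in epoll_wait() and relies on the kernel to
 * deliver an edge-triggered wakeup once a retry can make progress. If that
 * wakeup never arrives, this loop blocks forever, which is the hang the
 * commit message describes.
 */
int send_all_et(int epfd, int fd, const char *buf, size_t len)
{
    size_t off = 0;

    assert(buf != NULL || len == 0);

    while (off < len) {
        ssize_t n = send(fd, buf + off, len - off, MSG_NOSIGNAL);

        if (n > 0) {
            off += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            struct epoll_event ev;

            /* Block until the kernel signals that a retry is worthwhile. */
            if (epoll_wait(epfd, &ev, 1, -1) < 0 && errno != EINTR)
                return -1;
            continue;
        }
        return -1; /* hard error */
    }
    return 0;
}
```

A caller would create the epoll instance, add the socket with EPOLLOUT | EPOLLET, and then call send_all_et() for each message.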
    • David S. Miller's avatar
      Merge branch 'cxgb4-next' · b92d5814
      David S. Miller authored
      Hariprasad Shenai says:
      
      ====================
      cxgb4: Cleanup and update T4/T5 register ranges
      
      This series cleans up and optimizes the setup_memwin function and also updates
      T4/T5 adapter register ranges by removing incorrect register addresses.
      
      This patch series has been created against net-next tree and includes
      patches on cxgb4 driver.
      
      We have included all the maintainers of respective drivers. Kindly review
      the change and let us know in case of any review comments.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b92d5814
    • Hariprasad Shenai's avatar
      cxgb4: Update T4/T5 adapter register ranges · 9f5ac48d
      Hariprasad Shenai authored
      Remove some T4/T5 registers that were included incorrectly.
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f5ac48d
    • David S. Miller's avatar
      Merge branch 'sfc-next' · 4e7b3be4
      David S. Miller authored
      Shradha Shah says:
      
      ====================
      sfc: Get/Set MAC address and ndo_[set/get]_vf_* entrypoint functions
      
      This is the second installment of patches towards supporting EF10 SRIOV.
      
      This patch series implements the ndo_get_vf_config, ndo_set_vf_mac,
      ndo_set_vf_vlan and ndo_set_vf_spoofcheck function callbacks for EF10.
      
      This patch series also introduces privileges for the MCDI commands
      based on which functions are allowed to call them, i.e. Link control
      or primary function.
      
      The patch series has been tested with and without CONFIG_SFC_SRIOV.
      
      The ndo function callbacks are tested using ip link.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e7b3be4
    • Shradha Shah's avatar
      sfc: set the MAC address using MC_CMD_VADAPTOR_SET_MAC · 910c8789
      Shradha Shah authored
      Add a set_mac_address() NIC-type function for EF10 only, and
      use this to set the MAC address on the vadaptor. For Siena and
      earlier, the MAC address continues to be set by MC_CMD_SET_MAC;
      this is still called on EF10, and including a MAC address in
      this command has no effect.
      
      The sriov_mac_address_changed() NIC-type function is no longer
      needed on EF10, but it is needed for Siena where it is used to
      update the peer address of the PF for VFDI.  Change this to use
      the new set_mac_address function pointer.
      
      efx_ef10_sriov_mac_address_changed() is no longer called, as VFs
      will try to change the MAC address on their vadaptor rather than
      trying to change to the context of the PF to alter the vport.
      
      When a VF is running in direct passthrough mode with MAC spoofing
      enabled, it will be able to change the MAC address on its vadaptor.
      In this case, there is a link to the PF, so find the correct VF in
      its ef10_vf array and update the MAC address.
      
      ndo_set_mac_address() can be called during driver unload while
      bonding, and in this case the device has already been stopped, so
      don't call efx_net_open() to restart it after reconfiguration.
      
      efx->port_enabled is set to false in efx_stop_port(), so it is an
      indicator of whether the device needs to be restarted.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      910c8789
    • Edward Cree's avatar
      sfc: add ndo_set_vf_link_state() function for EF10 · 4392dc69
      Edward Cree authored
      Exercised with
      "ip link set <PF intf> vf <vf_i> state {auto|enable|disable}"
      Sets the reporting policy for VF link state to either
       - mirror physical link state
       - always up
       - always down
      
      get VF link state mode in efx_ef10_sriov_get_vf_config
      
      Exercised by
      "ip link show <PF intf>";
      output will include a line like
      vf 0 MAC 12:34:56:78:9a:bc, link-state auto
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4392dc69
    • Shradha Shah's avatar
      sfc: add ndo_set_vf_vlan() function for EF10 · 2d432f20
      Shradha Shah authored
      The max vlan tags that can be offloaded is 2, including any upstream VLAN
      aggregator. Currently there is no way for the net driver to know whether
      the upstream vswitch (if any) is using vlan tags, so there is no way to
      know how many tags we can request.
      Along with the implementation for the ndo_set_vf_vlan callback, this patch
      also adds 2 VLAN tags for the driver created VEB switch if possible, that
      way it is possible to offload as many tags as are allowed.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2d432f20
    • Jon Cooper's avatar
      sfc: Change entity reset on MC reboot to a new datapath-only reset. · 087e9025
      Jon Cooper authored
      Currently we do an entity reset when we detect an MC reboot.
      This messes up SRIOV because it leaves VFs orphaned. The extra
      reset is rather redundant anyway, since the MC reboot will have
      basically reset everything.
      
      This change replaces the entity reset after MC reboot with a
      simpler datapath reset that reallocates resources but doesn't
      perform the entity reset.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      087e9025
    • Shradha Shah's avatar
      sfc: Add ndo_get_vf_config() function for EF10 · b9af9049
      Shradha Shah authored
      rtnetlink calls ndo_get_vf_config when compiling information
      about a network interface, so that the VFs associated with a PF
      can be listed (e.g. ip link show).
      Implement a response to this entry point and return the PF-set MAC
      address for the VF in ndo_get_vf_config.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9af9049
    • Shradha Shah's avatar
      sfc: add ndo_set_vf_mac() function for EF10 · e340be92
      Shradha Shah authored
      Implement a response to this entrypoint.
      The ndo_set_vf_mac() entrypoint is only exposed in the driver if
      CONFIG_SFC_SRIOV is defined.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e340be92
    • Jon Cooper's avatar
      sfc: Initialise MCDI buffers to 0 on declaration. · aa09a3da
      Jon Cooper authored
      In order to avoid MC bugs the flags field needs to be set to 0.
      Instead of explicitly clearing out the flags individually, a
      better way to do this is to memset the MCDI_BUF to 0.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa09a3da
    • Daniel Pieczko's avatar
      sfc: Enable a VF to get its own MAC address · 0d5e0fbb
      Daniel Pieczko authored
      A VF's MAC address is set by its parent PF and added to its vport.
      To get this MAC address, the VF must use MC_CMD_VPORT_GET_MAC_ADDRESSES.
      In the current scheme, a VF's vport should only have one MAC address,
      so warn if this is not the case.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d5e0fbb
    • Edward Cree's avatar
      sfc: protect filter table against use-after-free · 0d322413
      Edward Cree authored
      If MCDI timeouts are encountered during efx_ef10_filter_table_remove(),
      an FLR will be queued, but efx->filter_state will still be kfree()d.
      The queued FLR will then call efx_ef10_filter_table_restore(), which
      will try to use efx->filter_state. This previously caused a panic.
      This patch adds an rwsem to protect the existence of efx->filter_state,
      separately from the spinlock protecting its contents.  Users which can
      race against efx_ef10_filter_table_remove() should down_read this rwsem.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0d322413
    • Shradha Shah's avatar
      sfc: Store the efx_nic struct of the current VF in the VF data struct · f1122a34
      Shradha Shah authored
      Initialised in efx_probe_vf and removal is dealt with in
      efx_ef10_remove.
      
      vf->efx is needed in future patches to change the MAC address
      of the VF via the parent PF, while the driver is bound to the
      VF.
      Example: ip link set dev vf NUM mac LLADDR
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1122a34
    • Shradha Shah's avatar
      sfc: save old MAC address in case sriov_mac_address_changed fails · cfc77c2f
      Shradha Shah authored
      Otherwise the PF and VF can disagree on the VF's MAC address and
      this leads to strange behaviour, up to and including kernel panics.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfc77c2f
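The save-and-restore idea in the commit above can be sketched in plain C. This is only an illustration under stated assumptions: struct demo_dev, demo_set_mac() and the apply callback are hypothetical stand-ins, not the sfc driver's types or functions.

```c
#include <assert.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* Hypothetical device state for illustration; not the sfc driver's struct. */
struct demo_dev {
    unsigned char mac[DEMO_ETH_ALEN];
};

/* Stand-in for a hardware or firmware update step that can fail. */
typedef int (*demo_apply_fn)(const struct demo_dev *dev);

/*
 * Store the new address, try to apply it, and restore the previous
 * address if the apply step fails, so that software state never
 * disagrees with what the other side believes.
 */
int demo_set_mac(struct demo_dev *dev, const unsigned char *new_mac,
                 demo_apply_fn apply)
{
    unsigned char old_mac[DEMO_ETH_ALEN];
    int rc;

    assert(dev != NULL && new_mac != NULL && apply != NULL);

    memcpy(old_mac, dev->mac, DEMO_ETH_ALEN);   /* save before changing */
    memcpy(dev->mac, new_mac, DEMO_ETH_ALEN);

    rc = apply(dev);
    if (rc != 0)
        memcpy(dev->mac, old_mac, DEMO_ETH_ALEN); /* roll back on failure */
    return rc;
}

/* Two trivial apply steps used to exercise both paths. */
int demo_apply_ok(const struct demo_dev *dev)   { (void)dev; return 0; }
int demo_apply_fail(const struct demo_dev *dev) { (void)dev; return -5; }
```

With demo_apply_fail the device keeps its old address; with demo_apply_ok the new address sticks.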
    • Shradha Shah's avatar
      sfc: Store vf_index in nic_data for EF10. · 88a37de6
      Shradha Shah authored
      Added function efx_ef10_get_vf_index to store the vf_index
      in nic_data during probe
      
      vf_index is needed in future patches to access a particular
      VF in the VF data structure.
      
      Moved efx_ef10_probe_pf and efx_ef10_probe_vf in order to
      use efx_ef10_remove
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      88a37de6
    • Shradha Shah's avatar
      sfc: MC_CMD_SET_MAC can only be called by the link control Function · 862f894c
      Shradha Shah authored
      MC_CMD_SET_MAC is privileged and can only be called by the link
      control function.
      
      This patch adds efx_ef10_mac_reconfigure_vf which avoids the call
      to MC_CMD_SET_MAC by the Virtual function
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      862f894c
    • Shradha Shah's avatar
      af6a074d
    • Shradha Shah's avatar
      sfc: Add permissions to MCDI commands · 75122ec8
      Shradha Shah authored
      There is one primary function per adaptor, one link control function
      per port and the rest are categorised as general.
      
      This patch adds privileges to the MCDI commands based on which
      functions are allowed to call them.
      Signed-off-by: Shradha Shah <sshah@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75122ec8
    • Vineet Gupta's avatar
      stmmac: replace open coded __netdev_alloc_skb_ip_align() with actual call · 4ec49a37
      Vineet Gupta authored
      This also matches with the sibling call netdev_alloc_skb_ip_align() made in
      rx fast path.
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ec49a37
    • Joe Perches's avatar
      qlge: Move jiffies_to_usecs immediately before loop · 3f6e785f
      Joe Perches authored
      30 usecs (or really, 1 jiffy) can go by pretty fast.
      
      Move the set of the timeout immediately before the loop.
      
      Remove the unnecessary max(1ul, usecs_to_jiffies(30)) as
      usecs_to_jiffies with a non-zero constant is guaranteed
      to be non-zero.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f6e785f
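The two points in the commit above (a non-zero duration rounds up to at least one tick, and the deadline is taken right before the loop) can be sketched in userspace C. This is a simplified model, not the kernel's usecs_to_jiffies() or the qlge code; every demo_ name, the fake tick source and the hz parameter are assumptions made for illustration.

```c
#include <assert.h>

/*
 * Convert microseconds to timer ticks at hz ticks per second, rounding
 * up so that any non-zero duration yields at least one tick.
 */
unsigned long demo_usecs_to_ticks(unsigned long usecs, unsigned long hz)
{
    unsigned long long scaled = (unsigned long long)usecs * hz;

    return (unsigned long)((scaled + 999999ULL) / 1000000ULL);
}

/* Fake tick source: advances by one tick on every read. */
unsigned long demo_tick;
unsigned long demo_now(void) { return demo_tick++; }

/* Two trivial conditions used to exercise both outcomes. */
int demo_always_ready(void) { return 1; }
int demo_never_ready(void)  { return 0; }

/*
 * Poll ready() until it succeeds or the deadline passes. The deadline is
 * computed immediately before the loop, so earlier setup work done by the
 * caller cannot eat into the timeout budget.
 */
int demo_poll_until(int (*ready)(void), unsigned long (*now)(void),
                    unsigned long timeout_ticks)
{
    unsigned long deadline;

    assert(ready != 0 && now != 0);

    deadline = now() + timeout_ticks;
    do {
        if (ready())
            return 0;
    } while ((long)(now() - deadline) < 0); /* wrap-safe "before" test */

    return -1; /* timed out */
}
```

The signed subtraction in the loop condition keeps the comparison correct even if the tick counter wraps.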
    • David S. Miller's avatar
      Merge branch 'rocker-transaction-fixes' · 4ac2dc89
      David S. Miller authored
      Simon Horman says:
      
      ====================
      rocker: transaction fixes
      
      This series addresses what appear to be errors in the handling of
      prepare and then commit transactions in the rocker driver.
      
      In all cases the problem is that data structures visible outside of
      the transaction are modified during the prepare phase.
      
      In the case of the first two patches this results in the kernel reporting a
      BUG. I have noted test-cases in the change logs.
      
      The third patch is also a bug fix, as noted by Toshiaki Makita,
      however I have not been able to reliably reproduce the problem and
      thus have not provided a test case.
      
      The last patch is a correctness fix that does not fix a bug
      that manifests as far as I can tell.
      
      Changes: v3->v4
      * All patches
        - Add Jiri Pirko's ack
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Setting of entry values in all transaction phases
          as suggested by Toshiaki Makita
      * "rocker: make rocker_port_internal_vlan_id_{get,put}() non-transactional"
        - Remove Fixes tag as I believe this is a correctness rather than a bug fix
      
      Changes: v2->v3
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Correct inverted logic
        - Added ack from Scott Feldman
      
      Changes: v1->v2
      * "rocker: do not make neighbour entry changes when preparing transactions"
        - Revised changelog to reflect information from Toshiaki Makita
          that there is a bug that can manifest
        - Update address and ttl regardless of the value of the transaction state
      * All other patches
        - Added acks from Scott Feldman
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4ac2dc89
    • Simon Horman's avatar
      rocker: make rocker_port_internal_vlan_id_{get, put}() non-transactional · df6a2067
      Simon Horman authored
      The motivation for this is that rocker_port_internal_vlan_id_{get,put} appear
      to only partially implement the transaction model: memory allocation
      and freeing is transactional, but hash and bitmap manipulation is not.
      
      The latter could be fixed, however, as it is not currently exercised
      due to trans always being SWITCHDEV_TRANS_NONE it seems cleaner
      to make rocker_port_internal_vlan_id_get non-transactional.
      
      This problem was introduced by c4f20321 ("rocker: support
      prepare-commit transaction model").
      
      Found by inspection.
      I do not believe that this change should have any run-time effect.
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df6a2067
    • Simon Horman's avatar
      rocker: do not make neighbour entry changes when preparing transactions · 550ecc92
      Simon Horman authored
      rocker_port_ipv4_nh() and in turn rocker_port_ipv4_neigh() may be
      called with trans == SWITCHDEV_TRANS_PREPARE and then
      trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_obj_set() via
      fib_table_insert().
      
      The first time that rocker_port_ipv4_nh() is called, with
      trans == SWITCHDEV_TRANS_PREPARE, _rocker_neigh_add() adds a new entry to
      the neigh table.
      
      And the second time rocker_port_ipv4_nh() is called, with
      trans == SWITCHDEV_TRANS_COMMIT, that entry is found. This causes
      rocker_port_ipv4_nh() to believe it is not adding an entry and thus it
      frees "entry", which is still present in rocker driver's neigh table.
      
      This problem does not appear to affect deletion as my analysis is that
      deletion is always performed with trans == SWITCHDEV_TRANS_NONE.
      
      For completeness _rocker_neigh_{add,del,prepare} are updated not to
      manipulate fib table entries if trans == SWITCHDEV_TRANS_PREPARE.
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Reported-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      550ecc92
    • Simon Horman's avatar
      rocker: do not modify fdb table in rocker_port_fdb() when preparing transactions · 42e94889
      Simon Horman authored
      rocker_port_fdb_flush() may be called with
      trans == SWITCHDEV_TRANS_PREPARE and then trans == SWITCHDEV_TRANS_COMMIT from
      switchdev_port_attr_set() via switchdev_port_obj_add().
      
      Adding the new entry to the FDB table when trans == SWITCHDEV_TRANS_PREPARE
      may result in a memory leak because when trans == SWITCHDEV_TRANS_PREPARE
      rocker_flow_tbl_bridge() will allocate memory when called via
      rocker_port_fdb_learn(). However, when trans == SWITCHDEV_TRANS_COMMIT
      the presence of the FDB entry in the FDB table causes
      rocker_port_fdb() to set the ROCKER_OP_FLAG_REFRESH flag which results
      in rocker_port_fdb_learn() skipping the call to rocker_flow_tbl_bridge()
      which would free the memory allocated by it when
      trans == SWITCHDEV_TRANS_PREPARE.
      
      ip link add br0 type bridge
      ip link set up dev eth0
      ip link set dev eth0 master br0
      bridge fdb add 52:54:00:12:35:08 dev eth0
      bridge fdb add 52:54:00:12:35:09 dev eth0
      [    2.600730] ------------[ cut here ]------------
      [    2.601002] kernel BUG at drivers/net/ethernet/rocker/rocker.c:4369!
      [    2.601373] invalid opcode: 0000 [#1] SMP
      [    2.601963] Modules linked in:
      [    2.602355] CPU: 0 PID: 64 Comm: bridge Not tainted 4.1.0-rc3-01048-g6d0f50c50211-dirty #1075
      [    2.602721] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org 04/01/2014
      [    2.602721] task: ffff880019facef0 ti: ffff88001f96c000 task.ti: ffff88001f96c000
      [    2.602721] RIP: 0010:[<ffffffff811f1470>]  [<ffffffff811f1470>] rocker_port_obj_add+0x150/0x160
      [    2.602721] RSP: 0018:ffff88001f96fa98  EFLAGS: 00000212
      [    2.602721] RAX: ffff880019d4fa68 RBX: ffff88001f96fb18 RCX: 0000000000000000
      [    2.602721] RDX: ffff880019d4f000 RSI: ffff88001f96fb18 RDI: ffff880019d4f000
      [    2.602721] RBP: 0000000000000001 R08: 0000000000000000 R09: ffff88001f904620
      [    2.602721] R10: ffff88001f96fb60 R11: ffff880019e9d100 R12: ffff88001f96fb18
      [    2.602721] R13: ffff880019d4f680 R14: ffff88001f904610 R15: ffff8800198f7b80
      [    2.602721] FS:  00007f3eee917700(0000) GS:ffff88001b000000(0000) knlGS:0000000000000000
      [    2.602721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    2.602721] CR2: 00007f3eee4a15cb CR3: 000000001f933000 CR4: 00000000000006b0
      [    2.602721] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    2.602721] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
      [    2.602721] Stack:
      [    2.602721]  0000000000000000 ffff88001f96fb18 ffff880019d4f000 ffff88001f96fb18
      [    2.602721]  ffff880019d4f000 ffffffff81332105 ffff88001f96fb50 ffffffff814464c0
      [    2.602721]  ffff88001f96fb18 ffff88001f904600 ffff880019d4f000 ffffffff813326e5
      [    2.602721] Call Trace:
      [    2.602721]  [<ffffffff81332105>] ? __switchdev_port_obj_add+0x25/0x90
      [    2.602721]  [<ffffffff813326e5>] ? switchdev_port_obj_add+0x25/0xc0
      [    2.602721]  [<ffffffff813327b1>] ? switchdev_port_fdb_add+0x31/0x40
      [    2.602721]  [<ffffffff8123911f>] ? rtnl_fdb_add+0xff/0x1e0
      [    2.602721]  [<ffffffff81237d8e>] ? rtnetlink_rcv_msg+0x7e/0x250
      [    2.602721]  [<ffffffff8121d1ce>] ? __skb_recv_datagram+0xfe/0x4b0
      [    2.602721]  [<ffffffff81237d10>] ? rtnetlink_rcv+0x30/0x30
      [    2.602721]  [<ffffffff81247958>] ? netlink_rcv_skb+0xa8/0xd0
      [    2.602721]  [<ffffffff81237cff>] ? rtnetlink_rcv+0x1f/0x30
      [    2.602721]  [<ffffffff81247220>] ? netlink_unicast+0x150/0x200
      [    2.602721]  [<ffffffff81247714>] ? netlink_sendmsg+0x374/0x3e0
      [    2.602721]  [<ffffffff8120f8df>] ? sock_sendmsg+0xf/0x30
      [    2.602721]  [<ffffffff8120ffd3>] ? ___sys_sendmsg+0x1f3/0x200
      [    2.602721]  [<ffffffff812100e5>] ? ___sys_recvmsg+0x105/0x140
      [    2.602721]  [<ffffffff810a36f0>] ? SyS_readahead+0x90/0x90
      [    2.602721]  [<ffffffff81098dfd>] ? filemap_map_pages+0x1ed/0x210
      [    2.602721]  [<ffffffff810b77fc>] ? handle_mm_fault+0x5fc/0xe50
      [    2.602721]  [<ffffffff81210ef9>] ? __sys_sendmsg+0x39/0x70
      [    2.602721]  [<ffffffff8133ce17>] ? system_call_fastpath+0x12/0x6a
      [    2.602721] Code: b7 8f a0 06 00 00 48 83 bf 88 06 00 00 00 74 1d 48 83 c4 08 89 ee 4c 89 ef 5b 5d 41 5c 41 5d 0f b7 c9 45 31 c0 e9 51 db ff ff 90 <0f> 0b b8 ea ff ff ff e9 cf fe ff ff 0f 1f 40 00 41 57 41 56 b9
      [    2.602721] RIP  [<ffffffff811f1470>] rocker_port_obj_add+0x150/0x160
      [    2.602721]  RSP <ffff88001f96fa98>
      [    2.615848] ---[ end trace 4f7b4f1c98077108 ]---
      
      The above is resolved by not adding the new FDB entry to the FDB table
      if trans == SWITCHDEV_TRANS_PREPARE.
      
      For symmetry this patch also skips deleting FDB entries from the FDB
      table if trans == SWITCHDEV_TRANS_PREPARE. However, my analysis is that
      this never occurs as trans is always SWITCHDEV_TRANS_NONE when removing
      FDB entries.
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      42e94889
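The rule behind this fix, that the prepare phase must not change state visible outside the transaction, can be sketched with a toy table. This is a minimal model under stated assumptions: the demo_ names, the fixed-size table and the integer keys are hypothetical and are not taken from the rocker driver.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for the three transaction states named in the commit. */
enum demo_trans { DEMO_TRANS_NONE, DEMO_TRANS_PREPARE, DEMO_TRANS_COMMIT };

#define DEMO_FDB_MAX 4

/* Hypothetical forwarding table: a small array of integer keys. */
struct demo_fdb {
    unsigned int keys[DEMO_FDB_MAX];
    size_t count;
};

/* Returns 1 if key is already in the table, 0 otherwise. */
int demo_fdb_has(const struct demo_fdb *t, unsigned int key)
{
    for (size_t i = 0; i < t->count; i++)
        if (t->keys[i] == key)
            return 1;
    return 0;
}

/*
 * Add key to the table. In the prepare phase the function only reports
 * whether a later commit would succeed and leaves the table untouched;
 * the insertion happens for COMMIT (or NONE, meaning no transaction).
 */
int demo_fdb_add(struct demo_fdb *t, enum demo_trans trans, unsigned int key)
{
    int present;

    assert(t != NULL);
    present = demo_fdb_has(t, key);

    if (!present && t->count == DEMO_FDB_MAX)
        return -ENOSPC;

    if (trans == DEMO_TRANS_PREPARE)
        return 0; /* validated only, nothing modified */

    if (!present)
        t->keys[t->count++] = key;
    return 0;
}
```

Because prepare leaves the table alone, the commit call sees the same state prepare validated, so it cannot mistake its own earlier side effect for an existing entry.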
    • Simon Horman's avatar
      rocker: do not delete fdb entries in rocker_port_fdb_flush() when preparing transactions · 3098ac39
      Simon Horman authored
      rocker_port_fdb_flush() is called by rocker_port_stp_update() which in
      turn may be called with trans == SWITCHDEV_TRANS_PREPARE and then
      trans == SWITCHDEV_TRANS_COMMIT from switchdev_port_attr_set() via
      br_set_state().
      
      When rocker_port_fdb_flush() is called with trans == SWITCHDEV_TRANS_PREPARE
      it calls rocker_port_fdb_learn() for each entry in the FDB table which in
      turn calls rocker_flow_tbl_bridge() which will allocate memory using
      rocker_port_kzalloc(). rocker_port_fdb_learn() will then remove the entry
      from the FDB table.
      
      Then when rocker_port_fdb_flush() is called with
      trans == SWITCHDEV_TRANS_COMMIT no calls are made to rocker_port_fdb_learn()
      because there are no longer any entries present in the FDB table. Thus the
      memory previously allocated by rocker_port_fdb_learn() is leaked resulting
      in the kernel BUG() below.
      
      Furthermore, it looks like the driver ends up with an incorrect view of the
      fdb table as the FDB entries are purged from the driver's table but not the
      hardware's table.
      
      ip link add br0 type bridge
      ip link set up dev eth0
      sleep 1
      ip link set dev eth0 master br0
      [    3.704360] ------------[ cut here ]------------
      [    3.704611] kernel BUG at drivers/net/ethernet/rocker/rocker.c:4289!
      [    3.704962] invalid opcode: 0000 [#1] SMP
      [    3.705537] Modules linked in:
      [    3.705919] CPU: 0 PID: 63 Comm: ip Not tainted 4.1.0-rc3-01046-gb9fbe709 #1044
      [    3.706191] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.0-0-g4c59f5d-20150219_092859-nilsson.home.kraxel.org 04/01/2014
      [    3.706820] task: ffff880019f70150 ti: ffff88001f92c000 task.ti: ffff88001f92c000
      [    3.707138] RIP: 0010:[<ffffffff811f0080>]  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
      [    3.707990] RSP: 0018:ffff88001f92f808  EFLAGS: 00000212
      [    3.708200] RAX: ffff880019d4fa68 RBX: ffff880019d4f000 RCX: 0000000000000000
      [    3.708471] RDX: 000000000000000c RSI: ffff88001f92f890 RDI: ffff880019d4f680
      [    3.708740] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000004
      [    3.708999] R10: ffff880000034024 R11: 0000000000000000 R12: ffff88001f92f890
      [    3.709276] R13: ffff88001f8f1c00 R14: 000000000000000b R15: 0000000000000000
      [    3.709303] FS:  00007f8ab66bd700(0000) GS:ffff88001b000000(0000) knlGS:0000000000000000
      [    3.709303] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [    3.709303] CR2: 0000000000654988 CR3: 000000001f8f3000 CR4: 00000000000006b0
      [    3.709303] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [    3.709303] DR3: 0000000000000000 DR6: 0000000000000000 DR7: 0000000000000000
      [    3.709303] Stack:
      [    3.709303]  ffff88001f8f1c00 000000000000000b ffff88001f92f890 ffff880019d4f000
      [    3.709303]  ffff88001f92f890 ffffffff813332f5 ffff88001f92f880 0000000000000000
      [    3.709303]  ffff88001f92f890 0000000000000001 ffff880019d4f000 ffffffff81333627
      [    3.709303] Call Trace:
      [    3.709303]  [<ffffffff813332f5>] ? __switchdev_port_attr_set+0x25/0x90
      [    3.709303]  [<ffffffff81333627>] ? switchdev_port_attr_set+0x27/0x120
      [    3.709303]  [<ffffffff81318e86>] ? br_set_state+0x36/0x50
      [    3.709303]  [<ffffffff8131795c>] ? br_add_if+0x37c/0x400
      [    3.709303]  [<ffffffff81238ce1>] ? do_setlink+0x7e1/0x800
      [    3.709303]  [<ffffffff8111f980>] ? radix_tree_lookup_slot+0x10/0x30
      [    3.709303]  [<ffffffff81136fba>] ? nla_parse+0xaa/0x110
      [    3.709303]  [<ffffffff81239c98>] ? rtnl_newlink+0x548/0x870
      [    3.709303]  [<ffffffff8111f900>] ? __radix_tree_lookup+0x40/0xb0
      [    3.709303]  [<ffffffff81136f3e>] ? nla_parse+0x2e/0x110
      [    3.709303]  [<ffffffff81237d7e>] ? rtnetlink_rcv_msg+0x7e/0x250
      [    3.709303]  [<ffffffff8121d1be>] ? __skb_recv_datagram+0xfe/0x4b0
      [    3.709303]  [<ffffffff81237d00>] ? rtnetlink_rcv+0x30/0x30
      [    3.709303]  [<ffffffff81247948>] ? netlink_rcv_skb+0xa8/0xd0
      [    3.709303]  [<ffffffff81237cef>] ? rtnetlink_rcv+0x1f/0x30
      [    3.709303]  [<ffffffff81247210>] ? netlink_unicast+0x150/0x200
      [    3.709303]  [<ffffffff81247704>] ? netlink_sendmsg+0x374/0x3e0
      [    3.709303]  [<ffffffff8120f8cf>] ? sock_sendmsg+0xf/0x30
      [    3.709303]  [<ffffffff8120ffc3>] ? ___sys_sendmsg+0x1f3/0x200
      [    3.709303]  [<ffffffff812100d5>] ? ___sys_recvmsg+0x105/0x140
      [    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
      [    3.709303]  [<ffffffff812228d9>] ? dev_get_by_name_rcu+0x69/0x90
      [    3.709303]  [<ffffffff81217b7d>] ? skb_dequeue+0x4d/0x60
      [    3.709303]  [<ffffffff81217bb0>] ? skb_queue_purge+0x20/0x30
      [    3.709303]  [<ffffffff810ebdcf>] ? __inode_wait_for_writeback+0x5f/0xb0
      [    3.709303]  [<ffffffff810648b0>] ? autoremove_wake_function+0x30/0x30
      [    3.709303]  [<ffffffff81210ee9>] ? __sys_sendmsg+0x39/0x70
      [    3.709303]  [<ffffffff8133e097>] ? system_call_fastpath+0x12/0x6a
      [    3.709303] Code: bb 90 06 00 00 48 c7 04 24 00 00 00 00 45 31 c9 45 31 c0 48 c7 c1 c0 b7 1e 81 89 ea e8 da da ff ff eb 95 0f 1f 84 00 00 00 00 00 <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 48 83 fe 15 75
      [    3.709303] RIP  [<ffffffff811f0080>] rocker_port_attr_set+0xe0/0xf0
      [    3.709303]  RSP <ffff88001f92f808>
      [    3.721409] ---[ end trace b7481fcb7cb032aa ]---
      Segmentation fault
      
      Fixes: c4f20321 ("rocker: support prepare-commit transaction model")
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Acked-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Simon Horman <simon.horman@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3098ac39
    • Joe Perches's avatar
      spider_net: Use DECLARE_BITMAP · e26cc7ff
      Joe Perches authored
      Use the generic mechanism to declare a bitmap instead of unsigned long.
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e26cc7ff
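For readers unfamiliar with the idiom, a userspace sketch of a bitmap declaration macro is shown below. It is modelled on the general shape of the kernel's DECLARE_BITMAP but is not copied from kernel headers; all DEMO_ and demo_ names are assumptions made for this illustration.

```c
#include <assert.h>
#include <limits.h>

/* Number of bits in one unsigned long on this platform. */
#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Number of unsigned longs needed to hold nbits bits, rounded up. */
#define DEMO_BITS_TO_LONGS(nbits) \
    (((nbits) + DEMO_BITS_PER_LONG - 1) / DEMO_BITS_PER_LONG)

/* Declare an array of unsigned long big enough for nbits bits. */
#define DEMO_DECLARE_BITMAP(name, nbits) \
    unsigned long name[DEMO_BITS_TO_LONGS(nbits)]

/* Set bit nr in the bitmap. */
void demo_set_bit(unsigned long *map, unsigned int nr)
{
    assert(map != 0);
    map[nr / DEMO_BITS_PER_LONG] |= 1UL << (nr % DEMO_BITS_PER_LONG);
}

/* Return 1 if bit nr is set, 0 otherwise. */
int demo_test_bit(const unsigned long *map, unsigned int nr)
{
    assert(map != 0);
    return (int)((map[nr / DEMO_BITS_PER_LONG] >> (nr % DEMO_BITS_PER_LONG)) & 1UL);
}
```

Declaring the storage through the macro sizes it from the bit count and makes it clear that the field is a bitmap, rather than leaving the reader to infer that from a bare unsigned long.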
    • David S. Miller's avatar
      Merge branch 'ebpf-tail-call' · 3f55b7ed
      David S. Miller authored
      Alexei Starovoitov says:
      
      ====================
      bpf: introduce bpf_tail_call() helper
      
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      an extra call frame.
      It's trivially done in the interpreter and a bit trickier in JITs.
      
      Use cases:
      - simplify complex programs
      - dispatch into other programs
        (for example: index in jump table can be syscall number or network protocol)
      - build dynamic chains of programs
      
      The chain of tail calls can form unpredictable dynamic loops, so
      tail_call_cnt is used to limit the number of calls; the limit is currently 32.
      
      patch 1 - support bpf_tail_call() in interpreter
      patch 2 - support in x64 JIT
      We've discussed what's necessary to support it in arm64/s390 JITs
      and it looks fine.
      patch 3 - sample example for tracing
      patch 4 - sample example for networking
      
      More details in every patch.
      
      This set went through several iterations of reviews/fixes and older
      attempts can be seen:
      https://git.kernel.org/cgit/linux/kernel/git/ast/bpf.git/log/?h=tail_call_v[123456]
      - tail_call_v1 does it without touching JITs but introduces overhead
        for all programs that don't use this helper function.
      - tail_call_v2 still has some overhead and x64 JIT does full stack
        unwind (prologue skipping optimization wasn't there)
      - tail_call_v3 reuses 'call' instruction encoding and has interpreter
        overhead for every normal call
      - tail_call_v4 fixes the above architectural shortcomings, and v5 and v6 fix a few
        more bugs
      
      This last tail_call_v6 approach seems to be the best.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3f55b7ed
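The control flow described in the cover letter above can be sketched in plain C. This is a user-space model only: it uses ordinary C calls, so it illustrates the empty-slot fall-through and the limit of 32, not the stack-frame reuse. The names `struct ctx`, `tail_call()`, `prog_a`, `prog_b`, `setup_loop()`, `run_chain()` and `TABLE_SIZE` are hypothetical and do not come from the patch set.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAIL_CALL_CNT 32 /* limit quoted in the cover letter */
#define TABLE_SIZE 4         /* hypothetical jump table size */

struct ctx {
    unsigned int tail_call_cnt; /* tail calls taken so far in this chain */
    int runs;                   /* programs executed, for illustration only */
};

typedef int (*prog_fn)(struct ctx *);

static prog_fn jmp_table[TABLE_SIZE]; /* NULL means "empty slot" */

/* Returns 1 and stores the callee's result if the jump was taken,
 * 0 if the slot is empty, out of range, or the limit is reached. */
static int tail_call(struct ctx *c, unsigned int index, int *ret)
{
    if (index >= TABLE_SIZE || jmp_table[index] == NULL)
        return 0;
    if (c->tail_call_cnt >= MAX_TAIL_CALL_CNT)
        return 0;
    c->tail_call_cnt++;
    *ret = jmp_table[index](c);
    return 1;
}

/* Two toy programs that tail-call each other, forming a loop. */
static int prog_a(struct ctx *c)
{
    int ret;

    c->runs++;
    if (tail_call(c, 1, &ret))
        return ret;
    return c->runs; /* fall through: no jump happened */
}

static int prog_b(struct ctx *c)
{
    int ret;

    c->runs++;
    if (tail_call(c, 0, &ret))
        return ret;
    return c->runs; /* fall through: no jump happened */
}

/* Fill slots 0 and 1 with the two toy programs. */
void setup_loop(void)
{
    jmp_table[0] = prog_a;
    jmp_table[1] = prog_b;
}

/* Start a fresh chain at the given slot and return its result. */
int run_chain(unsigned int entry)
{
    struct ctx c = { 0, 0 };

    assert(entry < TABLE_SIZE && jmp_table[entry] != NULL);
    return jmp_table[entry](&c);
}
```

After `setup_loop()`, `run_chain(0)` executes 33 programs in this model: the entry program plus 32 tail calls, after which the next attempt falls through. If slot 1 is cleared, `prog_a` simply continues past the call.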
    • Alexei Starovoitov's avatar
      samples/bpf: bpf_tail_call example for networking · 530b2c86
      Alexei Starovoitov authored
      Usage:
      $ sudo ./sockex3
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     11422636       173070
      127.0.0.1.33778 -> 127.0.0.1.59526  11260224828       341974
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      IP     src.port -> dst.port               bytes      packets
      127.0.0.1.42010 -> 127.0.0.1.12865         1568            8
      127.0.0.1.59526 -> 127.0.0.1.33778     23198092       351486
      127.0.0.1.33778 -> 127.0.0.1.59526  22972698518       698616
      127.0.0.1.12865 -> 127.0.0.1.42010         1832           12
      
      This example is similar to sockex2 in that it accumulates per-flow
      statistics, but it parses packets differently.
      sockex2 inlines the full packet parser routine into a single bpf program.
      The sockex3 example has 4 independent programs that parse vlan, mpls, ip, ipv6
      and one main program that starts the process.
      The bpf_tail_call() mechanism allows each program to be small and to be called
      on demand, potentially multiple times, so that many vlan, mpls, ip in ip,
      gre encapsulations can be parsed. These and other protocol parsers can
      be added or removed at runtime. TLVs can be parsed in a similar manner.
      Note that the dynamic tail_call_cnt check limits the number of tail calls to 32.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      530b2c86
    • Alexei Starovoitov's avatar
      samples/bpf: bpf_tail_call example for tracing · 5bacd780
      Alexei Starovoitov authored
      kprobe example that demonstrates what future seccomp programs may look like.
      It attaches to seccomp_phase1() function and tail-calls other BPF programs
      depending on syscall number.
      
      Existing optimized classic BPF seccomp programs generated by Chrome look like:
      if (sd.nr < 121) {
        if (sd.nr < 57) {
          if (sd.nr < 22) {
            if (sd.nr < 7) {
              if (sd.nr < 4) {
                if (sd.nr < 1) {
                  check sys_read
                } else {
                  if (sd.nr < 3) {
                    check sys_write and sys_open
                  } else {
                    check sys_close
                  }
                }
              } else {
            } else {
          } else {
        } else {
      } else {
      }
      
      the future seccomp using native eBPF may look like:
        bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
      which is simpler, faster and leaves more room for per-syscall checks.
      
      Usage:
      $ sudo ./tracex5
      <...>-366   [001] d...     4.870033: : read(fd=1, buf=00007f6d5bebf000, size=771)
      <...>-369   [003] d...     4.870066: : mmap
      <...>-369   [003] d...     4.870077: : syscall=110 (one of get/set uid/pid/gid)
      <...>-369   [003] d...     4.870089: : syscall=107 (one of get/set uid/pid/gid)
         sh-369   [000] d...     4.891740: : read(fd=0, buf=00000000023d1000, size=512)
         sh-369   [000] d...     4.891747: : write(fd=1, buf=00000000023d3000, size=512)
         sh-369   [000] d...     4.891747: : read(fd=1, buf=00000000023d3000, size=512)
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5bacd780
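The contrast drawn in the commit above, a tree of comparisons versus a single indexed jump, can be sketched in user-space C with a table of function pointers. This is an illustration only, not the tracex5 sample: `struct seccomp_data_sketch`, `enum verdict`, the `check_*` handlers, `install_checks()`, `dispatch()` and `NR_MAX` are hypothetical names. The slot numbers 0, 1 and 3 are the x86-64 syscall numbers of read, write and close.

```c
#include <assert.h>
#include <stddef.h>

#define NR_MAX 512 /* hypothetical table size */

enum verdict { VERDICT_DEFAULT = 0, VERDICT_READ, VERDICT_WRITE, VERDICT_CLOSE };

struct seccomp_data_sketch {
    int nr; /* syscall number */
};

typedef enum verdict (*check_fn)(const struct seccomp_data_sketch *);

static check_fn syscall_jmp_table[NR_MAX]; /* NULL means "no per-syscall check" */

static enum verdict check_read(const struct seccomp_data_sketch *sd)
{
    (void)sd;
    return VERDICT_READ;
}

static enum verdict check_write(const struct seccomp_data_sketch *sd)
{
    (void)sd;
    return VERDICT_WRITE;
}

static enum verdict check_close(const struct seccomp_data_sketch *sd)
{
    (void)sd;
    return VERDICT_CLOSE;
}

/* Install handlers at their x86-64 syscall numbers. */
void install_checks(void)
{
    syscall_jmp_table[0] = check_read;
    syscall_jmp_table[1] = check_write;
    syscall_jmp_table[3] = check_close;
}

/* One bounds check and one indexed jump replace the nested comparisons. */
enum verdict dispatch(const struct seccomp_data_sketch *sd)
{
    assert(sd != NULL);
    if (sd->nr >= 0 && sd->nr < NR_MAX && syscall_jmp_table[sd->nr] != NULL)
        return syscall_jmp_table[sd->nr](sd);
    return VERDICT_DEFAULT; /* unknown syscall: fall through to the default path */
}
```

The lookup cost does not grow with the number of handled syscalls, and each handler is free to inspect the arguments of its own syscall.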
    • Alexei Starovoitov's avatar
      x86: bpf_jit: implement bpf_tail_call() helper · b52f00e6
      Alexei Starovoitov authored
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      In this implementation x64 JIT bypasses stack unwind and jumps into the
      callee program after prologue, so the callee program reuses the same stack.
      
      The logic can be roughly expressed in C like:
      
      u32 tail_call_cnt;
      
      void *jumptable[2] = { &&label1, &&label2 };
      
      int bpf_prog1(void *ctx)
      {
      label1:
          ...
      }
      
      int bpf_prog2(void *ctx)
      {
      label2:
          ...
      }
      
      int bpf_prog1(void *ctx)
      {
          ...
          if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
              goto *jumptable[index]; ... and pass my 'ctx' to callee ...
      
          ... fall through if no entry in jumptable ...
      }
      
      Note that 'skip current program epilogue and next program prologue' is
      an optimization. Other JITs don't have to do it the same way.
      From a safety point of view it's valid as well, since programs always
      initialize the stack before use, so any residue in the stack left by
      the current program is not going to be read. The same verifier checks are
      done for the calls from the kernel into all bpf programs.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b52f00e6
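The C-like outline in the commit above is not compilable as written, since a `goto` cannot cross function boundaries. A compilable approximation, assuming the GNU C "labels as values" extension (GCC and Clang), places both program bodies inside one function so that they share a single stack frame. The names `run()`, `TAIL_CALL`, `PROG1`, `PROG2` and the arithmetic on `ctx` are hypothetical; this sketches the control flow and is not what the x64 JIT emits.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TAIL_CALL_CNT 32

enum { PROG1 = 0, PROG2 = 1, NPROGS = 2 };

/* Jump to program body `idx` while the limit allows it; otherwise fall through. */
#define TAIL_CALL(idx)                               \
    do {                                             \
        if ((idx) >= 0 && (idx) < NPROGS &&          \
            tail_call_cnt++ < MAX_TAIL_CALL_CNT)     \
            goto *jumptable[(idx)];                  \
    } while (0)

/* Runs the program in slot `entry`; returns the number of the program that
 * finally fell through (1 or 2), or -1 for an invalid entry slot. */
int run(int entry, int *ctx)
{
    /* Taking the address of a label with && is a GNU C extension. */
    void *jumptable[NPROGS] = { &&prog1_body, &&prog2_body };
    unsigned int tail_call_cnt = 0;

    assert(ctx != NULL);
    if (entry < 0 || entry >= NPROGS)
        return -1;
    goto *jumptable[entry];

prog1_body:
    *ctx += 1;
    TAIL_CALL(PROG2); /* same 'ctx', same stack frame */
    return 1;         /* fall through once the limit is reached */

prog2_body:
    *ctx += 100;
    TAIL_CALL(PROG1);
    return 2;
}
```

Starting at `PROG1` with `*ctx == 0`, the two bodies alternate for 32 jumps: `prog1_body` runs 17 times and `prog2_body` 16 times, so `*ctx` ends at 1617 and `run()` returns 1.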
    • Alexei Starovoitov's avatar
      bpf: allow bpf programs to tail-call other bpf programs · 04fd61ab
      Alexei Starovoitov authored
      introduce bpf_tail_call(ctx, &jmp_table, index) helper function
      which can be used from BPF programs like:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        bpf_tail_call(ctx, &jmp_table, index);
        ...
      }
      that is roughly equivalent to:
      int bpf_prog(struct pt_regs *ctx)
      {
        ...
        if (jmp_table[index])
          return (*jmp_table[index])(ctx);
        ...
      }
      The important detail is that it's not a normal call, but a tail call.
      The kernel stack is precious, so this helper reuses the current
      stack frame and jumps into another BPF program without adding
      an extra call frame.
      It's trivially done in the interpreter and a bit trickier in JITs.
      In the case of the x64 JIT, the bigger part of the generated assembler prologue
      is common to all programs, so it is simply skipped while jumping.
      Other JITs can do a similar prologue-skipping optimization or
      do a stack unwind before jumping into the next program.
      
      bpf_tail_call() arguments:
      ctx - context pointer
      jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
      index - index in the jump table
      
      Since all BPF programs are identified by file descriptor, user space
      needs to populate the jmp_table with FDs of other BPF programs.
      If jmp_table[index] is empty, bpf_tail_call() doesn't jump anywhere
      and program execution continues as normal.
      
      New BPF_MAP_TYPE_PROG_ARRAY map type is introduced so that user space can
      populate this jmp_table array with FDs of other bpf programs.
      Programs can share the same jmp_table array or use multiple jmp_tables.
      
      The chain of tail calls can form unpredictable dynamic loops, so
      tail_call_cnt is used to limit the number of calls; the limit is currently 32.
      
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      
      Use cases:
      ==========
      - simplify complex programs by splitting them into a sequence of small programs
      
      - dispatch routine
        For tracing and future seccomp the program may be triggered on all system
        calls, but processing of syscall arguments will be different. It's more
        efficient to implement them as:
        int syscall_entry(struct seccomp_data *ctx)
        {
           bpf_tail_call(ctx, &syscall_jmp_table, ctx->nr /* syscall number */);
           ... default: process unknown syscall ...
        }
        int sys_write_event(struct seccomp_data *ctx) {...}
        int sys_read_event(struct seccomp_data *ctx) {...}
        syscall_jmp_table[__NR_write] = sys_write_event;
        syscall_jmp_table[__NR_read] = sys_read_event;
      
        For networking the program may call into different parsers depending on
        packet format, like:
        int packet_parser(struct __sk_buff *skb)
        {
           ... parse L2, L3 here ...
           __u8 ipproto = load_byte(skb, ... offsetof(struct iphdr, protocol));
           bpf_tail_call(skb, &ipproto_jmp_table, ipproto);
           ... default: process unknown protocol ...
        }
        int parse_tcp(struct __sk_buff *skb) {...}
        int parse_udp(struct __sk_buff *skb) {...}
        ipproto_jmp_table[IPPROTO_TCP] = parse_tcp;
        ipproto_jmp_table[IPPROTO_UDP] = parse_udp;
      
      - for the TC use case, bpf_tail_call() makes it possible to implement reclassify-like logic
      
      - bpf_map_update_elem/delete calls into BPF_MAP_TYPE_PROG_ARRAY jump table
        are atomic, so user space can build chains of BPF programs on the fly
      
      Implementation details:
      =======================
      - high performance of bpf_tail_call() is the goal.
        It could have been implemented without JIT changes as a wrapper on top of
        BPF_PROG_RUN() macro, but with two downsides:
        . all programs would have to pay performance penalty for this feature and
          tail call itself would be slower, since mandatory stack unwind, return,
          stack allocate would be done for every tailcall.
        . tailcall would be limited to programs running preempt_disabled, since
          generic 'void *ctx' doesn't have room for 'tail_call_cnt' and it would
          need to be either global per_cpu variable accessed by helper and by wrapper
          or global variable protected by locks.
      
        In this implementation x64 JIT bypasses stack unwind and jumps into the
        callee program after prologue.
      
      - bpf_prog_array_compatible() ensures that the prog_type of callee and caller
        is the same and that the JITed/non-JITed flag is the same: calling a JITed
        program from a non-JITed one is invalid because the stack frames are different.
        Similarly, calling a kprobe type program from a socket type program is invalid.
      
      - jump table is implemented as BPF_MAP_TYPE_PROG_ARRAY to reuse 'map'
        abstraction, its user space API and all of verifier logic.
        It's in the existing arraymap.c file, since several functions are
        shared with regular array map.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04fd61ab
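The compatibility rule from the implementation details above (same prog_type and same JITed flag for every program in one jump table) can be sketched as follows. The structure here is an assumption made for illustration: the first program stored in an array pins the type and JIT flag, and later programs must match. The names `struct prog_sketch`, `struct prog_array_sketch`, `enum prog_type_sketch` and `prog_array_compatible()` are hypothetical and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum prog_type_sketch { PROG_SOCKET_FILTER, PROG_KPROBE };

struct prog_sketch {
    enum prog_type_sketch type;
    bool jited;
};

struct prog_array_sketch {
    bool owner_set;                   /* has any program been stored yet? */
    enum prog_type_sketch owner_type; /* type pinned by the first program */
    bool owner_jited;                 /* JIT flag pinned by the first program */
};

/* Assumed policy: the first program pins type and JIT flag for the array;
 * every later program must match both to be accepted. */
bool prog_array_compatible(struct prog_array_sketch *arr,
                           const struct prog_sketch *p)
{
    assert(arr != NULL && p != NULL);
    if (!arr->owner_set) {
        arr->owner_type = p->type;
        arr->owner_jited = p->jited;
        arr->owner_set = true;
        return true;
    }
    return arr->owner_type == p->type && arr->owner_jited == p->jited;
}
```

Under this policy, a kprobe program can never be reached by a tail call from a socket program, and a JITed program can never be entered from the interpreter, because both would have to sit in the same array first.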
    • Daniel Borkmann's avatar
      net: dev: reduce both ingress hook ifdefs · e7582bab
      Daniel Borkmann authored
      Reduce ifdef pollution slightly, no functional change. We can simply
      remove the extra alternative definition of handle_ing() and nf_ingress().
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e7582bab
    • Eric Dumazet's avatar
      tcp: add a force_schedule argument to sk_stream_alloc_skb() · eb934478
      Eric Dumazet authored
      In commit 8e4d980a ("tcp: fix behavior for epoll edge trigger")
      we fixed a possible hang of TCP sockets under memory pressure,
      by allowing sk_stream_alloc_skb() to use sk_forced_mem_schedule()
      if no packet is in socket write queue.
      
      It turns out there are other cases where we want to force memory
      schedule:
      
      tcp_fragment() & tso_fragment() need to split a big TSO packet into
      two smaller ones. If we block here because of TCP memory pressure,
      we can effectively block the TCP socket from sending new data.
      If no further ACK is coming, this hang would be permanent, and the socket
      has no chance to effectively reduce its memory usage.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eb934478
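The idea behind the new argument can be shown with a toy accounting model: a normal allocation is refused under memory pressure, while a caller that must make progress (such as a packet split) passes a force flag and is charged anyway. This is a simplified sketch, not the kernel code: `struct sock_sketch`, its fields and `alloc_skb_sketch()` are hypothetical names, and the single-limit pressure test is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sock_sketch {
    size_t mem_used;  /* bytes currently charged to the socket */
    size_t mem_limit; /* hypothetical memory pressure threshold */
};

/* Returns true if `size` bytes were charged to the socket.
 * Without force_schedule the request is refused once it would cross the
 * limit; with force_schedule it is always charged. */
bool alloc_skb_sketch(struct sock_sketch *sk, size_t size, bool force_schedule)
{
    bool under_pressure;

    assert(sk != NULL);
    under_pressure = sk->mem_used + size > sk->mem_limit;
    if (under_pressure && !force_schedule)
        return false;
    sk->mem_used += size;
    return true;
}
```

In this model a fragmenting caller would pass `force_schedule = true`, so the split can complete and the queued data can eventually be sent and freed, while ordinary senders still back off under pressure.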
    • Erik Kline's avatar
      neigh: Better handling of transition to NUD_PROBE state · 765c9c63
      Erik Kline authored
      [1] When entering NUD_PROBE state via neigh_update(), perhaps received
          from userspace, correctly (re)initialize the probes count to zero.
      
          This is useful for forcing revalidation of a neighbor (for example
          if the host is attempting to do DNA [IPv4: RFC 4436, IPv6: RFC 6059]).
      
      [2] Notify listeners when a neighbor goes into NUD_PROBE state.
      
          By sending notifications on entry to NUD_PROBE state listeners get
          more timely warnings of imminent connectivity issues.
      
          The current notifications on entry to NUD_STALE have somewhat
          limited usefulness: NUD_STALE is a perfectly normal state, as is
          NUD_DELAY, whereas notifications on entry to NUD_FAILED come after
          a neighbor reachability problem has been confirmed (typically after
          three probes).
      Signed-off-by: Erik Kline <ek@google.com>
      Acked-By: Lorenzo Colitti <lorenzo@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      765c9c63
  2. 19 May, 2015 2 commits
    • Andy Zhou's avatar
      ip: remove unused function prototype · 06b2c61c
      Andy Zhou authored
      ip_do_nat() function was removed prior to kernel 3.4. Remove the
      unnecessary function prototype as well.
      Reported-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Andy Zhou <azhou@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      06b2c61c
    • Daniel Borkmann's avatar
      tcp: add rfc3168, section 6.1.1.1. fallback · 49213555
      Daniel Borkmann authored
      This work is a follow-up to commit f7b3bec6 ("net: allow setting ecn
      via routing table") and adds RFC3168 section 6.1.1.1. fallback for outgoing
      ECN connections. In other words, this work adds a retry with a non-ECN
      setup SYN packet, as suggested by the RFC, on the first timeout:
      
        [...] A host that receives no reply to an ECN-setup SYN within the
        normal SYN retransmission timeout interval MAY resend the SYN and
        any subsequent SYN retransmissions with CWR and ECE cleared. [...]
      
      Schematic client-side view when assuming the server is in tcp_ecn=2 mode,
      that is, Linux default since 2009 via commit 255cac91 ("tcp: extend
      ECN sysctl to allow server-side only ECN"):
      
       1) Normal ECN-capable path:
      
          SYN ECE CWR ----->
                      <----- SYN ACK ECE
                  ACK ----->
      
       2) Path with broken middlebox, when client has fallback:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
                  SYN ----->
                      <----- SYN ACK
                  ACK ----->
      
      In case we would not have the fallback implemented, the middlebox drop
      point would basically end up as:
      
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
          SYN ECE CWR ----X crappy middlebox drops packet
                            (timeout, rtx)
      
      In any case, such additional setup latency would occur only for a rather
      small percentage of sites: at the end of 2014 it was found that ~56%
      of IPv4 and 65% of IPv6 servers of the Alexa 1 million list would negotiate
      ECN (aka tcp_ecn=2 default), while 0.42% of these webservers would fail to connect
      when trying to negotiate with ECN (tcp_ecn=1) due to timeouts, which the
      fallback would mitigate with a slight latency trade-off. A recent related
      paper on this topic:
      
        Brian Trammell, Mirja Kühlewind, Damiano Boppart, Iain Learmonth,
        Gorry Fairhurst, and Richard Scheffenegger:
          "Enabling Internet-Wide Deployment of Explicit Congestion Notification."
          Proc. PAM 2015, New York.
        http://ecn.ethz.ch/ecn-pam15.pdf
      
      Thus, when net.ipv4.tcp_ecn=1 is set, the patch will perform the RFC3168,
      section 6.1.1.1. fallback on timeout. For users who explicitly do not want this,
      which can be the case in DC use cases, we add a net.ipv4.tcp_ecn_fallback knob that
      allows disabling the fallback.
      
      tp->ecn_flags are not being cleared in tcp_ecn_clear_syn() on output, but
      rather we let tcp_ecn_rcv_synack() take that over on input path in case a
      SYN ACK ECE was delayed. Thus a spurious SYN retransmission will not prevent
      ECN being negotiated eventually in that case.
      
      Reference: https://www.ietf.org/proceedings/92/slides/slides-92-iccrg-1.pdf
      Reference: https://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mirja Kühlewind <mirja.kuehlewind@tik.ee.ethz.ch>
      Signed-off-by: Brian Trammell <trammell@tik.ee.ethz.ch>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49213555
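The client-side behaviour in the diagrams above can be condensed into one decision: which TCP flags to put on the SYN for a given attempt. The sketch below models that decision in user space. `struct ecn_cfg`, `syn_flags()` and the `retransmits` parameter are hypothetical names; the two config fields mirror the net.ipv4.tcp_ecn and net.ipv4.tcp_ecn_fallback sysctls named in the commit message. The flag values are the standard TCP header bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAG_SYN 0x02 /* TCP header flag bits */
#define FLAG_ECE 0x40
#define FLAG_CWR 0x80

struct ecn_cfg {
    int tcp_ecn;          /* 0: off, 1: request ECN on outgoing SYNs, 2: server-side only */
    int tcp_ecn_fallback; /* non-zero: drop ECE/CWR on SYN retransmissions */
};

/* Flags for a SYN that has already been retransmitted `retransmits` times
 * (0 means the initial SYN). */
unsigned char syn_flags(const struct ecn_cfg *cfg, unsigned int retransmits)
{
    unsigned char flags = FLAG_SYN;
    bool want_ecn;
    bool fall_back;

    assert(cfg != NULL);
    want_ecn = cfg->tcp_ecn == 1;
    fall_back = cfg->tcp_ecn_fallback != 0 && retransmits >= 1;
    if (want_ecn && !fall_back)
        flags |= FLAG_ECE | FLAG_CWR; /* ECN-setup SYN */
    return flags;
}
```

With tcp_ecn=1 and the fallback enabled, the initial SYN carries ECE and CWR and every retransmission is a plain SYN. With the fallback disabled, retransmissions keep both bits, which corresponds to the third diagram.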