1. 06 Mar, 2015 40 commits
    • fib_trie: Add tnode struct as a container for fields not needed in key_vector · dc35dbed
      Alexander Duyck authored
      This change pulls the fields not explicitly needed in the key_vector and
      places them in the new tnode structure.  By doing this we will eventually
      be able to reduce the key_vector down to 16 bytes on 64 bit systems, and
      12 bytes on 32 bit systems.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
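The container idea described above can be sketched in plain C. This is a minimal illustration, not the kernel's code: the field lists, the `child[1]` payload, and the helper name `tnode_of` are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified layout of the hot lookup fields. */
struct key_vector {
	uint32_t key;
	unsigned char pos;   /* bit position of this node's index field */
	unsigned char bits;  /* log2 of the child count; 0 for a leaf */
	unsigned char slen;  /* longest suffix length below this node */
	struct key_vector *child[1]; /* stand-in for the per-node payload */
};

/* Hypothetical container: bookkeeping that lookups never touch. */
struct tnode {
	unsigned int empty_children;
	unsigned int full_children;
	struct key_vector *parent;
	struct key_vector kv; /* embedded key_vector */
};

/* Recover the enclosing tnode from a pointer to its embedded key_vector. */
static inline struct tnode *tnode_of(struct key_vector *kv)
{
	return (struct tnode *)((char *)kv - offsetof(struct tnode, kv));
}
```

Lookups only ever hold `struct key_vector` pointers, while resize code that needs the counters steps back to the container with `tnode_of()`.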
    • fib_trie: Rename tnode_child_length to child_length · 2e1ac88a
      Alexander Duyck authored
      We are now checking the length of a key_vector instead of a tnode so it
      makes sense to probably just rename this to child_length since it would
      probably even be applicable to a leaf.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: replace tnode_get_child functions with get_child macros · 754baf8d
      Alexander Duyck authored
      I am replacing the tnode_get_child call with get_child since we are
      technically pulling the child out of a key_vector now and not a tnode.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Rename tnode to key_vector · 35c6edac
      Alexander Duyck authored
      Rename the tnode to key_vector.  The key_vector will be the eventual
      container for all of the information needed by either a leaf or a tnode.
      The final result should be much smaller than the 40 bytes currently needed
      for either one.
      
      This also updates the trie struct so that it contains an array of size 1 of
      tnode pointers.  This is to bring the structure more in line with how an
      actual tnode itself is configured.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Return pointer to tnode pointer in resize/inflate/halve · 8d8e810c
      Alexander Duyck authored
      Resize related functions now all return a pointer to the pointer that
      references the object that was resized.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Minor cleanups to fib_table_flush_external · 72be7260
      Alexander Duyck authored
      This change just does a couple of minor cleanups on
      fib_table_flush_external.  Specifically it addresses the fact that resize
      was being called even though nothing was being removed from the table, and
      it drops an unnecessary indent since we could just call continue on the
      inverse of the fi && flag check.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: macb: Fix multi queue support for xilinx ZynqMP · 20488239
      Punnaiah Choudary Kalluri authored
      The ZynqMP SoC has a single interrupt for all the queue events, so
      pass the IRQF_SHARED flag in the interrupt registration call.
      Signed-off-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
      Signed-off-by: Michal Simek <michal.simek@xilinx.com>
      Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: macb: Include multi queue support for xilinx ZynqMP ethernet version · 8a013a9c
      Punnaiah Choudary Kalluri authored
      Include multi queue support for the ethernet IP version in xilinx ZynqMP
      SoC.
      Signed-off-by: Punnaiah Choudary Kalluri <punnaia@xilinx.com>
      Signed-off-by: Michal Simek <michal.simek@xilinx.com>
      Acked-by: Nicolas Ferre <nicolas.ferre@atmel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'wireless-drivers-next-for-davem-2015-03-06' of... · 28c0f02f
      David S. Miller authored
      Merge tag 'wireless-drivers-next-for-davem-2015-03-06' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next
      
      Major changes:
      
      brcmfmac:
      
      * sdio improvements
      * add a debugfs file so users can provide us all the revinfo we could
        ask for
      
      iwlwifi:
      
      * add triggers for firmware dump collection
      * remove support for -9.ucode
      * new statistics API
      * rate control improvements
      
      ath9k:
      
      * add per-vif TX power capability
      * BT coexistence fixes
      
      ath10k:
      
      * qca6174: enable STA transmit beamforming (TxBF) support
      * disable multi-vif power save by default
      
      bcma:
      
      * enable support for PCIe Gen 2 host devices
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib: make netdev_switch_fib_ipv4_abort in header file static inline · 89650ad0
      Willem de Bruijn authored
      When building without CONFIG_NET_SWITCHDEV,
      netdev_switch_fib_ipv4_abort is defined in the header file. It must
      be static inline to avoid build failure at link time.
      
      Fixes: 8e05fd71 ("fib: hook IPv4 fib for hardware offload")
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
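The header pattern behind this fix can be sketched as follows. The macro `EXAMPLE_HAVE_SWITCHDEV` and the function name `example_fib_ipv4_abort` are placeholders invented for the illustration, not the kernel's identifiers.

```c
struct fib_info; /* opaque here; only a pointer is passed around */

#ifdef EXAMPLE_HAVE_SWITCHDEV
/* Feature enabled: the one real definition lives in a single .c file. */
void example_fib_ipv4_abort(struct fib_info *fi);
#else
/*
 * Feature disabled: the header supplies an empty stub.  Marking it
 * static inline gives every including file its own private copy, so
 * the linker never sees two external definitions of the same symbol.
 */
static inline void example_fib_ipv4_abort(struct fib_info *fi)
{
	(void)fi;
}
#endif
```

Without `static inline`, every translation unit that includes the header would emit its own external definition of the stub, which is the link-time failure the commit describes.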
    • mpls: Properly validate RTA_VIA payload length · f8d54afc
      Robert Shearman authored
      If the nla length is less than 2 then the nla data could be accessed
      beyond the accessible bounds. So ensure that the nla is big enough to
      at least read the via_family before doing so. Replace magic value of
      2.
      
      Fixes: 03c05665 ("mpls: Basic support for adding and removing routes")
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Robert Shearman <rshearma@brocade.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
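A minimal sketch of the length check described above, assuming a payload that starts with a 16-bit family followed by address bytes. The struct `via_payload` and the function `via_read_family` are hypothetical names for this example; the point is that the minimum length comes from `offsetof` rather than a literal 2.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the attribute payload layout. */
struct via_payload {
	uint16_t family;
	uint8_t addr[];
};

/*
 * Store the family in *family and return 0, or return -1 when the
 * payload is too short to contain the family field at all.
 */
static int via_read_family(const void *data, size_t len, uint16_t *family)
{
	/* Named bound instead of a magic 2: the bytes that precede addr. */
	if (len < offsetof(struct via_payload, addr))
		return -1;
	memcpy(family, data, sizeof(*family));
	return 0;
}
```

The check runs before any byte of the payload is read, so a one-byte attribute is rejected instead of being read past its end.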
    • Merge branch 'bcmgenet-next' · 5c4b934f
      David S. Miller authored
      Petri Gynther says:
      
      ====================
      net: bcmgenet: preparation for multiple Rx queues
      
      Three small patches in preparation for supporting multiple Rx queues:
      1. set hw_params->rx_queues = 0
      2. adjust the call to alloc_etherdev_mqs()
      3. add GENET_Q16_RX_BD_CNT and hw_params->rx_bds_per_q
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bcmgenet: add GENET_Q16_RX_BD_CNT and hw_params->rx_bds_per_q · 3feafa02
      Petri Gynther authored
      In preparation for supporting multiple Rx queues, add GENET_Q16_RX_BD_CNT
      and hw_params->rx_bds_per_q.
      Signed-off-by: Petri Gynther <pgynther@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bcmgenet: adjust the call to alloc_etherdev_mqs() · 3feafeed
      Petri Gynther authored
      In preparation for supporting multiple Rx queues, adjust the call to
      alloc_etherdev_mqs() to allow max GENET_MAX_MQ_CNT + 1 Rx queues.
      
      The actual number of Rx queues in use is correctly adjusted with:
      netif_set_real_num_rx_queues(priv->dev, priv->hw_params->rx_queues + 1);
      Signed-off-by: Petri Gynther <pgynther@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bcmgenet: set hw_params->rx_queues = 0 · 7e906e02
      Petri Gynther authored
      bcmgenet driver doesn't yet support multiple Rx queues.
      Set hw_params->rx_queues = 0 accordingly.
      The default Rx queue (Q16) is still created and operational.
      Signed-off-by: Petri Gynther <pgynther@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-next · 76f53bfd
      David S. Miller authored
      Jeff Kirsher says:
      
      ====================
      Intel Wired LAN Driver Updates 2015-03-06
      
      This series contains updates to e1000, e1000e and igb.
      
      Yanir provides updates to e1000e based on the patches provided by John
      Linville.  The first updates the code comment to better describe the changes
      and the impact on the driver.  The second removes calls to ioremap/unmap for
      i219, since the mapping is relevant to older hardware only.  Starting with
      i219, the NVM will not be mapped to its own BAR but to an address region
      in another BAR.
      
      Alex Duyck provides two fixes for igb, first fixes a compile warning
      where a variable may be used uninitialized, so Alex initializes it.
      Second fixes an issue where all of the pin register values were having
      to be pushed onto the stack each time the function was called, so to
      avoid this, Alex made them static const so that they should only need
      to be allocated once and we can avoid all the instructions to get them
      onto the stack.
      
      Eliezer found an issue in e1000 where we needed to be calling
      netif_carrier_off earlier in the down() to prevent the stack from
      queuing more packets to the interface.
      
      Sabrina Dubroca resolved a potential race condition by adding a
      dummy allocator.  There was a race condition between e1000_change_mtu()
      cleanups and netpoll, when changing the MTU across jumbo sizes.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'pmtu-probe' · f0fdc80b
      David S. Miller authored
      Fan Du says:
      
      ====================
      Improvements for TCP PMTU
      
      This patchset makes some improvements and enhancements
      to the current TCP PMTU probing as per RFC4821, with the aim of finding
      the optimal mss size quickly and adapting to route
      changes such as an enlarged path MTU. TCP PMTU can then be
      used to probe an effective pmtu in the absence of ICMP messages
      for tunnels (e.g. vxlan) across different networking stacks.
      
      Patch1/4: Set probe mss base to 1024 Bytes per RFC4821
      Patch2/4: Do not double probe_size for each probing,
                use a simple binary search to gain maximum performance.
      	  mss for next probing.
      Patch3/4: Create a probe timer to detect enlarged path MTU.
      Patch4/4: Update ip-sysctl.txt for new sysctl knobs.
      
      Changelog:
      v5:
        - Zero probe_size before resetting search range.
        - Update ip-sysctl.txt for new sysctl knobs.
      v4:
        - Convert probe_size to mss, not directly from search_low/high
        - Clamp probe_threshold
        - Don't adjust search_high in blackhole probe, so drop original patch3
      v3:
        - Update commit message for patch2
        - Fix pseudo timer delta calculation in patch4
      v2:
        - Introduce sysctl_tcp_probe_threshold to control when
          probing will stop, as suggested by John Heffner.
        - Add patch3 to shrink current mss value for search low boundary.
        - Drop canonical timer usages, implement pseudo timer based on
          32-bit jiffies tcp_time_stamp, as suggested by Eric Dumazet.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Documenting two sysctls for tcp PMTU probe · fab42760
      Fan Du authored
      Namely tcp_probe_interval to control how often to restart
      a probe, and tcp_probe_threshold to control when to stop
      probing with respect to the width of the search range in bytes.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Create probe timer for tcp PMTU as per RFC4821 · 05cbc0db
      Fan Du authored
      As per RFC4821 7.3.  Selecting Probe Size, a probe timer should
      be armed once probing has converged. Once this timer expires,
      probe again to take advantage of any path PMTU change. The
      recommended probing interval is 10 minutes per RFC1981. The probing
      interval can be tuned through sysctl_tcp_probe_interval.

      Eric Dumazet suggested implementing a pseudo timer based on the 32-bit
      jiffies tcp_time_stamp instead of using a classic timer for such a
      rare event.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
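A pseudo timer of the kind described above needs no timer object: it stores a timestamp and compares it against the current tick count whenever the code path happens to run. The sketch below assumes a free-running 32-bit tick counter; the function name `probe_interval_elapsed` is made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true once at least `interval` ticks have passed since `stamp`.
 * Unsigned subtraction keeps the result correct when the 32-bit counter
 * wraps around between `stamp` and `now`.
 */
static bool probe_interval_elapsed(uint32_t now, uint32_t stamp,
				   uint32_t interval)
{
	return (uint32_t)(now - stamp) >= interval;
}
```

The caller would record a new `stamp` when probing converges and call this check on a later transmit path, restarting the search once it returns true.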
    • ipv4: Use binary search to choose tcp PMTU probe_size · 6b58e0a5
      Fan Du authored
      Currently probe_size is chosen by doubling mss_cache, so
      the probing process ends early with a sub-optimal
      mss size and the link mtu is not taken full
      advantage of. In turn, this forces the user to tweak
      tcp_base_mss with care.

      Use binary search to choose probe_size in a fine-grained
      manner, so that an optimal mss is found
      and performance is boosted to its maximum.

      In addition, introduce a sysctl_tcp_probe_threshold
      to control when probing will stop with respect to
      the width of the search range.
      
      Test env:
      Docker instance with vxlan encapsulation(82599EB)
      iperf -c 10.0.0.24  -t 60
      
      before this patch:
      1.26 Gbits/sec
      
      After this patch: increase 26%
      1.59 Gbits/sec
      Signed-off-by: Fan Du <fan.du@intel.com>
      Acked-by: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
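The search described above can be sketched as a plain binary search over a size range that stops once the range is narrower than a threshold. This is an illustration under stated assumptions, not the TCP stack's code: the callback type `probe_fn`, the function `search_probe_size`, and the toy oracle `fits_limit` are all invented here, and `search_low` is assumed to be a size already known to work.

```c
#include <stdbool.h>

/* Hypothetical probe: true if a packet of `size` bytes gets through. */
typedef bool (*probe_fn)(int size, void *ctx);

/* Toy oracle for experiments: ctx points at the true size limit. */
static bool fits_limit(int size, void *ctx)
{
	return size <= *(int *)ctx;
}

/*
 * Narrow [search_low, search_high] until it is less than `threshold`
 * wide, then return the largest size known to work.
 */
static int search_probe_size(int search_low, int search_high,
			     int threshold, probe_fn probe, void *ctx)
{
	if (threshold < 1)
		threshold = 1; /* guarantees the loop terminates */

	while (search_high - search_low >= threshold) {
		/* Round the midpoint up so it always exceeds search_low. */
		int probe_size = search_low +
				 (search_high - search_low + 1) / 2;

		if (probe(probe_size, ctx))
			search_low = probe_size;      /* probe succeeded */
		else
			search_high = probe_size - 1; /* probe was lost */
	}
	return search_low;
}
```

A larger threshold ends the search after fewer probes at the cost of settling up to `threshold - 1` bytes below the true limit, which is the trade-off the new sysctl exposes.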
    • ipv4: Raise tcp PMTU probe mss base size · dcd8fb85
      Fan Du authored
      Quotes from RFC4821 7.2.  Selecting Initial Values
      
         It is RECOMMENDED that search_low be initially set to an MTU size
         that is likely to work over a very wide range of environments.  Given
         today's technologies, a value of 1024 bytes is probably safe enough.
         The initial value for search_low SHOULD be configurable.
      
      Moreover, setting a small value will introduce extra time for the search
      to converge. So set the initial probe base mss size to 1024 Bytes.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Acked-by: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • DECnet: Only use neigh_ops for adding the link layer header · aaa4e704
      Eric W. Biederman authored
      Other users of the neighbour table use neigh->output as the method
      to decide when and which link-layer header to place on a packet.
      DECnet has been using neigh->output to decide which DECnet headers to
      place on a packet depending on which neighbour the packet is destined for.
      
      The DECnet usage isn't totally wrong but it can run into problems if the
      neighbour output function is run for a second time as the teql driver
      and the bridge netfilter code can do.
      
      Therefore, to avoid pathological problems later down the line and make the
      neighbour code easier to understand, refactor the decnet output
      code to only use a neighbour method to add a link layer header to a
      packet.

      This is done by moving the neighbour operations lookup from
      dn_to_neigh_output to dn_neigh_output_packet.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bonding: implement bond_poll_controller() · 616f4541
      Mahesh Bandewar authored
      This patch implements poll_controller support for the
      bonding driver. If the slaves have the poll_controller net_op defined,
      this implementation calls it. This is a mode-agnostic implementation:
      it iterates through all slaves (based on mode) and calls the respective
      handler.
      Signed-off-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Scott Feldman's avatar
      8ea69638
    • Scott Feldman's avatar
    • e1000: add dummy allocator to fix race condition between mtu change and netpoll · 08e83316
      Sabrina Dubroca authored
      There is a race condition between e1000_change_mtu's cleanups and
      netpoll, when we change the MTU across jumbo size:
      
      Changing MTU frees all the rx buffers:
          e1000_change_mtu -> e1000_down -> e1000_clean_all_rx_rings ->
              e1000_clean_rx_ring
      
      Then, close to the end of e1000_change_mtu:
          pr_info -> ... -> netpoll_poll_dev -> e1000_clean ->
              e1000_clean_rx_irq -> e1000_alloc_rx_buffers -> e1000_alloc_frag
      
      And when we come back to do the rest of the MTU change:
          e1000_up -> e1000_configure -> e1000_configure_rx ->
              e1000_alloc_jumbo_rx_buffers
      
      alloc_jumbo finds the buffers already != NULL, since data (shared with
      page in e1000_rx_buffer->rxbuf) has been re-alloc'd, but it's garbage,
      or at least not what is expected when in jumbo state.
      
      This results in an unusable adapter (packets don't get through), and a
      NULL pointer dereference on the next call to e1000_clean_rx_ring
      (other mtu change, link down, shutdown):
      
      BUG: unable to handle kernel NULL pointer dereference at           (null)
      IP: [<ffffffff81194d6e>] put_compound_page+0x7e/0x330
      
          [...]
      
      Call Trace:
       [<ffffffff81195445>] put_page+0x55/0x60
       [<ffffffff815d9f44>] e1000_clean_rx_ring+0x134/0x200
       [<ffffffff815da055>] e1000_clean_all_rx_rings+0x45/0x60
       [<ffffffff815df5e0>] e1000_down+0x1c0/0x1d0
       [<ffffffff811e2260>] ? deactivate_slab+0x7f0/0x840
       [<ffffffff815e21bc>] e1000_change_mtu+0xdc/0x170
       [<ffffffff81647050>] dev_set_mtu+0xa0/0x140
       [<ffffffff81664218>] do_setlink+0x218/0xac0
       [<ffffffff814459e9>] ? nla_parse+0xb9/0x120
       [<ffffffff816652d0>] rtnl_newlink+0x6d0/0x890
       [<ffffffff8104f000>] ? kvm_clock_read+0x20/0x40
       [<ffffffff810a2068>] ? sched_clock_cpu+0xa8/0x100
       [<ffffffff81663802>] rtnetlink_rcv_msg+0x92/0x260
      
      By setting the allocator to a dummy version, netpoll can't mess up our
      rx buffers.  The allocator is set back to a sane value in
      e1000_configure_rx.
      
      Fixes: edbbb3ca ("e1000: implement jumbo receive with partial descriptors")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
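The dummy-allocator idea can be sketched with a function pointer that is swapped to a no-op while the device is down. All names below (`struct adapter`, `alloc_real`, `alloc_dummy`, `netpoll_tick`, and so on) are hypothetical simplifications, not the driver's symbols.

```c
/* Hypothetical, heavily simplified adapter state. */
struct adapter {
	int rx_buffers; /* number of rx buffers currently allocated */
	void (*alloc_rx_buf)(struct adapter *a, int count);
};

/* Normal allocator: refills the ring. */
static void alloc_real(struct adapter *a, int count)
{
	a->rx_buffers += count;
}

/* Dummy allocator: deliberately does nothing. */
static void alloc_dummy(struct adapter *a, int count)
{
	(void)a;
	(void)count;
}

/* Going down: free everything and make later refills harmless. */
static void adapter_down(struct adapter *a)
{
	a->alloc_rx_buf = alloc_dummy;
	a->rx_buffers = 0;
}

/* Reconfiguring rx: restore the allocator that matches the new state. */
static void adapter_configure_rx(struct adapter *a)
{
	a->alloc_rx_buf = alloc_real;
}

/* A poll that may fire at any time, including between down and up. */
static void netpoll_tick(struct adapter *a)
{
	a->alloc_rx_buf(a, 16);
}
```

A poll that lands between `adapter_down()` and `adapter_configure_rx()` goes through the dummy, so it cannot repopulate the ring with buffers of the wrong kind.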
    • e1000: call netif_carrier_off early on down · f9c029db
      Eliezer Tamir authored
      When bringing down an interface, netif_carrier_off() should be
      one of the first things we do, since this will prevent the stack
      from queuing more packets to this interface.
      This operation is very fast, and should make the device behave
      much nicer when trying to bring down an interface under load.
      
      Also, this would Do The Right Thing (TM) if this device has some
      sort of fail-over teaming and redirect traffic to the other IF.
      
      Move netif_carrier_off as early as possible.
      Signed-off-by: Eliezer Tamir <eliezer.tamir@linux.intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • igb: Make arrays on stack static const to avoid reallocation · b23c0cc5
      Alexander Duyck authored
      While addressing the pin problem I noticed that all of the pin register
      values were having to be pushed onto the stack each time the function was
      called.  To avoid that I am making them static const so that they should
      only need to be allocated once and we can avoid all the instructions to get
      them onto the stack.
      
      size before:
         text	   data	    bss	    dec	    hex	filename
       161477	  10512	      8	 171997	  29fdd	drivers/net/ethernet/intel/igb/igb.ko
      
      size after:
         text	   data	    bss	    dec	    hex	filename
       161205	  10512	      8	 171725	  29ecd	drivers/net/ethernet/intel/igb/igb.ko
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • igb: Fix warning pin may be used uninitialized · e357f0aa
      Alexander Duyck authored
      When building the kernel using the gcc 4.8.3 compiler included in Fedora 20
      I was repeatedly seeing the warning:
      
       drivers/net/ethernet/intel/igb/igb_ptp.c: In function ‘igb_ptp_feature_enable_i210’:
       drivers/net/ethernet/intel/igb/igb_ptp.c:395:21: warning: ‘pin’ may be used uninitialized in this function
       [-Wmaybe-uninitialized]
         tssdp &= ~ts_sdp_en[pin];
                           ^
       drivers/net/ethernet/intel/igb/igb_ptp.c:471:6: note: ‘pin’ was declared here
         int pin;
             ^
      
      To resolve it I am assigning the pin a value of -1 when it is instantiated.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • e1000e: remove calls to ioremap/unmap for NVM addr · 1103a631
      Yanir Lubetkin authored
      Starting with I219, the NVM will not be mapped to its own BAR, but to an
      address region in another BAR.  The mapping/unmapping is relevant
      to older HW only.
      
      CC: John W Linville <linville@tuxdriver.com>
      Reported-by: John W Linville <linville@tuxdriver.com>
      Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • e1000e: fix obscure comments · 9d17ce49
      Yanir Lubetkin authored
      The interface to the device flash was modified in i219 and later HW.
      This patch better describes the change and the impact on the driver.
      
      CC: John W Linville <linville@tuxdriver.com>
      Reported-by: John W Linville <linville@tuxdriver.com>
      Signed-off-by: Yanir Lubetkin <yanirx.lubetkin@intel.com>
      Tested-by: Aaron Brown <aaron.f.brown@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    • ipv4: Fix unused variable warnings in fib_table_flush_external. · 23375a0f
      David S. Miller authored
      net/ipv4/fib_trie.c: In function ‘fib_table_flush_external’:
      net/ipv4/fib_trie.c:1572:6: warning: unused variable ‘found’ [-Wunused-variable]
        int found = 0;
            ^
      net/ipv4/fib_trie.c:1571:16: warning: unused variable ‘slen’ [-Wunused-variable]
        unsigned char slen;
                      ^
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'l3_hw_offload' · fabe7bed
      David S. Miller authored
      Scott Feldman says:
      
      ====================
      switchdev: add IPv4 routing offload
      
      v4:
      
        - Add NETIF_F_NETNS_LOCAL to rocker port feature list to keep rocker
          ports in the default netns.  Rocker hardware can't be partitioned
          to support multiple namespaces, currently.  It would be interesting
          to add netns support to rocker device by basically adding another
          match field to each table to match on some unique netns ID, with
          a port knowing its netns ID.  Future work TBD.
        - Up-level the RTNH_F_EXTERNAL marking of routes installed to offload
          device from driver to switchdev common code.  Now driver can't skip
          routes.  Either it can install the route or it cannot.  Yes or No.
          If no on any route, all offloading is aborted by removing routes
          from offload device and setting ipv4.fib_offload_disabled so no more
          routes can be offloaded.  This is harsh, but it's our starting point.
          We can refine the policies in follow-up work.
        - Add new net.ipv4.fib_offload_disabled bool that is set if anything
          goes wrong with route offloading.  We can refine this later to make
          the setting per-device or per-device-port-netdev, but let's start
          here simple and refine in follow-up work.
        - Rebase against Alex's latest FIB changes.  I think I did everything
          correctly, and didn't run into any issues with testing, but I'd like
          Alex to look over the changes and maybe follow-up with any cleanups.
      
      v3:
      
      Changes based on v2 review comments:
      
        - Move check for custom rules up earlier in patch set, to keep git bisect
          safe.
        - Simplify the route add/modify failure handling to simple try until
          failure, and then on failure, undo everything.  The switchdev driver
          will return err when route can normally be installed to device, but
          the install fails for one reason or another (no space left on device,
          etc).  If a failure happens, uninstall all routes from the device,
          punting forwarding for all routes back to the kernel.
        - Scan route's full nexthop list, ensuring all nexthop devs belong
          to the same switchdev device, otherwise don't try to install route
          to device.
      
      v2:
      
      Changes based on v1 review comments and discussions at netconf:
      
        - Allow route modification, but use same ndo op used for adding route.
          Driver/device is expected to modify route in-place, if it can, to avoid
          interruption of service.
        - Add new RTNH_F_EXTERNAL flag to mark FIB entries offloaded externally.
        - Don't offload routes if using custom IP rules.  If routes are already
          offloaded, and custom IP rules are turned on, flush routes from offload
          device.  (Offloaded routes are marked with RTNH_F_EXTERNAL).
        - Use kernel's neigh resolution code to resolve route's nexthops' neigh
          MAC addrs.  (Thanks davem, works great!).
        - Use fib->fib_priority in rocker driver to give priorities to routes in
          OF-DPA unicast route table.
      
      v1:
      
      This patch set adds L3 routing offload support for IPv4 routes.  The idea is to
      mirror routes installed in the kernel's FIB down to a hardware switch device to
      offload the data forwarding path for L3.  Only the data forwarding path is
      intercepted.  Control and management of the kernel's FIB remains with the
      kernel.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • rocker: implement IPv4 fib offloading · c1beeef7
      Scott Feldman authored
      The driver implements ndo_switch_fib_ipv4_add/del ops to add/del/mod IPv4
      routes to/from switchdev device.  Once a route is added to the device, and the
      route's nexthops are resolved to neighbor MAC address, the device will forward
      matching pkts rather than the kernel.  This offloads the L3 forwarding path
      from the kernel to the device.  Note that control and management planes are
      still managed by Linux; only the data plane is offloaded.  Standard routing
      control protocols such as OSPF and BGP run on Linux and manage the kernel's FIB
      via standard rtm netlink msgs...nothing changes here.
      
      A new hash table is added to rocker to track neighbors.  The driver listens for
      neighbor updates events using netevent notifier NETEVENT_NEIGH_UPDATE.  Any ARP
      table updates for ports on this device are recorded in this table.  Routes
      installed to the device with nexthops that reference neighbors in this table
      are "qualified".  In the case of a route with nexthops not resolved in the
      table, the kernel is asked to resolve the nexthop.
      
      The driver uses fib_info->fib_priority for the priority field in rocker's
      unicast routing table.
      
      The device can only forward pkts matching route dst to resolved nexthops.
      Currently, the device only supports single-path routes (i.e. routes with one
      nexthop).  Equal Cost Multipath (ECMP) route support will be added in followup
      patches.
      
      This patch is driver support for unicast IPv4 routing only.  Followup patches
      will add driver and infrastructure for IPv6 routing and multicast routing.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib: hook IPv4 fib for hardware offload · 8e05fd71
      Scott Feldman authored
      Call into the switchdev driver any time an IPv4 fib entry is
      added/modified/deleted from the kernel's FIB.  The switchdev driver may or
      may not install the route to the offload device.  In the case where the
      driver tries to install the route and something goes wrong (device's routing
      table is full, etc), then all of the offloaded routes will be flushed from the
      device, route forwarding falls back to the kernel, and no more routes are
      offloaded.

      We can refine this logic later.  For now, use the simplest model of offloading
      routes up to the point of failure, and then on failure, undo everything and
      mark IPv4 offloading disabled.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
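The "offload until the first failure, then undo everything" model can be sketched as below. The device model with a fixed `capacity`, and the names `offload_dev`, `net_state`, `dev_add_route`, `dev_flush` and `fib_entry_added`, are assumptions made for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical offload device with a fixed-size route table. */
struct offload_dev {
	size_t capacity;
	size_t installed;
};

/* Hypothetical per-net state. */
struct net_state {
	bool fib_offload_disabled;
};

/* Install one route; fails with -1 when the table is full. */
static int dev_add_route(struct offload_dev *d)
{
	if (d->installed >= d->capacity)
		return -1;
	d->installed++;
	return 0;
}

/* Remove every offloaded route from the device. */
static void dev_flush(struct offload_dev *d)
{
	d->installed = 0;
}

/* Called whenever a route is added to the software FIB. */
static void fib_entry_added(struct net_state *net, struct offload_dev *d)
{
	if (net->fib_offload_disabled)
		return; /* software forwarding only from now on */

	if (dev_add_route(d) < 0) {
		/* Any failure: undo all offloads and stop trying. */
		dev_flush(d);
		net->fib_offload_disabled = true;
	}
}
```

With a capacity of two, the third route triggers the flush and every later route stays in software only.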
    • ipv4: add net bool fib_offload_disabled · 448b128a
      Scott Feldman authored
      If something goes wrong with IPv4 FIB offload, mark entire net offload
      disabled.  This is brute force policy to basically shut down IPv4 FIB offload
      permanently if there is a problem offloading any route to an external device.
      We can refine the policy in the future, to handle failures on a per-device or
      per-route basis, but for now, this policy is per-net.
      
      What we're trying to avoid is an inconsistent split between the kernel's FIB
      and the offload device's FIB.  We don't want the device to fwd a pkt
      inconsistent with what the kernel would do.  An example of a split is if device
      has 10.0.0.0/16 and kernel has 10.0.0.0/16 and 10.0.0.0/24, the device wouldn't
      see the longest prefix 10.0.0.0/24 and potentially forward pkts incorrectly.
      
      Limited capacity or limited capability are two ways a route may fail to install
      to the offload device.  We'll not differentiate between failures at this time,
      and treat any failure as fatal and mark the net as fib_offload_disabled.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
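The split described above can be reproduced with a toy longest-prefix-match lookup. The linear-scan table, the `struct route` layout and the integer nexthop IDs are assumptions chosen for brevity; real FIBs use a trie.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry; prefix is in host byte order with the
 * host bits already cleared. */
struct route {
	uint32_t prefix;
	int len;     /* prefix length, 0..32 */
	int nexthop; /* opaque nexthop ID */
};

/* Return the nexthop of the longest matching prefix, or -1 if none. */
static int lpm_lookup(const struct route *tbl, size_t n, uint32_t dst)
{
	int best_len = -1;
	int best_nh = -1;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = tbl[i].len ?
				~(uint32_t)0 << (32 - tbl[i].len) : 0;

		if ((dst & mask) == tbl[i].prefix && tbl[i].len > best_len) {
			best_len = tbl[i].len;
			best_nh = tbl[i].nexthop;
		}
	}
	return best_nh;
}
```

If the kernel table holds 10.0.0.0/16 (nexthop 1) and 10.0.0.0/24 (nexthop 2) while the device holds only the /16, a packet to 10.0.0.5 resolves to nexthop 2 in the kernel but nexthop 1 on the device, which is exactly the inconsistency the per-net flag is meant to rule out.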
    • switchdev: implement IPv4 fib ndo wrappers · b5d6fbde
      Scott Feldman authored
      Flesh out ndo wrappers to call into device driver.  To call into device driver,
      the wrapper must iterate over route's nexthops to ensure all nexthop devs
      belong to the same switch device.  Currently, there is no support for route's
      nexthops spanning offloaded and non-offloaded devices, or spanning ports of
      multiple offload devices.
      
      Since switch device ports may be stacked under virtual interfaces (bonds and/or
      bridges), and the route's nexthop may be on the virtual interface, the wrapper
      will traverse the nexthop dev down to the base dev.  It's the base dev that's
      passed to the switchdev driver's ndo ops.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
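The two checks described above (walk each nexthop device down to its base device, then require that all base devices belong to one switch) can be sketched as follows. The model is deliberately simplified and hypothetical: each device has at most one lower device, and `switch_id == 0` means "not a switch port"; real bonds and bridges can have several lower devices.

```c
#include <stddef.h>

/* Hypothetical device model for the illustration. */
struct dev {
	struct dev *lower; /* device stacked underneath, or NULL */
	int switch_id;     /* 0 when the device is not a switch port */
};

/* Follow the stack down to the bottom-most device. */
static struct dev *base_dev(struct dev *d)
{
	while (d->lower)
		d = d->lower;
	return d;
}

/*
 * Return the base device to hand to the driver, or NULL when the
 * nexthops do not all sit on ports of the same switch.
 */
static struct dev *route_switch_dev(struct dev *const *nh, size_t n)
{
	struct dev *first = NULL;

	for (size_t i = 0; i < n; i++) {
		struct dev *b = base_dev(nh[i]);

		if (b->switch_id == 0)
			return NULL; /* nexthop on a non-offloaded device */
		if (first && b->switch_id != first->switch_id)
			return NULL; /* nexthops span two switches */
		if (!first)
			first = b;
	}
	return first;
}
```

A NULL result corresponds to the unsupported cases listed in the commit message, where the route is simply left to the kernel.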
    • switchdev: don't support custom ip rules, for now · 104616e7
      Scott Feldman authored
      Keep switchdev FIB offload model simple for now and don't allow custom ip
      rules.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • switchdev: add IPv4 fib ndo ops wrappers · 5e8d9049
      Scott Feldman authored
      Add IPv4 fib ndo wrapper funcs and stub them out for now.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • netdevice: add IPv4 fib add/del ops · 4586f1bb
      Scott Feldman authored
      Add two new ndo ops for IPv4 fib offload support, add and del.  Add uses
      modify semantics if the fib entry is already offloaded.  Drivers implementing the new
      ndo ops will return err<0 if programming the device fails, for example if the device's
      tables are full.
      Signed-off-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>