1. 09 Jun, 2016 4 commits
    • sched: remove qdisc_rehape_fail · c3a173d7
      Florian Westphal authored
      After the removal of TCA_CBQ_POLICE in the cbq scheduler, qdisc->reshape_fail
      is always NULL, i.e. qdisc_reshape_fail is now the same as qdisc_drop.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cbq: remove TCA_CBQ_POLICE support · dd47c1fa
      Florian Westphal authored
      iproute2 doesn't implement any cbq option that results in this attribute
      being sent to kernel.
      
      To make use of it, a user would have to
      
      - patch iproute2
      - add a class
      - attach a qdisc to the class (default pfifo doesn't work as
        q->handle is 0 and cbq_set_police() is a no-op in this case)
      - re-'add' the same class (tc class change ...) again
      - user must also specify a defmap (e.g. 'split 1:0 defmap 3f'), since
        this 'police' feature relies on its presence
      - the added qdisc must be one of bfifo, pfifo or netem
      
      If all of these conditions are met and _some_ leaf qdiscs, namely
      p/bfifo, netem, plug or tbf would drop a packet, kernel calls back into
      cbq, which will attempt to re-queue the skb into a different class
      as indicated by the parents' defmap entry for TC_PRIO_BESTEFFORT.
      
      [ i.e. we behave as if tc_classify returned TC_ACT_RECLASSIFY ].
      
      This feature, which isn't documented or implemented in iproute2,
      and isn't implemented consistently (most qdiscs like sfq, codel, etc
      drop right away instead of attempting this reclassification) is the
      sole reason for the reshape_fail and __parent member in Qdisc struct.
      
      So remove TCA_CBQ_POLICE support from the kernel, reject it via EOPNOTSUPP
      so userspace knows we don't support it, and then remove no-longer needed
      infrastructure in a followup commit.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cbq: remove TCA_CBQ_OVL_STRATEGY support · c3498d34
      Florian Westphal authored
      Since the initial revision of cbq in 2004, iproute2 has never implemented
      support for TCA_CBQ_OVL_STRATEGY, which is what needs to be set to
      activate the class->drop() call (the TC_CBQ_OVL_DROP strategy must be
      set by userspace).
      
      David Miller says:
         It seems really safe to kill this thing off, flag an error if someone
         tries to set the attribute, and therefore kill off all of the
         non-default cbq_ovl_*() functions.
      
      A followup commit can then remove all .drop qdisc methods since this
      removed the only caller.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6gre: Allow live link address change · 76e48f9f
      Shweta Choudaha authored
      The ip6 GRE tap device should not have to be forced into the down state
      to change the MAC address; it should allow live address change, similar
      to ipv4 gre.
      Signed-off-by: Shweta Choudaha <schoudah@brocade.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 08 Jun, 2016 29 commits
    • Merge branch 'vrf-fib-rule-improve' · 753c104b
      David S. Miller authored
      David Ahern says:
      
      ====================
      net: vrf: Improve use of FIB rules
      
      Currently, VRFs require 1 oif and 1 iif rule per address family per
      VRF. As the number of VRF devices increases it brings scalability
      issues with the increasing rule list. All of the VRF rules have the
      same format with the exception of the specific table id to direct the
      lookup. Since the table id is available from the oif or iif in the
      lookup, the VRF rules can be consolidated to a single rule that pulls
      the table from the VRF device.
      
      This solution still allows a user to insert their own rules for VRFs,
      including rules with additional attributes. Accordingly, it is backwards
      compatible with existing setups and allows other policy routing as
      desired.
      
      Hopefully v5 is the charm; my e-waste can is getting full.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: Add l3mdev rules on first device create · 1aa6c4f6
      David Ahern authored
      Add l3mdev rule per address family when the first VRF device is
      created. The rules are installed with a default preference of 1000.
      Users can replace the default rule as desired.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add l3mdev rule · 96c63fa7
      David Ahern authored
      Currently, VRFs require 1 oif and 1 iif rule per address family per
      VRF. As the number of VRF devices increases it brings scalability
      issues with the increasing rule list. All of the VRF rules have the
      same format with the exception of the specific table id to direct the
      lookup. Since the table id is available from the oif or iif in the
      lookup, the VRF rules can be consolidated to a single rule that pulls
      the table from the VRF device.
      
      This patch introduces a new rule attribute l3mdev. The l3mdev rule
      means the table id used for the lookup is pulled from the L3 master
      device (e.g., VRF) rather than being statically defined. With the
      l3mdev rule all of the basic VRF FIB rules are reduced to 1 l3mdev
      rule per address family (IPv4 and IPv6).
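
      As a rough illustration of the consolidation described above, the
      following sketch models rule matching in plain Python. It is not kernel
      code: the dict-based rules, the DEVICE_TABLE mapping and the
      lookup_table helper are hypothetical names chosen for this example.

```python
# Toy model of FIB rule matching; not kernel code.
# Hypothetical mapping from L3 master device (VRF) to its routing table id.
DEVICE_TABLE = {"vrf1": 1001, "vrf2": 1002, "vrf3": 1003}

def per_vrf_rules(devices):
    """Old scheme: one oif rule and one iif rule per VRF."""
    rules = []
    for dev, table in devices.items():
        rules.append({"pref": 1000, "oif": dev, "table": table})
        rules.append({"pref": 1000, "iif": dev, "table": table})
    return rules

def l3mdev_rules():
    """New scheme: a single rule; the table id comes from the device."""
    return [{"pref": 1000, "l3mdev": True}]

def lookup_table(rules, oif=None, iif=None):
    """Return the table id selected by the first matching rule, or None."""
    for rule in sorted(rules, key=lambda r: r["pref"]):
        if rule.get("l3mdev"):
            # Pull the table from the L3 master device given as oif or iif.
            for dev in (oif, iif):
                if dev in DEVICE_TABLE:
                    return DEVICE_TABLE[dev]
            continue
        if "oif" in rule and rule["oif"] == oif:
            return rule["table"]
        if "iif" in rule and rule["iif"] == iif:
            return rule["table"]
    return None

old = per_vrf_rules(DEVICE_TABLE)
new = l3mdev_rules()
print(len(old), len(new))
print(lookup_table(old, oif="vrf2"), lookup_table(new, oif="vrf2"))
```

      In this model the old rule list grows by two entries per VRF, while the
      l3mdev list stays at one entry and still selects the same table.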
      
      If an admin wishes to insert higher priority rules for specific VRFs
      those rules will co-exist with the l3mdev rule. This capability means
      current VRF scripts will co-exist with this new simpler implementation.
      
      Currently, the rules list for both ipv4 and ipv6 look like this:
          $ ip  ru ls
          1000:       from all oif vrf1 lookup 1001
          1000:       from all iif vrf1 lookup 1001
          1000:       from all oif vrf2 lookup 1002
          1000:       from all iif vrf2 lookup 1002
          1000:       from all oif vrf3 lookup 1003
          1000:       from all iif vrf3 lookup 1003
          1000:       from all oif vrf4 lookup 1004
          1000:       from all iif vrf4 lookup 1004
          1000:       from all oif vrf5 lookup 1005
          1000:       from all iif vrf5 lookup 1005
          1000:       from all oif vrf6 lookup 1006
          1000:       from all iif vrf6 lookup 1006
          1000:       from all oif vrf7 lookup 1007
          1000:       from all iif vrf7 lookup 1007
          1000:       from all oif vrf8 lookup 1008
          1000:       from all iif vrf8 lookup 1008
          ...
          32765:      from all lookup local
          32766:      from all lookup main
          32767:      from all lookup default
      
      With the l3mdev rule the list is just the following regardless of the
      number of VRFs:
          $ ip ru ls
          1000:       from all lookup [l3mdev table]
          32765:      from all lookup local
          32766:      from all lookup main
          32767:      from all lookup default
      
      (Note: the above pretty print of the rule is based on an iproute2
             prototype. Actual verbiage may change.)
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'tipc-small-fixes' · 6278e03d
      David S. Miller authored
      Jon Maloy says:
      
      ====================
      tipc: two small fixes
      
      We fix a couple of rarely seen anomalies discovered during testing.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: change node timer unit from jiffies to ms · 5ca509fc
      Jon Paul Maloy authored
      The node keepalive interval is recalculated at each timer expiration
      to catch any changes in the link tolerance, and stored in a field in
      struct tipc_node. We use jiffies as unit for the stored value.
      
      This is suboptimal, because it makes the calculation unnecessarily
      complex, including two unit conversions. The conversions also lead to
      a rounding error that causes the link "abort limit" to be 3 in the
      normal case, instead of 4, as intended. This again leads to unnecessary
      link resets when the network is pushed close to its limit, e.g., in an
      environment with hundreds of nodes or namespaces.
      
      In this commit, we do instead let the keepalive value be calculated and
      stored in milliseconds, so that there is only one conversion and the
      rounding error is eliminated.
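
      The rounding effect can be reproduced with a small model. This is only
      a sketch under stated assumptions, not the tipc code: it assumes
      HZ = 250, a link tolerance of 1500 ms, a keepalive interval of
      tolerance / 4 capped at 500 ms, and simplified conversion helpers that
      round the same way as described in the comments.

```python
# Simplified model of the keepalive arithmetic; not the kernel implementation.
# Assumptions for illustration: HZ = 250, link tolerance = 1500 ms,
# keepalive interval = tolerance / 4 capped at 500 ms.
HZ = 250

def msecs_to_jiffies(ms):
    # Modeled as rounding up, so a timeout is never shorter than requested.
    return -(-ms * HZ // 1000)

def jiffies_to_msecs(j):
    return j * 1000 // HZ

def abort_limit_jiffies(tolerance_ms):
    """Interval stored in jiffies: two conversions, rounding error."""
    intv_ms = min(tolerance_ms // 4, 500)
    stored = msecs_to_jiffies(intv_ms)               # first conversion
    return tolerance_ms // jiffies_to_msecs(stored)  # second conversion

def abort_limit_ms(tolerance_ms):
    """Interval stored in milliseconds: no round trip through jiffies."""
    stored = min(tolerance_ms // 4, 500)
    return tolerance_ms // stored

print(abort_limit_jiffies(1500), abort_limit_ms(1500))  # prints "3 4"
```

      With these assumed values, 375 ms becomes 94 jiffies and converts back
      to 376 ms, so 1500 / 376 truncates to 3 instead of the intended 4.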
      
      We also remove a redundant "keepalive" field in struct tipc_link. This
      is a remnant from the previous implementation.
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tipc: correct error in node fsm · c4282ca7
      Jon Paul Maloy authored
      commit 88e8ac70 ("tipc: reduce transmission rate of reset messages
      when link is down") revealed a flaw in the node FSM, as defined in
      the log of commit 66996b6c ("tipc: extend node FSM").
      
      We see the following scenario:
      1: Node B receives a RESET message from node A before its link endpoint
         is fully up, i.e., the node FSM is in state SELF_UP_PEER_COMING. This
         event will not change the node FSM state, but the (distinct) link FSM
         will move to state RESETTING.
      2: As an effect of the previous event, the local endpoint on B will
         declare node A lost, and post the event SELF_DOWN to its node
         FSM. This moves the FSM state to SELF_DOWN_PEER_LEAVING, meaning
         that no messages will be accepted from A until it receives another
         RESET message that confirms that A's endpoint has been reset. This
         is wasteful, since we know this as a fact already from the first
         received RESET, but worse is that the link instance's FSM has not
         wasted this information, but instead moved on to state ESTABLISHING,
         meaning that it repeatedly sends out ACTIVATE messages to the reset
         peer A.
      3: Node A will receive one of the ACTIVATE messages, move its link FSM
         to state ESTABLISHED, and start repeatedly sending out STATE messages
         to node B.
      4: Node B will consistently drop these messages, since it can only
         accept a RESET according to its node FSM.
      5: After four lost STATE messages node A will reset its link and start
         repeatedly sending out RESET messages to B.
      6: Because of the reduced send rate for RESET messages, it is very
         likely that A will receive an ACTIVATE (which is sent out at a much
         higher frequency) before it gets the chance to send a RESET, and A
         may hence quickly move back to state ESTABLISHED and continue sending
         out STATE messages, which will again be dropped by B.
      7: GOTO 5.
      8: After having repeated the cycle 5-7 a number of times, node A will
         by chance get in between with sending a RESET, and the situation is
         resolved.
      
      Unfortunately, we have seen that it may take a substantial amount of
      time before this vicious loop is broken, sometimes in the order of
      minutes.
      
      We correct this by making a small change to the node FSM: When a
      node in state SELF_UP_PEER_COMING receives a SELF_DOWN event, it now
      moves directly back to state SELF_DOWN_PEER_DOWN, instead of
      SELF_DOWN_PEER_LEAVING as before. This is logically consistent, since
      we don't need to wait for RESET confirmation from an endpoint that we
      already know has been reset. It also means that node B in the scenario
      above will not be dropping incoming STATE messages, and the link can
      come up immediately.
      
      Finally, a symmetry comparison reveals that the FSM has a similar
      error when receiving the event PEER_DOWN in state PEER_UP_SELF_COMING.
      Instead of moving to PEER_DOWN_SELF_LEAVING, it should move directly
      to SELF_DOWN_PEER_DOWN. Although we have never seen any negative effect
      of this logical error, we choose to fix this one, too.
      
      The node FSM looks as follows after those changes:
      
                                 +----------------------------------------+
                                 |                           PEER_DOWN_EVT|
                                 |                                        |
        +------------------------+----------------+                       |
        |SELF_DOWN_EVT           |                |                       |
        |                        |                |                       |
        |              +-----------+          +-----------+               |
        |              |NODE_      |          |NODE_      |               |
        |   +----------|FAILINGOVER|<---------|SYNCHING   |-----------+   |
        |   |SELF_     +-----------+ FAILOVER_+-----------+   PEER_   |   |
        |   |DOWN_EVT   |          A BEGIN_EVT  A         |   DOWN_EVT|   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |FAILOVER_ |FAILOVER_   |SYNCH_   |SYNCH_     |   |
        |   |           |END_EVT   |BEGIN_EVT   |BEGIN_EVT|END_EVT    |   |
        |   |           |          |            |         |           |   |
        |   |           |          |            |         |           |   |
        |   |           |         +--------------+        |           |   |
        |   |           +-------->|   SELF_UP_   |<-------+           |   |
        |   |   +-----------------|   PEER_UP    |----------------+   |   |
        |   |   |SELF_DOWN_EVT    +--------------+   PEER_DOWN_EVT|   |   |
        |   |   |                    A        A                   |   |   |
        |   |   |                    |        |                   |   |   |
        |   |   |         PEER_UP_EVT|        |SELF_UP_EVT        |   |   |
        |   |   |                    |        |                   |   |   |
        V   V   V                    |        |                   V   V   V
      +------------+       +-----------+    +-----------+       +------------+
      |SELF_DOWN_  |       |SELF_UP_   |    |PEER_UP_   |       |PEER_DOWN   |
      |PEER_LEAVING|       |PEER_COMING|    |SELF_COMING|       |SELF_LEAVING|
      +------------+       +-----------+    +-----------+       +------------+
             |               |       A        A       |                |
             |               |       |        |       |                |
             |       SELF_   |       |SELF_   |PEER_  |PEER_           |
             |       DOWN_EVT|       |UP_EVT  |UP_EVT |DOWN_EVT        |
             |               |       |        |       |                |
             |               |       |        |       |                |
             |               |    +--------------+    |                |
             |PEER_DOWN_EVT  +--->|  SELF_DOWN_  |<---+   SELF_DOWN_EVT|
             +------------------->|  PEER_DOWN   |<--------------------+
                                  +--------------+
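
      The up/down part of the diagram above can be written down as a
      transition table. The sketch below is an illustration only, not the
      tipc implementation: it omits the failover and synch states, and the
      step helper (including its choice to keep the current state for
      unlisted events) is an assumption made for this example.

```python
# Partial transition table for the node FSM, covering only the up/down
# events from the diagram (failover and synch states are left out).
TRANSITIONS = {
    ("SELF_DOWN_PEER_DOWN", "SELF_UP_EVT"): "SELF_UP_PEER_COMING",
    ("SELF_DOWN_PEER_DOWN", "PEER_UP_EVT"): "PEER_UP_SELF_COMING",
    ("SELF_UP_PEER_COMING", "PEER_UP_EVT"): "SELF_UP_PEER_UP",
    ("PEER_UP_SELF_COMING", "SELF_UP_EVT"): "SELF_UP_PEER_UP",
    # The two transitions corrected by this commit:
    ("SELF_UP_PEER_COMING", "SELF_DOWN_EVT"): "SELF_DOWN_PEER_DOWN",
    ("PEER_UP_SELF_COMING", "PEER_DOWN_EVT"): "SELF_DOWN_PEER_DOWN",
    ("SELF_UP_PEER_UP", "SELF_DOWN_EVT"): "SELF_DOWN_PEER_LEAVING",
    ("SELF_UP_PEER_UP", "PEER_DOWN_EVT"): "PEER_DOWN_SELF_LEAVING",
    ("SELF_DOWN_PEER_LEAVING", "PEER_DOWN_EVT"): "SELF_DOWN_PEER_DOWN",
    ("PEER_DOWN_SELF_LEAVING", "SELF_DOWN_EVT"): "SELF_DOWN_PEER_DOWN",
}

def step(state, event):
    """Return the next state; unlisted events keep the state (assumption)."""
    return TRANSITIONS.get((state, event), state)

# Node B in the scenario: its endpoint came up, then it posts SELF_DOWN.
state = step("SELF_DOWN_PEER_DOWN", "SELF_UP_EVT")
state = step(state, "SELF_DOWN_EVT")
print(state)
```

      In this table, node B returns to SELF_DOWN_PEER_DOWN after the
      SELF_DOWN event rather than waiting in SELF_DOWN_PEER_LEAVING.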
      Acked-by: Ying Xue <ying.xue@windriver.com>
      Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-misc-improvements' · 8fa956e3
      David S. Miller authored
      Florian Fainelli says:
      
      ====================
      net: dsa: misc improvements
      
      This patch series builds on top of Andrew's "New DSA bind, switches as devices"
      patch set and does the following:
      
      - add a few helper functions/goodies for net/dsa/dsa2.c to be as close as possible
        to net/dsa/dsa.c in terms of what drivers can expect, in particular the slave
        MDIO bus and the enabled_port_mask and phys_mii_mask

      - fix the CPU port ethtool ops to work in a multiple tree setup since we can
        no longer assume a single tree is supported
      
      - make the bcm_sf2 driver register its own MDIO bus, yet assign it to
        ds->slave_mii_bus for everything to work in net/dsa/slave.c wrt. PHY probing,
        this is a tad cleaner than what we have now
      
      Changes in v2:
      
      Most of the previous patches have been dropped to just keep the relevant ones
      now.
      
      Changes in v3:
      - split the addition of the slave MII bus as a separate patch
      - properly unwind all operations at the right place and right time (ethtool ops,
        slave MDIO bus)
      - fixed a few typos here and there
      
      Changes in v4:
      - removed superfluous dst argument to dsa_cpu_port_ethtool_{setup,restore}
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: bcm_sf2: Register our slave MDIO bus · 461cd1b0
      Florian Fainelli authored
      Register a slave MDIO bus which allows us to divert problematic
      read/writes towards conflicting pseudo-PHY address (30). No longer
      rely on DSA's slave_mii_bus, but instead provide our own implementation
      which offers more flexibility as to what to do, and when to register it.
      
      We need to register it by the time we are able to get access to our
      memory mapped registers, which is not until drv->setup() time. In order
      to avoid forward declarations, we need to re-order the function bodies a
      bit.
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Initialize CPU port ethtool ops per tree · 0c73c523
      Florian Fainelli authored
      Now that we can properly support multiple distinct trees in the system,
      the global variable dsa_cpu_port_ethtool_ops gets clobbered as soon as
      the second switch tree gets probed, and we don't want that.
      
      We need to move this to be dynamically allocated, and since we can't
      really be comparing addresses anymore to determine first time
      initialization versus any other times, just move this to dsa.c and
      dsa2.c where the remainder of the dst/ds initialization happens.
      
      The operations teardown restores the master netdev's ethtool_ops to its
      original ethtool_ops pointer (typically within the Ethernet driver).
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Add initialization helper for CPU port ethtool_ops · af42192c
      Florian Fainelli authored
      Add a helper function: dsa_cpu_port_ethtool_init() which initializes a
      custom ethtool_ops structure with custom DSA ethtool operations for CPU
      ports. This is a preliminary change to move the initialization outside
      of net/dsa/slave.c.
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Provide a slave MII bus if needed · 1eb59443
      Florian Fainelli authored
      Mimic what net/dsa/dsa.c does and provide a slave MII bus by default
      which will be created if the driver implements a phy_read method.
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Initialize ds->enabled_port_mask and ds->phys_mii_mask · 6e830d8f
      Florian Fainelli authored
      Some drivers rely on these two bitmasks to contain the correct values
      for them to successfully probe and initialize at drv->setup() time,
      calculate correct values to put in both masks as early as possible in
      dsa_get_ports_dn().
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: Provide unique DSA slave MII bus names · 0b7b498d
      Florian Fainelli authored
      In case we have multiple trees and switches with the same index, we
      need to add another discriminating id: the switch tree.
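
      A small sketch of the naming clash this avoids. The exact id layout
      used here ("dsa-<tree>.<index>") and both helper names are assumptions
      for illustration, not taken from the patch.

```python
# Sketch of why the tree number is needed in the bus id.
# The "dsa-<tree>.<index>" layout is an assumption for illustration.
def bus_id_index_only(tree, index):
    # Ignores the tree, so two trees with switch index 0 collide.
    return "dsa-%d" % index

def bus_id_tree_and_index(tree, index):
    return "dsa-%d.%d" % (tree, index)

switches = [(0, 0), (1, 0)]  # (tree, switch index): same index, two trees
ids_old = {bus_id_index_only(t, i) for t, i in switches}
ids_new = {bus_id_tree_and_index(t, i) for t, i in switches}
print(len(ids_old), len(ids_new))  # prints "1 2"
```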
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: sched: fix missing doc annotations · 123b3652
      Eric Dumazet authored
      "make htmldocs" complains otherwise:
      
      .//net/core/gen_stats.c:168: warning: No description found for parameter 'running'
      .//include/linux/netdevice.h:1867: warning: No description found for parameter 'qdisc_running_key'
      
      Fixes: f9eb8aea ("net_sched: transform qdisc running bit into a seqcount")
      Fixes: edb09eb1 ("net: sched: do not acquire qdisc spinlock in qdisc/class stats dump")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Reduce queue allocation to one in kdump kernel · 40e4e713
      Hariprasad Shenai authored
      When in kdump kernel, reduce memory usage by only using a single Queue
      Set for multiqueue devices. So make netif_get_num_default_rss_queues()
      return one, when in kdump kernel.
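
      The intended behaviour can be modeled as below. This is a sketch, not
      the kernel function: the cap of 8 queues and the explicit
      in_kdump_kernel argument are assumptions made for the example.

```python
# Model of the default RSS queue count; not the kernel function itself.
# The cap of 8 and the boolean argument are assumptions for illustration.
DEFAULT_MAX_NUM_RSS_QUEUES = 8

def num_default_rss_queues(online_cpus, in_kdump_kernel):
    """Return a single queue in a kdump kernel, else a capped CPU count."""
    if in_kdump_kernel:
        return 1
    return min(DEFAULT_MAX_NUM_RSS_QUEUES, online_cpus)

print(num_default_rss_queues(16, False), num_default_rss_queues(16, True))
```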
      Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'qed-dcbnl' · df0437e1
      David S. Miller authored
      Sudarsana Reddy Kalluru says:
      
      ====================
      qed/qede support for dcbnl.
      
      This series adds the dcbnl functionality to the driver. Patch (1) adds
      the qed infrastructure for querying/configuring the dcbx parameters.
      Patch (2) adds the qed infrastructure for dcbnl APIs. And patch (3)
      adds the qede support for dcbnl.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qede: Add dcbnl support. · 489e45ae
      Sudarsana Reddy Kalluru authored
      This patch adds the interfaces for ieee/cee dcbnl callbacks and registers
      them with the kernel.
      Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Add dcbnl support. · a1d8d8a5
      Sudarsana Reddy Kalluru authored
      This patch adds the implementation for both cee/ieee dcbnl callbacks by
      using the qed query/config APIs.
      Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: Add support for query/config dcbx. · 6ad8c632
      Sudarsana Reddy Kalluru authored
      The query API reads the dcbx data from the device shared memory and returns
      it to the caller. The config API configures the user-provided dcbx values on
      the device, and initiates the dcbx negotiation with the peer.
      Signed-off-by: Sudarsana Reddy Kalluru <sudarsana.kalluru@qlogic.com>
      Signed-off-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fsl/qe: Do not prefix header guard with CONFIG_ · 6f23d96c
      Andreas Ziegler authored
      The CONFIG_ prefix should only be used for options which
      can be configured through Kconfig and not for guarding headers.
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net/fsl_ucc: Do not prefix header guard with CONFIG_ · c5739767
      Andreas Ziegler authored
      The CONFIG_ prefix should only be used for options which
      can be configured through Kconfig and not for guarding headers.
      Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ila: Perform only one translation in forwarding path · 707a2ca4
      Tom Herbert authored
      When setting up ILA in a router we noticed that the encapsulation
      is invoked twice: once in the route input path and again upon route
      output. To resolve this we add a flag set_csum_neutral for the
      ila_update_ipv6_locator. If this flag is set and the checksum
      neutral bit is also set we assume that checksum-neutral translation
      has already been performed and take no further action. The
      flag is set only in ila_output path. The flag is not set for ila_input and
      ila_xlat.
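
      The guard described above can be sketched as follows. This is a toy
      model, not the ILA code: the dict-based packet, the update_locator
      name, and the assumption that a translation marks the packet as
      checksum-neutral are all hypothetical and chosen only to show the
      control flow.

```python
# Toy model of the guard; packet layout and function name are hypothetical.
def update_locator(pkt, new_locator, set_csum_neutral):
    """Translate pkt's locator unless it was already translated."""
    if set_csum_neutral and pkt["csum_neutral"]:
        # Output path and the bit is already set: translation was done
        # on input, so take no further action.
        return False
    pkt["locator"] = new_locator
    # Assumption of this sketch: translating marks the packet as neutral.
    pkt["csum_neutral"] = True
    return True

pkt = {"locator": "sir", "csum_neutral": False}
on_input = update_locator(pkt, "loc-a", set_csum_neutral=False)
on_output = update_locator(pkt, "loc-b", set_csum_neutral=True)
print(on_input, on_output, pkt["locator"])
```

      In this model a forwarded packet is translated once on input and left
      alone on output, which is the behaviour the commit aims for.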
      
      Tested:
      
      Used 3 netns to emulate a router and two hosts. The router
      translates SIR addresses between the two destinations in other two netns.
      Verified ping and netperf are functional.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: accept RST if SEQ matches right edge of right-most SACK block · e00431bc
      Pau Espin Pedrol authored
      RFC 5961 advises to only accept RST packets containing a seq number
      matching the next expected seq number instead of the whole receive
      window in order to avoid spoofing attacks.
      
      However, this situation is not optimal in the case SACK is in use at the
      time the RST is sent. I recently ran into a scenario in which packet
      losses were high while uploading data to a server, and userspace was
      willing to frequently terminate connections by sending a RST. In
      this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting
      for a lost packet retransmission and SACK blocks are used to let the
      client continue uploading data. At some point later on, the client sends
      the RST (snd_nxt), which matches the next expected seq number of the
      right-most SACK block on the receiver side which is going forward
      receiving data.
      
      In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the
      frozen main ACK at receiver side and thus gets dropped and a challenge
      ACK is sent, which usually gets lost due to network conditions. The main
      consequence is that the connection stays alive for a while even if it
      made sense to accept the RST. This can get really bad if lots of
      connections like this one are created in a few seconds, easily exhausting
      the resources of the server.
      
      For security reasons, not all SACK blocks are checked (there could be a
      big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it
      wouldn't make sense to check for RST in blocks other than the right-most
      received one because the sender is not expected to be sending new data
      after the RST. For simplicity, only up to the 4 most recently updated
      SACK blocks (selective_acks[4] field) are compared to find the
      right-most block, as usually those are the ones with bigger probability
      to contain it.
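
      The acceptance rule described above can be sketched like this. It is
      an illustration under assumptions, not the kernel code: the function
      names and the list-of-tuples representation of SACK blocks are made up
      for the example; only the rule itself (match rcv_nxt, or the right
      edge of the right-most of up to 4 SACK blocks) follows the text.

```python
# Sketch of the RST acceptance check; not the kernel implementation.
def after(seq1, seq2):
    """True if seq1 is later than seq2 in 32-bit sequence space."""
    return ((seq2 - seq1) & 0xFFFFFFFF) >= 0x80000000

def rst_seq_acceptable(seq, rcv_nxt, sack_blocks):
    """sack_blocks: list of (start_seq, end_seq); at most 4 are examined."""
    if seq == rcv_nxt:
        return True
    blocks = sack_blocks[:4]
    if not blocks:
        return False
    # Find the right-most right edge among the examined blocks.
    max_sack = blocks[0][1]
    for _start, end in blocks[1:]:
        if after(end, max_sack):
            max_sack = end
    return seq == max_sack

# rcv_nxt is stuck at 1000 waiting for a retransmission; data up to 3000
# has been selectively acknowledged.
blocks = [(1500, 2000), (2500, 3000)]
print(rst_seq_acceptable(3000, 1000, blocks))
print(rst_seq_acceptable(2000, 1000, blocks))
```

      Only the right edge of the right-most block is accepted, so a RST that
      matches the edge of an older block is still rejected.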
      
      This patch was tested in a 3.18 kernel and proved to improve the
      situation in the scenario described above.
      Signed-off-by: Pau Espin Pedrol <pau.espin@tessares.net>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • qed: potential overflow in qed_cxt_src_t2_alloc() · 01e517f1
      Dan Carpenter authored
      In the current code "ent_per_page" could be more than "conn_num" making
      "conn_num" negative after the subtraction.  In the next iteration
      through the loop then the negative is treated as a very high positive
      meaning we don't put a limit on "ent_num".  It could lead to memory
      corruption.
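
      The wrap-around can be reproduced with emulated 32-bit unsigned
      arithmetic. The sketch below is not the driver code: the variable
      names follow the commit message, while the loop shape, the
      pages_filled helper and the "fixed" variant (subtracting the entries
      actually used) are assumptions made for illustration.

```python
# Emulation of 32-bit unsigned arithmetic; variable names follow the commit
# message, the loop shape and the "fixed" variant are assumptions.
U32 = 0xFFFFFFFF

def pages_filled(conn_num, ent_per_page, num_pages, fixed):
    """Return the ent_num used for each page."""
    used = []
    for _ in range(num_pages):
        ent_num = min(conn_num, ent_per_page)
        used.append(ent_num)
        step = ent_num if fixed else ent_per_page
        conn_num = (conn_num - step) & U32  # wraps like a u32
    return used

print(pages_filled(100, 64, 3, fixed=False))  # [64, 36, 64]
print(pages_filled(100, 64, 3, fixed=True))   # [64, 36, 0]
```

      In the unfixed variant the third page is filled with 64 entries even
      though all 100 connections were already placed, because the remaining
      count wrapped to a very large value.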
      
      Fixes: dbb799c3 ('qed: Initialize hardware for new protocols')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'vrf-local' · f02ea215
      David S. Miller authored
      David Ahern says:
      
      ====================
      net: vrf: Add support for local traffic to local addresses
      
      Add support for locally originated traffic to VRF-local addresses,
      be it addresses on enslaved devices or addresses on the VRF device:
      
      $ ip addr show dev red
      33: red: <NOARP,MASTER,UP,LOWER_UP> mtu 65536 qdisc pfifo_fast state UP group default qlen 1000
          link/ether be:00:53:b5:e4:25 brd ff:ff:ff:ff:ff:ff
          inet 1.1.1.1/32 scope global red
             valid_lft forever preferred_lft forever
          inet6 1111:1::1/128 scope global
             valid_lft forever preferred_lft forever
      
      $ ip addr show dev eth1
      3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
          link/ether 02:e0:f9:79:34:bd brd ff:ff:ff:ff:ff:ff
          inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
             valid_lft forever preferred_lft forever
          inet6 2100:1::1/120 scope global
             valid_lft forever preferred_lft forever
          inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
             valid_lft forever preferred_lft forever
      
      $ ping -c1 -I red 10.100.1.1
      ping: Warning: source address might be selected on device other than red.
      PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
      64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms
      
      $ ping -c1 -I red 1.1.1.1
      PING 1.1.1.1 (1.1.1.1) from 1.1.1.1 red: 56(84) bytes of data.
      64 bytes from 1.1.1.1: icmp_seq=1 ttl=64 time=0.136 ms
      
      --- 1.1.1.1 ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 0.136/0.136/0.136/0.000 ms
      
      $ ping6 -c1 -I red  2100:1::1
      ping6: Warning: source address might be selected on device other than red.
      PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
      64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.167 ms
      
      --- 2100:1::1 ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 0.167/0.167/0.167/0.000 ms
      
      $ ping6 -c1 -I red 1111::1
      PING 1111::1(1111::1) from 1111:1::1 red: 56 data bytes
      64 bytes from 1111::1: icmp_seq=1 ttl=64 time=0.187 ms
      
      --- 1111::1 ping statistics ---
      1 packets transmitted, 1 received, 0% packet loss, time 0ms
      rtt min/avg/max/mdev = 0.187/0.187/0.187/0.000 ms
      
      This change also enables use of loopback address on the VRF device:
      $ ip addr add dev red 127.0.0.1/8
      
      $ ping -c1 -I red 127.0.0.1
      PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
      64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f02ea215
    • David Ahern's avatar
      net: vrf: ipv6 support for local traffic to local addresses · b4869aa2
      David Ahern authored
      Add support for locally originated traffic to VRF-local IPv6 addresses.
      Similar to IPv4, a local dst is set on the skb and the packet is
      reinserted with a call to netif_rx. With this patch, ping, tcp and udp
      packets to a local IPv6 address are successfully routed:
      
          $ ip addr show dev eth1
          4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
              link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
              inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
                 valid_lft forever preferred_lft forever
              inet6 2100:1::1/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
                 valid_lft forever preferred_lft forever
      
          $ ping6 -c1 -I red 2100:1::1
          ping6: Warning: source address might be selected on device other than red.
          PING 2100:1::1(2100:1::1) from 2100:1::1 red: 56 data bytes
          64 bytes from 2100:1::1: icmp_seq=1 ttl=64 time=0.098 ms
      
      ip6_input is exported so the VRF driver can use it for the dst input
      function. The dst_alloc function for IPv4 defaults to setting the input and
      output functions; IPv6's does not. VRF does not need to duplicate the Rx path
      so just export the ipv6 input function.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b4869aa2
    • David Ahern's avatar
      net: vrf: ipv4 support for local traffic to local addresses · afe80a49
      David Ahern authored
      Add support for locally originated traffic to VRF-local addresses. If
      the destination device for an skb is the loopback or VRF device, then set
      its dst to a local version of the VRF cached dst_entry and call netif_rx
      to insert the packet onto the rx queue, similar to what is done for
      loopback. This patch handles IPv4 support; a follow-on patch handles IPv6.
      
      With this patch, ping, tcp and udp packets to a local IPv4 address are
      successfully routed:
      
          $ ip addr show dev eth1
          4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000
              link/ether 02:e0:f9:1c:b9:74 brd ff:ff:ff:ff:ff:ff
              inet 10.100.1.1/24 brd 10.100.1.255 scope global eth1
                 valid_lft forever preferred_lft forever
              inet6 2100:1::1/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe1c:b974/64 scope link
                 valid_lft forever preferred_lft forever
      
          $ ping -c1 -I red 10.100.1.1
          ping: Warning: source address might be selected on device other than red.
          PING 10.100.1.1 (10.100.1.1) from 10.100.1.1 red: 56(84) bytes of data.
          64 bytes from 10.100.1.1: icmp_seq=1 ttl=64 time=0.057 ms
      
      This patch also enables use of IPv4 loopback address on the VRF device:
          $ ip addr add dev red 127.0.0.1/8
      
          $ ping -c1 -I red 127.0.0.1
          PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
          64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.058 ms
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      afe80a49
    • David Ahern's avatar
      net: vrf: Minor refactoring for local address patches · 911a66fb
      David Ahern authored
      Move the stripping of the ethernet header from is_ip_tx_frame into the
      ipv4 and ipv6 outbound functions and collapse vrf_send_v4_prep into
      vrf_process_v4_outbound.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      911a66fb
    • Tom Herbert's avatar
      gue: Implement direction IP encapsulation · c1e48af7
      Tom Herbert authored
      This patch implements direct encapsulation of IPv4 and IPv6 packets
      in UDP. This is done as version "1" of GUE and as explained in I-D
      draft-ietf-nvo3-gue-03.
      
      Changes here are only in the receive path; fou with IPxIPx already
      supports the transmit side. Both the normal receive path and
      GRO path are modified to check for GUE version and check for
      IP version in the case that GUE version is "1".
      
      Tested:
      
      IPIP with direct GUE encap
        1 TCP_STREAM
          4530 Mbps
        200 TCP_RR
          1297625 tps
          135/232/444 90/95/99% latencies
      
      IP4IP6 with direct GUE encap
        1 TCP_STREAM
          4903 Mbps
        200 TCP_RR
          1184481 tps
          149/253/473 90/95/99% latencies
      
      IP6IP6 direct GUE encap
        1 TCP_STREAM
          5146 Mbps
        200 TCP_RR
          1202879 tps
          146/251/472 90/95/99% latencies
      
      SIT with direct GUE encap
        1 TCP_STREAM
          6111 Mbps
        200 TCP_RR
          1250337 tps
          139/241/467 90/95/99% latencies
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1e48af7
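      The receive-side check described in the gue commit above (GUE version first,
      then IP version when the GUE version is "1") can be illustrated with a small
      classifier over the first payload byte. This is a sketch under stated
      assumptions, not the kernel code: the enum and function names are
      hypothetical, and it assumes the GUE version occupies the top two bits of
      the first byte, so that the IPv4 and IPv6 version nibbles (0100 and 0110)
      both read as GUE version 1.

```c
#include <stdint.h>

/* Hypothetical result codes for this illustration. */
enum gue_payload {
    GUE_V0_HEADER, /* regular GUE header follows */
    GUE_V1_IPV4,   /* directly encapsulated IPv4 packet */
    GUE_V1_IPV6,   /* directly encapsulated IPv6 packet */
    GUE_DROP       /* unknown version or unexpected IP version */
};

/* Classify a UDP payload by its first byte only. */
enum gue_payload gue_classify(uint8_t first_byte)
{
    switch (first_byte >> 6) {       /* assumed 2-bit GUE version */
    case 0:
        return GUE_V0_HEADER;
    case 1:
        switch (first_byte >> 4) {   /* IP version nibble */
        case 4:
            return GUE_V1_IPV4;
        case 6:
            return GUE_V1_IPV6;
        default:
            return GUE_DROP;
        }
    default:
        return GUE_DROP;
    }
}
```

      For instance, a first byte of 0x45 (a typical IPv4 header start) classifies
      as direct IPv4, and 0x60 as direct IPv6.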
  3. 07 Jun, 2016 7 commits
    • David S. Miller's avatar
      Merge branch 'net-sched-fast-stats' · 34fe76ab
      David S. Miller authored
      Eric Dumazet says:
      
      ====================
      net: sched: faster stats gathering
      
      A while back, I sent one RFC patch using lockless stats gathering
      on 64bit arches.
      
      This patch series does it more cleanly, using a seqcount.
      
      Since qdisc/class stats are written at dequeue() time,
      we can ask the dequeue to change the seqcount, so that
      stats readers can avoid taking the root qdisc lock,
      and instead use the typical read_seqcount_{begin|retry} guarded
      loop.
      
      This does not change fast path costs, as the seqcount
      increments are not more expensive than the bit manipulation,
      and allows readers to not freeze the fast path anymore.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34fe76ab
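      The begin/retry read loop mentioned in the 'net-sched-fast-stats' cover
      letter above follows the general sequence-counter pattern. Below is a
      minimal userspace sketch of that pattern using C11 atomics. It is not the
      kernel's seqcount implementation: the struct and function names are
      hypothetical, and a single writer is assumed (in the series that role is
      played by the dequeue path).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stats block guarded by a sequence counter. */
struct qstats {
    atomic_uint seq;          /* odd while an update is in progress */
    _Atomic uint64_t bytes;
    _Atomic uint64_t packets;
};

/* Writer side: assumes only one writer runs at a time. */
void qstats_add(struct qstats *s, uint64_t len)
{
    unsigned int seq = atomic_load_explicit(&s->seq, memory_order_relaxed);
    uint64_t b = atomic_load_explicit(&s->bytes, memory_order_relaxed);
    uint64_t p = atomic_load_explicit(&s->packets, memory_order_relaxed);

    /* Mark the update as started (sequence becomes odd). */
    atomic_store_explicit(&s->seq, seq + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&s->bytes, b + len, memory_order_relaxed);
    atomic_store_explicit(&s->packets, p + 1, memory_order_relaxed);

    /* Mark the update as finished (sequence becomes even again). */
    atomic_store_explicit(&s->seq, seq + 2, memory_order_release);
}

/* Reader side: takes no lock, retries if a writer interfered. */
void qstats_read(struct qstats *s, uint64_t *bytes, uint64_t *packets)
{
    unsigned int begin, end;

    do {
        begin = atomic_load_explicit(&s->seq, memory_order_acquire);
        *bytes = atomic_load_explicit(&s->bytes, memory_order_relaxed);
        *packets = atomic_load_explicit(&s->packets, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        end = atomic_load_explicit(&s->seq, memory_order_relaxed);
    } while ((begin & 1u) || begin != end);
}
```

      The reader never blocks the writer; it only repeats its own loads when the
      sequence number was odd or changed during the read.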
    • Eric Dumazet's avatar
      net: sched: do not acquire qdisc spinlock in qdisc/class stats dump · edb09eb1
      Eric Dumazet authored
      Large tc dumps (tc -s {qdisc|class} sh dev ethX) done by Google BwE host
      agent [1] are problematic at scale:
      
      For each qdisc/class found in the dump, we currently lock the root qdisc
      spinlock in order to get stats. Sampling stats every 5 seconds from
      thousands of HTB classes is a challenge when the root qdisc spinlock is
      under high pressure. Not only do the dumps take time, they also slow
      down the fast path (queue/dequeue packets) by 10 % to 20 % in some cases.
      
      An audit of existing qdiscs showed that sch_fq_codel is the only qdisc
      that might need the qdisc lock in fq_codel_dump_stats() and
      fq_codel_dump_class_stats()
      
      In v2 of this patch, I now use the Qdisc running seqcount to provide
      consistent reads of packets/bytes counters, regardless of 32/64 bit arches.
      
      I also changed rate estimators to use the same infrastructure
      so that they no longer need to lock root qdisc lock.
      
      [1]
      http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43838.pdf
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Kevin Athey <kda@google.com>
      Cc: Xiaotian Pei <xiaotian@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      edb09eb1
    • Eric Dumazet's avatar
      net_sched: transform qdisc running bit into a seqcount · f9eb8aea
      Eric Dumazet authored
      Instead of using a single bit (__QDISC___STATE_RUNNING)
      in sch->__state, use a seqcount.
      
      This adds lockdep support, but more importantly it will allow us
      to sample qdisc/class statistics without having to grab qdisc root lock.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f9eb8aea
    • David S. Miller's avatar
      Merge branch 'be2net-noncrit-fixes' · 64151ae3
      David S. Miller authored
      Sathya Perla says:
      
      ====================
      be2net: patch set
      
      Hi David, the following patch set contains three non-critical fixes that
      can go into the net-next tree.
      
      Patch 1 fixes the logic for provisioning queue pairs on VFs to take into
      account the limit on number of TXQs too as in some profiles the number
      of TXQs is less than that of RXQs.
      
      Patch 2 enables WoL support from shutdown on Skyhawk.
      
      Patch 3 enhances the logic for provisioning queue pairs on VFs on
      SR-IOV over multi-partition configs. Each PF (partition) on a port has to
      compute the number of RSS tables its VFs can use.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      64151ae3
    • Somnath Kotur's avatar
      be2net: Fix provisioning of RSS for VFs in multi-partition configurations · de2b1e03
      Somnath Kotur authored
      Currently, we do not distribute queue resources to enable RSS for VFs
      in multi-channel/partition configurations.
      Fix this by having each PF (SR-IOV capable) calculate its share of the
      15 RSS Policy Tables available per port before provisioning resources for
      all the VFs.
      This proportional share calculation is based on dividing the
      PF's MAX VFs by the Total MAX VFs on that port. It also needs to
      learn the number of NIC PFs on the port and subtract that from
      the 15 RSS Policy Tables on the port.
      Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com>
      Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de2b1e03
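      The proportional share described in the be2net RSS commit above can be
      written out as simple integer arithmetic. This is only a sketch of the
      description, not the driver's code: the function name, the parameter names,
      the round-down behaviour and the zero guards are all assumptions; only the
      figure of 15 RSS Policy Tables per port comes from the commit message.

```c
#include <stdint.h>

/* From the commit message: RSS Policy Tables available per port. */
#define RSS_TABLES_PER_PORT 15u

/*
 * Hypothetical helper: number of RSS tables one PF may hand to its VFs.
 * Tables left after the NIC PFs on the port are split in proportion to
 * this PF's maximum VF count versus the port-wide maximum VF count.
 */
uint32_t pf_rss_table_share(uint32_t nic_pfs, uint32_t pf_max_vfs,
                            uint32_t total_max_vfs)
{
    uint32_t for_vfs;

    /* Assumed guards: nothing to share, or nothing to divide by. */
    if (total_max_vfs == 0 || nic_pfs >= RSS_TABLES_PER_PORT)
        return 0;

    for_vfs = RSS_TABLES_PER_PORT - nic_pfs;

    /* Integer division rounds down (an assumption of this sketch). */
    return (uint32_t)(((uint64_t)for_vfs * pf_max_vfs) / total_max_vfs);
}
```

      For example, with 4 NIC PFs on the port and a PF owning 16 of the port's 32
      maximum VFs, 11 tables remain for VFs and this PF's share is 5.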
    • Sriharsha Basavapatna's avatar
      be2net: Enable Wake-On-LAN from shutdown for Skyhawk · 45f13df7
      Sriharsha Basavapatna authored
      Skyhawk does support wake-up from the ACPI shutdown state (S5), provided the
      platform supports it (for example, via an auxiliary power source). The
      changes listed below are done to fix this.
      
      1) There's no need to defer the HW configuration of WOL to be_suspend().
      Remove this in be_suspend() and move it to be_set_wol() ethtool function
      so it is configured directly in the context of ethtool. This automatically
      takes care of the shutdown case.
      
      2) The driver incorrectly uses WOL_CAP field in the FW response to
      get_acpi_wol_cap() command, to determine if WOL is enabled. Instead the
      driver must rely on the macaddr field in the response to infer WOL state.
      
      3) In be_get_config() during init, if we find that WOL is enabled in FW,
      call pci_enable_wake() to enable pmcsr.pme_en bit. This is needed to
      support persistent WOL configuration provided by the FW in some platforms.
      
      4) Remove code in be_set_wol() that writes to PCICFG_PM_CONTROL_OFFSET
      to set pme_en bit; pci_enable_wake() sets that.
      
      Fixes: 028991e4 ("Enabling Wake-on-LAN is not supported in S5 state")
      Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
      Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      45f13df7
    • Suresh Reddy's avatar
      be2net: use max-TXQs limit too while provisioning VF queue pairs · b9263cbf
      Suresh Reddy authored
      When the PF driver provisions resources for VFs, it currently only looks
      at max RSS queues available to calculate the number of VF queue pairs.
      This logic breaks when there are fewer TX-queues than RSS-queues.
      This patch fixes this problem by using the max-TXQs available in the
      PF-pool in the calculations. As a part of this change the
      be_calculate_vf_qs() routine is renamed as be_calculate_vf_res() and the
      code that calculates limits on other related resources is moved here to
      contain all resource calculation code inside one routine.
      Signed-off-by: Suresh Reddy <suresh.reddy@broadcom.com>
      Signed-off-by: Sathya Perla <sathya.perla@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b9263cbf