1. 12 Mar, 2021 21 commits
    • Merge branch 'nexthop-Resilient-next-hop-groups' · 2a0186a3
      David S. Miller authored
      Petr Machata says:
      
      ====================
      nexthop: Resilient next-hop groups
      
      At this moment, there is only one type of next-hop group: an mpath group.
      Mpath groups implement the hash-threshold algorithm, described in RFC
      2992[1].
      
      To select a next hop, the hash-threshold algorithm first assigns a range of
      hashes to each next hop in the group, and then selects the next hop by
      comparing the SKB hash with the individual ranges. When a next hop is
      removed from the group, the ranges are recomputed, which leads to
      reassignment of parts of hash space from one next hop to another. RFC 2992
      illustrates it thus:
      
                   +-------+-------+-------+-------+-------+
                   |   1   |   2   |   3   |   4   |   5   |
                   +-------+-+-----+---+---+-----+-+-------+
                   |    1    |    2    |    4    |    5    |
                   +---------+---------+---------+---------+
      
                    Before and after deletion of next hop 3
                    under the hash-threshold algorithm.
      
      Note how next hop 2 gave up part of the hash space in favor of next hop 1,
      and 4 in favor of 5. While there will usually be some overlap between the
      previous and the new distribution, some traffic flows change the next hop
      that they resolve to.
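
The reassignment can be reproduced with a minimal sketch. It assumes equal weights and a toy hash space of 100 values; `hash_threshold_select` is a hypothetical helper written for this illustration, not the kernel's mpath code.

```python
def hash_threshold_select(next_hops, skb_hash, hash_space=100):
    """Pick a next hop by splitting [0, hash_space) into equal contiguous ranges."""
    n = len(next_hops)
    # Range i covers hashes [i * hash_space / n, (i + 1) * hash_space / n).
    index = skb_hash * n // hash_space
    return next_hops[index]


before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]  # next hop 3 deleted

# Hashes whose next hop differs between the two distributions.
moved = [h for h in range(100)
         if hash_threshold_select(before, h) != hash_threshold_select(after, h)]
# Of those, hashes that were NOT on next hop 3 moved only because the
# ranges were recomputed.
collateral = [h for h in moved if hash_threshold_select(before, h) != 3]

print(len(moved), len(collateral))
```

In this toy setup, the flows that were on next hop 3 have to move in any case; the remaining moved hashes belong to flows on next hops 2 and 4 that were disturbed only by the range recomputation.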
      
      If a multipath group is used for load-balancing between multiple servers,
      this hash space reassignment causes an issue that packets from a single
      flow suddenly end up arriving at a server that does not expect them, which
      may lead to TCP reset.
      
      If a multipath group is used for load-balancing among available paths to
      the same server, the issue is that different latencies and reordering along
      the way cause the packets to arrive in the wrong order.
      
      Resilient hashing is a technique to address the above problem. A resilient
      next-hop group has another layer of indirection between the group itself
      and its constituent next hops: a hash table. The selection algorithm uses a
      straightforward modulo operation on the SKB hash to choose a hash table
      bucket, then reads the next hop that this bucket contains, and forwards
      traffic there.
      
      This indirection brings an important feature. In the hash-threshold
      algorithm, the range of hashes associated with a next hop must be
      contiguous. With a hash table, mapping between the hash table buckets and
      the individual next hops is arbitrary. Therefore when a next hop is deleted
      the buckets that held it are simply reassigned to other next hops:
      
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                                    v v v v
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
                   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      
                    Before and after deletion of next hop 3
                    under the resilient hashing algorithm.
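
The figure can be replayed with a minimal sketch of the bucket table. The round-robin choice of replacement next hops is an assumption made for this illustration, as are the helper names `select` and `delete_next_hop`; the real code decides replacements based on which next hops are underweight.

```python
def select(table, skb_hash):
    """Modulo picks a bucket; the bucket names the next hop."""
    return table[skb_hash % len(table)]


def delete_next_hop(table, removed, survivors):
    """Reassign only the buckets that held `removed`; all others stay as-is."""
    new_table = list(table)
    i = 0
    for bucket, nh in enumerate(table):
        if nh == removed:
            # Assumption: hand the freed buckets out round-robin.
            new_table[bucket] = survivors[i % len(survivors)]
            i += 1
    return new_table


table = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
new_table = delete_next_hop(table, removed=3, survivors=[1, 2, 4, 5])
changed = [b for b in range(len(table)) if table[b] != new_table[b]]

print(new_table)
print(changed)
```

Only the four buckets that pointed at next hop 3 change, so any hash that resolved to another next hop before the deletion still resolves to the same one afterwards.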
      
      When weights of next hops in a group are altered, it may be possible to
      choose a subset of buckets that are currently not used for forwarding
      traffic, and use those to satisfy the new next-hop distribution demands,
      keeping the "busy" buckets intact. This way, established flows ideally
      continue to be forwarded to the same endpoints through the same paths as
      before the next-hop group change.
      
      This patch set adds the implementation of resilient next-hop groups.
      
      In a nutshell, the algorithm works as follows. Each next hop has a number
      of buckets that it wants to have, according to its weight and the number of
      buckets in the hash table. In case of an event that might cause bucket
      allocation change, the numbers for individual next hops are updated,
      similarly to how ranges are updated for mpath group next hops. Following
      that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
      next hop that is currently occupying more buckets than it wants (it is
      "overweight"), it migrates the buckets to one of the next hops that has
      fewer buckets than it wants (it is "underweight"). If, after this, there
      are still underweight next hops, another upkeep run is scheduled for a
      future time.
      
      Chances are there are not enough "idle" buckets to satisfy the new demands.
      The algorithm has knobs to select both what it means for a bucket to be
      idle, and whether and when to forcefully migrate buckets if the number of
      idle ones remains insufficient.
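
A single upkeep pass might look roughly like the sketch below. It is an illustration under stated assumptions, not the kernel implementation: the `upkeep` and `make_table` helpers, the dictionary layout, the `force` flag standing in for an expired unbalanced timer, and the choice to treat a migrated bucket as freshly used are all hypothetical.

```python
def upkeep(buckets, wants, now, idle_timer, force=False):
    """Run one upkeep pass; return True if every next hop has what it wants.

    buckets: list of {"nh": id, "last_used": time}; wants: {nh id: bucket count}.
    force=True models an expired unbalanced timer: busy buckets move too.
    """
    have = {nh: 0 for nh in wants}
    for b in buckets:
        have[b["nh"]] = have.get(b["nh"], 0) + 1

    for b in buckets:
        underweight = [nh for nh in wants if have[nh] < wants[nh]]
        if not underweight:
            break
        overweight = have[b["nh"]] > wants.get(b["nh"], 0)
        idle = now - b["last_used"] >= idle_timer
        if overweight and (idle or force):
            have[b["nh"]] -= 1
            b["nh"] = underweight[0]
            b["last_used"] = now  # assumption: a migrated bucket starts out fresh
            have[b["nh"]] += 1

    return all(have[nh] == wants[nh] for nh in wants)


def make_table(busy_indexes):
    # Next hop 1 holds buckets 0-3, next hop 2 holds buckets 4-7.
    return [{"nh": 1 if i < 4 else 2, "last_used": 90 if i in busy_indexes else 0}
            for i in range(8)]


wants = {1: 6, 2: 2}  # assumed outcome of a weight change

table = make_table(busy_indexes={4, 5})
balanced = upkeep(table, wants, now=100, idle_timer=60)

stuck = make_table(busy_indexes={4, 5, 6, 7})
first_try = upkeep(stuck, wants, now=100, idle_timer=60)            # nothing idle
forced = upkeep(stuck, wants, now=100, idle_timer=60, force=True)   # timer expired

print(balanced, first_try, forced)
```

When the pass returns False, the caller would schedule another pass later; passing `force=True` corresponds to the point where the table has stayed unbalanced for too long.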
      
      To illustrate the usage, consider the following commands:
      
       # ip nexthop add id 1 via 192.0.2.2 dev dummy1
       # ip nexthop add id 2 via 192.0.2.3 dev dummy1
       # ip nexthop add id 10 group 1/2 type resilient \
      	buckets 8 idle_timer 60 unbalanced_timer 300
      
      The last command creates a resilient next-hop group. It will have 8
      buckets, each bucket will be considered idle when no traffic hits it for at
      least 60 seconds, and if the table remains out of balance for 300 seconds,
      it will be forcefully brought into balance.
      
      If not present in the netlink message, the idle timer defaults to 120
      seconds, and there is no unbalanced timer, meaning the group may remain
      unbalanced indefinitely. The value of 120 is the default in the Cumulus
      implementation of resilient next-hop groups. To a degree the default is
      arbitrary; the only value that certainly does not make sense is 0.
      Therefore going with an existing deployed implementation is reasonable.
      
      Unbalanced time, i.e. how long since the last time that all nexthops had as
      many buckets as they should according to their weights, is reported when
      the group is dumped:
      
       # ip nexthop show id 10
       id 10 group 1/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0
      
      When replacing next hops or changing weights, if one does not specify some
      parameters, their value is left as it was:
      
       # ip nexthop replace id 10 group 1,2/2 type resilient
       # ip nexthop show id 10
       id 10 group 1,2/2 type resilient buckets 8 idle_timer 60 unbalanced_timer 300 unbalanced_time 0
      
      It is also possible to do a dump of individual buckets (and now you know
      why there were only 8 of them in the example above):
      
       # ip nexthop bucket show id 10
       id 10 index 0 idle_time 5.59 nhid 1
       id 10 index 1 idle_time 5.59 nhid 1
       id 10 index 2 idle_time 8.74 nhid 2
       id 10 index 3 idle_time 8.74 nhid 2
       id 10 index 4 idle_time 8.74 nhid 1
       id 10 index 5 idle_time 8.74 nhid 1
       id 10 index 6 idle_time 8.74 nhid 1
       id 10 index 7 idle_time 8.74 nhid 1
      
      Note the two buckets that have a shorter idle time. Those are the ones that
      were migrated after the nexthop replace command to satisfy the new demand
      that nexthop 1 be given 6 buckets instead of 4.
      
      The patchset proceeds as follows:
      
      - Patches #1 and #2 are small refactoring patches.
      
      - Patch #3 adds a new flag to struct nh_group, is_multipath. This flag is
        meant to be set for all nexthop groups that in general have several
        nexthops from which they choose, and avoids a more expensive dispatch
        based on reading several flags, one for each nexthop group type.
      
      - Patch #4 contains defines of new UAPI attributes and the new next-hop
        group type. At this point, the nexthop code is made to bounce the new
        type. As the resilient hashing code is gradually added in the following
        patches, it will remain dead. The last patch will make it accessible.
      
        This patch also adds a suite of new messages related to next hop buckets.
        This approach was taken instead of overloading the information on the
        existing RTM_{NEW,DEL,GET}NEXTHOP messages for the following reasons.
      
        First, a next-hop group can contain a large number of next-hop buckets
        (4k is not unheard of). This imposes limits on the amount of information
        that can be encoded for each next-hop bucket given a netlink message is
        limited to 64k bytes.
      
        Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at this
        point, in the future it can be extended to provide user space with
        control over next-hop buckets configuration.
      
      - Patch #5 contains the meat of the resilient next-hop group support.
      
      - Patches #6 and #7 implement support for notifications towards the
        drivers.
      
      - Patch #8 adds an interface for the drivers to report resilient hash
        table bucket activity. Drivers will be able to report through this
        interface whether traffic is hitting a given bucket.
      
      - Patch #9 adds an interface for the drivers to report whether a given
        hash table bucket is offloaded or trapping traffic.
      
      - In patches #10, #11, #12 and #13, UAPI is implemented. This includes all
        the code necessary for creation of resilient groups, bucket dumping and
        getting, and bucket migration notifications.
      
      - In patch #14 the next-hop groups are finally made available.
      
      The overall plan is to contribute approximately the following patchsets:
      
      1) Nexthop policy refactoring (already pushed)
      2) Preparations for resilient next-hop groups (already pushed)
      3) Implementation of resilient next-hop groups (this patchset)
      4) Netdevsim offload plus a suite of selftests
      5) Preparations for mlxsw offload of resilient next-hop groups
      6) mlxsw offload including selftests
      
      Interested parties can look at the current state of the code at [2] and
      [3].
      
      [1] https://tools.ietf.org/html/rfc2992
      [2] https://github.com/idosch/linux/commits/submit/res_integ_v1
      [3] https://github.com/idosch/iproute2/commits/submit/res_v1
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Enable resilient next-hop groups · 15e1dd57
      Petr Machata authored
      Now that all the code is in place, stop rejecting requests to create
      resilient next-hop groups.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Notify userspace about bucket migrations · 0b4818aa
      Petr Machata authored
      Nexthop replacements et al. are notified through netlink, but if a delayed
      work migrates buckets in the background, userspace will stay oblivious.
      Notify these as RTM_NEWNEXTHOPBUCKET events.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink handlers for bucket get · 187d4c6b
      Petr Machata authored
      Allow getting (but not setting) individual buckets to inspect the next hop
      mapped therein, idle time, and flags.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink handlers for bucket dump · 8a1bbabb
      Petr Machata authored
      Add a dump handler for resilient next hop buckets. When next-hop group ID
      is given, it walks buckets of that group, otherwise it walks buckets of all
      groups. It then dumps the buckets whose next hops match the given filtering
      criteria.
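
The walk described above can be sketched as follows. The `dump_buckets` generator and the dictionary layout are hypothetical, and filtering on the next-hop ID alone is an assumption standing in for whatever filtering criteria the handler accepts.

```python
def dump_buckets(groups, group_id=None, nhid=None):
    """Yield (group id, bucket index, next-hop id) for matching buckets.

    groups: {group id: [next-hop id held by each bucket]}.
    """
    if group_id is not None:
        selected = {group_id: groups[group_id]}  # walk one group only
    else:
        selected = groups                        # walk every group
    for gid, buckets in selected.items():
        for index, bucket_nhid in enumerate(buckets):
            if nhid is None or bucket_nhid == nhid:
                yield (gid, index, bucket_nhid)


groups = {10: [1, 1, 2, 2, 1, 1, 1, 1], 20: [3, 4]}
one_group = list(dump_buckets(groups, group_id=10, nhid=2))
everything = list(dump_buckets(groups))

print(one_group)
print(len(everything))
```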
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink handlers for resilient nexthop groups · a2601e2b
      Petr Machata authored
      Implement the netlink messages that allow creation and dumping of resilient
      nexthop groups.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Allow reporting activity of nexthop buckets · cfc15c1d
      Ido Schimmel authored
      The kernel periodically checks the idle time of nexthop buckets to
      determine if they are idle and can be re-populated with a new nexthop.
      
      When the resilient nexthop group is offloaded to hardware, the kernel
      will not see activity on nexthop buckets unless it is reported from
      hardware.
      
      Add a function that can be periodically called by device drivers to
      report activity on nexthop buckets after querying it from the underlying
      device.
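
The effect of such a report on idleness can be illustrated with a small model. The `report_activity` and `idle_buckets` helpers and the list-of-booleans activity format are assumptions of this sketch and do not mirror the kernel function's signature.

```python
def report_activity(last_used, activity, now):
    """Mark every bucket whose activity flag is set as used at `now`."""
    for index, active in enumerate(activity):
        if active:
            last_used[index] = now
    return last_used


def idle_buckets(last_used, now, idle_timer):
    """Return indexes of buckets that saw no traffic for at least idle_timer."""
    return [i for i, t in enumerate(last_used) if now - t >= idle_timer]


last_used = [0, 0, 0, 0]
# Without a report, offloaded traffic is invisible and every bucket looks idle.
idle_before = idle_buckets(last_used, now=120, idle_timer=60)

# The driver queried the device at time 100 and saw hits on buckets 0 and 3.
report_activity(last_used, [True, False, False, True], now=100)
idle_after = idle_buckets(last_used, now=120, idle_timer=60)

print(idle_before, idle_after)
```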
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Allow setting "offload" and "trap" indication of nexthop buckets · 56ad5ba3
      Ido Schimmel authored
      Add a function that can be called by device drivers to set "offload" or
      "trap" indication on nexthop buckets following nexthop notifications and
      other changes such as a neighbour becoming invalid.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Implement notifiers for resilient nexthop groups · 7c37c7e0
      Petr Machata authored
      Implement the following notifications towards drivers:
      
      - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.
      
      - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
        next hops to hash table buckets. That includes replacements, deletions,
        and delayed upkeep cycles. Some bucket notifications can be vetoed by the
        driver, to make it possible to propagate bucket busy-ness flags from the
        HW back to the algorithm. Some are however forced, e.g. if a next hop is
        deleted, all buckets that use this next hop simply must be migrated,
        whether the HW wishes so or not.
      
      - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
        replaced. Usually the driver will get the bucket notifications as well,
        and could veto those. But in some cases, a bucket may not be migrated
        immediately, but during delayed upkeep, and that is too late to roll the
        transaction back. This notification allows the driver to take a look and
        veto the new proposed group up front, before anything is committed.
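
The difference between vetoable and forced bucket notifications can be modeled as below. This is a sketch only: the `Veto` exception, `migrate_bucket`, and `busy_driver` are hypothetical names, and an exception stands in for the error a driver would return from its notifier.

```python
class Veto(Exception):
    """Raised by the (hypothetical) driver callback to refuse a migration."""


def migrate_bucket(table, index, new_nh, driver_cb, force=False):
    """Ask the driver before migrating; a forced migration ignores the veto."""
    try:
        driver_cb(index, table[index], new_nh, force)
    except Veto:
        if not force:
            return False  # bucket stays where it is
    table[index] = new_nh
    return True


def busy_driver(busy):
    """Build a callback that vetoes unforced changes to busy buckets."""
    def cb(index, old_nh, new_nh, force):
        if index in busy and not force:
            raise Veto(f"bucket {index} is busy in hardware")
    return cb


table = [1, 1, 2, 2]
cb = busy_driver(busy={2})

vetoed = migrate_bucket(table, 2, 1, cb)               # busy bucket: refused
allowed = migrate_bucket(table, 3, 1, cb)              # idle bucket: migrated
forced = migrate_bucket(table, 2, 1, cb, force=True)   # e.g. next hop deleted

print(vetoed, allowed, forced, table)
```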
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add data structures for resilient group notifications · b8f090d0
      Ido Schimmel authored
      Add data structures that will be used for in-kernel notifications about
      addition / deletion of a resilient nexthop group and about changes to a
      hash bucket within a resilient group.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add implementation of resilient next-hop groups · 283a72a5
      Petr Machata authored
      At this moment, there is only one type of next-hop group: an mpath group,
      which implements the hash-threshold algorithm.
      
      To select a next hop, the hash-threshold algorithm first assigns a range of
      hashes to each next hop in the group, and then selects the next hop by
      comparing the SKB hash with the individual ranges. When a next hop is
      removed from the group, the ranges are recomputed, which leads to
      reassignment of parts of hash space from one next hop to another. While
      there will usually be some overlap between the previous and the new
      distribution, some traffic flows change the next hop that they resolve to.
      That causes problems e.g. as established TCP connections are reset, because
      the traffic is forwarded to a server that is not familiar with the
      connection.
      
      Resilient hashing is a technique to address the above problem. A resilient
      next-hop group has another layer of indirection between the group itself
      and its constituent next hops: a hash table. The selection algorithm uses a
      straightforward modulo operation to choose a hash bucket, and then reads
      the next hop that this bucket contains, and forwards traffic there.
      
      This indirection brings an important feature. In the hash-threshold
      algorithm, the range of hashes associated with a next hop must be
      contiguous. With a hash table, mapping between the hash table buckets and
      the individual next hops is arbitrary. Therefore when a next hop is deleted
      the buckets that held it are simply reassigned to other next hops. When
      weights of next hops in a group are altered, it may be possible to choose a
      subset of buckets that are currently not used for forwarding traffic, and
      use those to satisfy the new next-hop distribution demands, keeping the
      "busy" buckets intact. This way, established flows ideally continue to be
      forwarded to the same endpoints through the same paths as before the
      next-hop group change.
      
      In a nutshell, the algorithm works as follows. Each next hop has a number
      of buckets that it wants to have, according to its weight and the number of
      buckets in the hash table. In case of an event that might cause bucket
      allocation change, the numbers for individual next hops are updated,
      similarly to how ranges are updated for mpath group next hops. Following
      that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
      next hop that is currently occupying more buckets than it wants (it is
      "overweight"), it migrates the buckets to one of the next hops that has
      fewer buckets than it wants (it is "underweight"). If, after this, there
      are still underweight next hops, another upkeep run is scheduled for a
      future time.
      
      Chances are there are not enough "idle" buckets to satisfy the new demands.
      The algorithm has knobs to select both what it means for a bucket to be
      idle, and whether and when to forcefully migrate buckets if the number of
      idle buckets remains insufficient.
      
      There are three users of the resilient data structures.
      
      - The forwarding code accesses them under RCU, and does not modify them
        except for updating the time a selected bucket was last used.
      
      - Netlink code, running under RTNL, which may modify the data.
      
      - The delayed upkeep code, which may modify the data. This runs unlocked,
        and mutual exclusion between the RTNL code and the delayed upkeep is
        maintained by canceling the delayed work synchronously before the RTNL
        code touches anything. Later it restarts the delayed work if necessary.
      
      The RTNL code has to implement next-hop group replacement, next hop
      removal, etc. For removal, the mpath code uses a neat trick of having a
      backup next hop group structure, doing the necessary changes offline, and
      then RCU-swapping them in. However, the hash tables for resilient hashing
      are about an order of magnitude larger than the groups themselves (the size
      might be e.g. 4K entries), and it was felt that keeping two of them is
      overkill. Both the primary next-hop group and the spare therefore use the
      same resilient table, and writers are careful to keep all references valid
      for the forwarding code. The hash table references next-hop group entries
      from the next-hop group that is currently in the primary role (i.e. not
      spare). During the transition from primary to spare, the table references a
      mix of both the primary group and the spare. When a next hop is deleted,
      the corresponding buckets are not set to NULL, but instead marked as empty,
      so that the pointer is valid and can be used by the forwarding code. The
      buckets are then migrated to a new next-hop group entry during upkeep. The
      only times that the hash table is invalid are the very beginning and very
      end of its lifetime. Between those points, it is always kept valid.
      
      This patch introduces the core support code itself. It does not handle
      notifications towards drivers, which are kept as if the group were an mpath
      one. It does not handle netlink either. The only bit currently exposed to
      user space is the new next-hop group type, and that is currently bounced.
      There is therefore no way to actually access this code.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink defines and enumerators for resilient NH groups · 710ec562
      Ido Schimmel authored
      - RTM_NEWNEXTHOP et al. that handle resilient groups will have a new nested
        attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.
      
      - RTM_NEWNEXTHOPBUCKET et al. is a suite of new messages that will
        currently serve only for dumping of individual buckets of resilient next
        hop groups. For nexthop group buckets, these messages will carry a nested
        attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.
      
        There are several reasons why a new suite of messages is created for
        nexthop buckets instead of overloading the information on the existing
        RTM_{NEW,DEL,GET}NEXTHOP messages.
      
        First, a nexthop group can contain a large number of nexthop buckets (4k
        is not unheard of). This imposes limits on the amount of information that
        can be encoded for each nexthop bucket given a netlink message is limited
        to 64k bytes.
      
        Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at
        this point, in the future it can be extended to provide user space with
        control over nexthop buckets configuration.
      
      - The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is
        adjusted to bounce groups with that type for now.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add a dedicated flag for multipath next-hop groups · 90e1a9e2
      Petr Machata authored
      With the introduction of resilient nexthop groups, there will be two types
      of multipath groups: the current hash-threshold "mpath" ones, and resilient
      groups. Both are multipath, but to determine the fact, the system needs to
      consider two flags. This might prove costly in the datapath. Therefore,
      introduce a new flag, that should be set for next-hop groups that have more
      than one nexthop, and should be considered multipath.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: __nh_notifier_single_info_init(): Make nh_info an argument · 96a85625
      Petr Machata authored
      The cited function currently uses rtnl_dereference() to get nh_info from a
      handed-in nexthop. However, under the resilient hashing scheme, this
      function will not always be called under RTNL; sometimes the mutual
      exclusion will be achieved differently. Therefore move the nh_info
      extraction from the function to its callers to make it possible to use a
      different synchronization guarantee.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Pass nh_config to replace_nexthop() · 597f48e4
      Petr Machata authored
      Currently, replace assumes that the new group that is given is a
      fully-formed object. But mpath groups really only have one attribute, and
      that is the constituent next hop configuration. This may not be universally
      true. From the usability perspective, it is desirable to allow the replace
      operation to adjust just the constituent next hop configuration and leave
      the group attributes as such intact.
      
      But the object that keeps track of whether an attribute was or was not
      given is the nh_config object, not the next hop or next-hop group. To allow
      (selective) attribute updates during NH group replacement, propagate `cfg'
      to replace_nexthop() and further to replace_nexthop_grp().
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'seg6-next' · 1d5d0a07
      David S. Miller authored
      Julien Massonneau says:
      
      ====================
      SRv6: SRH processing improvements
      
      Add support for IPv4 decapsulation in ipv6_srh_rcv() and
      ignore routing header with segments left equal to 0 for
      seg6local actions that don't perform decapsulation.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • seg6: ignore routing header with segments left equal to 0 · fbbc5bc2
      Julien Massonneau authored
      When a packet carries two segment routing headers, for example after an
      End.B6 action, the second SRH will never be handled by an action: the packet
      will be dropped when the first SRH has segments left equal to 0.
      For actions that don't perform decapsulation (currently: End, End.X,
      End.T, End.B6, End.B6.Encaps), this patch adds the IP6_FH_F_SKIP_RH flag
      in arguments for ipv6_find_hdr().
      Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • seg6: add support for IPv4 decapsulation in ipv6_srh_rcv() · ee90c6ba
      Julien Massonneau authored
      As specified in IETF RFC 8754, section 4.3.1.2, if the upper layer
      header is IPv4 or IPv6, perform IPv6 decapsulation and resubmit the
      decapsulated packet to the IPv4 or IPv6 module.
      Only IPv6 decapsulation was implemented. This patch adds support for IPv4
      decapsulation.
      
      Link: https://tools.ietf.org/html/rfc8754#section-4.3.1.2
      Signed-off-by: Julien Massonneau <julien.massonneau@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'hns3-next' · 6c609521
      David S. Miller authored
      Huazhong Tan says:
      
      ====================
      net: hns3: two updates for -next
      
      This series includes two updates for the HNS3 ethernet driver.
      ====================
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: use pause capability queried from firmware · e8194f32
      Yufeng Mo authored
      For maintainability and compatibility, add support to use pause
      capability queried from firmware, and add debugfs support to dump
      this capability.
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: hns3: use FEC capability queried from firmware · 433ccce8
      Yufeng Mo authored
      For maintainability and compatibility, add support to use FEC
      capability queried from firmware.
      Signed-off-by: Yufeng Mo <moyufeng@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 11 Mar, 2021 3 commits
  3. 10 Mar, 2021 16 commits
    • net: fddi: skfp: Mundane typo fixes throughout the file smt.h · 34bb9751
      Bhaskar Chowdhury authored
      A few spelling fixes throughout the file.
      Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com>
      Acked-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: route.c: fix space before tab · 6b9c8f46
      Shubhankar Kuranagatti authored
      The extra space before the tab has been removed.
      Signed-off-by: Shubhankar Kuranagatti <shubhankarvk@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6b9c8f46
    • David S. Miller's avatar
      Merge branch 'ionic-next' · f2050d91
      David S. Miller authored
      Shannon Nelson says:
      
      ====================
      ionic Rx updates
      
      The ionic driver's Rx path is due for an overhaul in order to
      better use memory buffers and to clean up the data structures.
      
      The first two patches convert the driver to using page sharing
      between buffers so as to lessen the page alloc and free overhead.
      
      The remaining patches clean up the structs and fastpath code for
      better efficiency.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f2050d91
    • Shannon Nelson's avatar
      ionic: simplify use of completion types · a25edab9
      Shannon Nelson authored
      Make better use of our struct types and type checking by passing
      the actual Rx or Tx completion type rather than a generic void
      pointer type.
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a25edab9
    • Shannon Nelson's avatar
      ionic: rebuild debugfs on qcq swap · 55eda6bb
      Shannon Nelson authored
      Each queue reconfigure requires a rebuild of the matching
      debugfs information.
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55eda6bb
    • Shannon Nelson's avatar
      ionic: simplify rx skb alloc · 89e572e7
      Shannon Nelson authored
      Remove an unnecessary layer over rx skb allocation.
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89e572e7
    • Shannon Nelson's avatar
      ionic: optimize fastpath struct usage · f37bc346
      Shannon Nelson authored
      Clean up a couple of struct uses to make for better fast path
      access.
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f37bc346
    • Shannon Nelson's avatar
      ionic: implement Rx page reuse · 4b0a7539
      Shannon Nelson authored
      Rework the Rx buffer allocations to use pages twice when using
      normal MTU in order to cut down on buffer allocation and mapping
      overhead.
      
      Instead of tracking individual pages, in which we may have
      wasted half the space when using standard 1500 MTU, we track
      buffers which use half pages, so we can use the second half
      of the page rather than allocate and map a new page once the
      first buffer has been used.
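
      The half-page tracking described above can be sketched in user space
      roughly as follows. This is only an illustration, not the ionic code:
      the sketch_* names, the 4096-byte page size and the use of
      malloc()/free() in place of page allocation, DMA mapping and page
      reference counting are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u                  /* assumed page size */
#define SKETCH_HALF_PAGE (SKETCH_PAGE_SIZE / 2) /* fits a 1500 MTU frame */

/* Hypothetical buffer descriptor: one page handed out in two halves. */
struct sketch_rx_buf {
	unsigned char *page; /* backing page, NULL when none is held */
	size_t offset;       /* start of the half currently in use */
};

static unsigned int sketch_page_allocs; /* counts simulated page allocations */

/* Allocate a fresh page for the buffer; returns false on failure. */
static bool sketch_rx_buf_alloc(struct sketch_rx_buf *buf)
{
	buf->page = malloc(SKETCH_PAGE_SIZE);
	if (!buf->page)
		return false;
	buf->offset = 0;
	sketch_page_allocs++;
	return true;
}

/*
 * Called once the half at buf->offset has been consumed. Returns true if
 * the second half of the same page can be used next, false if both halves
 * are spent and the caller has to allocate a new page.
 */
static bool sketch_rx_buf_recycle(struct sketch_rx_buf *buf)
{
	assert(buf->page != NULL);

	buf->offset += SKETCH_HALF_PAGE;
	if (buf->offset + SKETCH_HALF_PAGE <= SKETCH_PAGE_SIZE)
		return true;

	/* Both halves used: drop the page (a simplification of refcounting). */
	free(buf->page);
	buf->page = NULL;
	buf->offset = 0;
	return false;
}
```

      With a standard 1500 MTU, two received frames then cost a single
      page allocation in this sketch instead of two.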
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4b0a7539
    • Shannon Nelson's avatar
      ionic: move rx_page_alloc and free · 2b5720f2
      Shannon Nelson authored
      Move ionic_rx_page_alloc() and ionic_rx_page_free() to earlier
      in the file to make the next patch easier to review.
      Signed-off-by: Shannon Nelson <snelson@pensando.io>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2b5720f2
    • David S. Miller's avatar
      Merge branch 'dpaa2-switch-next' · eeada410
      David S. Miller authored
      Ioana Ciornei says:
      
      ====================
      dpaa2-switch: CPU terminated traffic and move out of staging
      
      This patch set adds support for Rx/Tx capabilities on DPAA2 switch port
      interfaces as well as fixing up some major blunders in how we take care
      of the switching domains. The last patch actually moves the driver out
      of staging now that the minimum requirements are met.
      
      I am sending this directly towards the net-next tree so that I can use
      the rest of the development cycle adding new features on top of the
      current driver without worrying about merge conflicts between the
      staging and net-next tree.
      
      The control interface is comprised of 3 queues in total: Rx, Rx error
      and Tx confirmation. In this patch set we only enable Rx and Tx conf.
      All switch ports share the same queues when frames are redirected to the
      CPU.  Information regarding the ingress switch port is passed through
      frame metadata - the flow context field of the descriptor.
      
      NAPI instances are also shared between switch net_devices and are
      enabled when .dev_open() has been called on at least one of the
      switch ports and disabled when no switch port is still up.
      
      Since the last version of this feature was submitted to the list, I
      reworked how the switching and flooding domains are taken care of by the
      driver, thus the switch is now able to also add the control port (the
      queues that the CPU can dequeue from) into the flooding domains of a
      port (broadcast, unknown unicast etc). With this, we are able to receive
      and send traffic from the switch interfaces.
      
      Also, the capability to properly partition the DPSW object into multiple
      switching domains was added so that when not under a bridge, the ports
      are not actually able to switch between them. This is possible by
      adding a private FDB table per switch interface.  When multiple switch
      interfaces are under the same bridge, they will all use the same FDB
      table.
      
      Another thing that is fixed in this patch set is how the driver handles
      VLAN awareness. The DPAA2 switch is not capable of running as VLAN unaware
      but this was not reflected in how the driver responded to requests to
      change the VLAN awareness. In the last patch, this is fixed by
      describing the switch interfaces as Rx VLAN filtering on [fixed] and
      declining any request to join a VLAN unaware bridge.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eeada410
    • Ioana Ciornei's avatar
      staging: dpaa2-switch: move the driver out of staging · f48298d3
      Ioana Ciornei authored
      Now that the dpaa2-switch driver has basic I/O capabilities on the
      switch port net_devices and multiple bridging domains are supported,
      move the driver out of staging.
      
      The dpaa2-switch driver is placed right next to the dpaa2-eth driver
      since, in the near future, they will be sharing most of the data path.
      I didn't implement code reuse in this patch series because I wanted to
      keep it as small as possible.
      
      Also, the README is removed from staging with the intention to add
      proper rst documentation afterwards to actually match what is supported
      by the driver.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f48298d3
    • Ioana Ciornei's avatar
      staging: dpaa2-switch: prevent joining a bridge while VLAN uppers are present · 1c4928fc
      Ioana Ciornei authored
      Each time a switch port joins a bridge, it will start to use a FDB table
      common with all the other switch ports that are under the same bridge.
      This means that any VLAN added prior to a bridge join will retain its
      previous FDB table destination. With this patch, I choose to restrict
      when a switch port can change its upper device (either join or leave)
      so that the driver does not have to delete all the previously installed
      VLANs from the previous FDB and add them into the new one.
      
      Thus, in the PRECHANGEUPPER notification we check if there are any VLAN
      type upper devices and if that's true, deny the CHANGEUPPER.
      
      This way, the user is not restricted in the topology but rather in the
      order in which the setup is done: it must first create the bridging
      domain layout and after that add the necessary VLAN devices if
      necessary. The teardown is similar, the VLAN devices will need to be
      destroyed prior to a change in the bridging layout.
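
      The check made on PRECHANGEUPPER can be pictured with the small
      user-space sketch below. The sketch_* types, the flat array of upper
      devices and the -EBUSY return value are assumptions made for
      illustration; the driver itself works on net_device uppers and may
      report a different error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical kinds of upper device stacked on a switch port. */
enum sketch_upper_kind { SKETCH_UPPER_BRIDGE, SKETCH_UPPER_VLAN };

/* Hypothetical port model: just the list of its upper devices. */
struct sketch_upper_port {
	const enum sketch_upper_kind *uppers;
	size_t n_uppers;
};

/*
 * Deny a change of upper device while any VLAN upper exists, so VLANs
 * never have to be migrated from one FDB table to another.
 * Returns 0 to allow the change, a negative errno value to deny it.
 */
static int sketch_prechangeupper(const struct sketch_upper_port *port)
{
	assert(port != NULL);

	for (size_t i = 0; i < port->n_uppers; i++)
		if (port->uppers[i] == SKETCH_UPPER_VLAN)
			return -EBUSY; /* assumed error code */

	return 0;
}
```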
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1c4928fc
    • Ioana Ciornei's avatar
      staging: dpaa2-switch: add fast-ageing on bridge leave · 685b4801
      Ioana Ciornei authored
      Upon leaving a bridge, any MAC addresses learnt on the switch port prior
      to this point have to be removed so that we preserve the bridging domain
      configuration.
      
      Restructure the dpaa2_switch_port_fdb_dump() function in order to have a
      common dpaa2_switch_fdb_iterate() function between the FDB dump callback
      and the fast age procedure. To accomplish this, add a new callback -
      dpaa2_switch_fdb_cb_t - which will be called on each MAC addr and,
      depending on the situation, will either dump the FDB entry into a
      netlink message or will delete the address from the FDB table, in case
      of the fast-age.
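
      The shape of that restructuring, one iterator driven by a per-entry
      callback, can be sketched in plain C as below. Everything named
      sketch_* is hypothetical: the fixed-size table stands in for the
      hardware FDB, and counting entries stands in for filling a netlink
      message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_FDB_MAX 8

struct sketch_fdb_entry {
	unsigned char mac[6];
	int port;    /* port the address was learnt on */
	bool in_use;
};

struct sketch_fdb {
	struct sketch_fdb_entry entries[SKETCH_FDB_MAX];
};

/* Per-entry callback; a non-zero return value stops the iteration. */
typedef int (*sketch_fdb_cb_t)(struct sketch_fdb_entry *entry, void *data);

/* Store an address in the first free slot; false if the table is full. */
static bool sketch_fdb_add(struct sketch_fdb *fdb, const unsigned char mac[6],
			   int port)
{
	assert(fdb != NULL && mac != NULL);

	for (size_t i = 0; i < SKETCH_FDB_MAX; i++) {
		struct sketch_fdb_entry *e = &fdb->entries[i];

		if (e->in_use)
			continue;
		memcpy(e->mac, mac, sizeof(e->mac));
		e->port = port;
		e->in_use = true;
		return true;
	}
	return false;
}

/* Walk all entries learnt on @port and hand each one to @cb. */
static int sketch_fdb_iterate(struct sketch_fdb *fdb, int port,
			      sketch_fdb_cb_t cb, void *data)
{
	for (size_t i = 0; i < SKETCH_FDB_MAX; i++) {
		struct sketch_fdb_entry *e = &fdb->entries[i];
		int err;

		if (!e->in_use || e->port != port)
			continue;
		err = cb(e, data);
		if (err)
			return err;
	}
	return 0;
}

/* Dump path: here it only counts entries into the int behind @data. */
static int sketch_fdb_dump_cb(struct sketch_fdb_entry *entry, void *data)
{
	(void)entry;
	(*(int *)data)++;
	return 0;
}

/* Fast-age path: delete the entry from the table. */
static int sketch_fdb_fast_age_cb(struct sketch_fdb_entry *entry, void *data)
{
	(void)data;
	entry->in_use = false;
	return 0;
}
```

      Both users share the walk and differ only in the callback they pass,
      which is the point of the common iterate function.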
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      685b4801
    • Ioana Ciornei's avatar
      staging: dpaa2-switch: accept only vlan-aware upper devices · d671407f
      Ioana Ciornei authored
      The DPAA2 Switch is not capable of handling traffic in a VLAN unaware
      fashion, thus the previous handling of both the accepted upper devices
      and the SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING flag was wrong.
      
      Fix this by checking if the bridge that we are joining is indeed VLAN
      aware and returning an error if it is not. Also, the RX VLAN filtering
      feature is defined as 'on [fixed]' and the .ndo_vlan_rx_add_vid() and
      .ndo_vlan_rx_kill_vid() callbacks are implemented just by recreating a
      switchdev_obj_port_vlan object and then calling the same functions used
      on the switchdev notifier path.
      In addition, changing the vlan_filtering flag to 0 on a bridge under
      which a DPAA2 switch interface is present is not supported, thus
      rejected when SWITCHDEV_ATTR_ID_BRIDGE_VLAN_FILTERING is received with
      such a request.
      
      This patch is also adding the use of the switchdev_handle_port_attr_set
      function so that we can iterate through all the lower devices of the
      bridge that the notification was received on and actually catch if the
      user is trying to change the vlan_filtering state. Since on a VLAN
      filtering change the net_device is the bridge, we also move the
      dpaa2_switch_port_dev_check call so that we do not return NOTIFY_DONE
      right away.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d671407f
    • Ioana Ciornei's avatar
      staging: dpaa2-switch: move the notifier register to module_init() · 16abb6ad
      Ioana Ciornei authored
      Move the notifier blocks register into the module_init() step, instead of
      object probe, so that all DPSW devices probed by the dpaa2-switch driver
      can use the same notifiers.
      
      This will enable us to have a more straightforward approach in
      determining if an event is intended for an object managed by this driver
      or not. Previously, the dpaa2_switch_port_dev_check() function was
      forced to also check the notifier block beside the net_device_ops
      structure to determine if the event is for us or not.
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16abb6ad
    • Ioana Ciornei's avatar
      staging: dpaa2-switch: properly setup switching domains · 539dda3c
      Ioana Ciornei authored
      Until now, the DPAA2 switch was not able to properly set up its
      switching domains depending on the existence, or lack thereof, of an
      upper bridge device. This meant that all switch ports of a DPSW object
      were switching by default even though they were not under the same
      bridge device.
      
      Another issue was the inability to actually add the CPU in the flooding
      domains (broadcast, unknown unicast etc) of a particular switch port.
      This meant that a simple ping on a switch interface was not possible
      since no broadcast ARP frame would actually reach the CPU queues.
      
      This patch tries to fix exactly these problems by:
      
      * Creating and managing a FDB table for each flooding domain. This means
        that when a switch interface is not bridged it will use its own FDB
        table. In bridged mode, all DPAA2 switch interfaces under the
        same upper will use the same FDB table, thus leveraging the same FDB
        entries.
      
      * Adding a new MC firmware command - dpsw_set_egress_flood() - through
        which the driver can set up the flooding domains as needed. For
        example, when the switch interface is standalone, thus not in a
        bridge with any other DPAA2 switch port, it will set up its broadcast
        and unknown unicast flooding domains to only include the control
        interface (the queues that reach the CPU and the driver can dequeue
        from). This flooding domain changes when the interface joins a bridge
        and is configured to include, beside the control interface, all other
        DPAA2 switch interfaces.
      
      We impose a minimum limit of FDB tables available equal to the number of
      switch interfaces so that we guarantee that, in the maximal
      configuration (all interfaces are standalone), each switch port will
      have a private FDB table. At the same time, we only probe DPSW objects
      that have the flooding and broadcast replicators configured to be per
      FDB (DPSW_*_PER_FDB). Without this, the dpaa2-switch driver would not
      be able to configure multiple switching domains.
      
      At probe time, a FDB table will be allocated for each port. At a bridge
      join event, the switch port will either continue to use the current FDB
      table (if it's the first dpaa2-switch port to join that bridge) or will
      switch to use the FDB table associated with the port that is already
      under the bridge. If a FDB switch is necessary, the private FDB table
      which was previously used will be returned to the pool of unused FDBs.
      
      Upon a bridge leave, the switch port needs a private FDB table thus it
      will search and get the first unused FDB table. This way, all the other
      ports remaining under the bridge will continue to use the same FDB
      table.
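
      The FDB hand-over on bridge join and leave can be modelled with the
      user-space sketch below. It is a simplified illustration under stated
      assumptions, not the driver code: the sketch_* names, the fixed port
      count, the integer bridge identifier (0 meaning standalone) and the
      choice that the last port leaving a bridge simply keeps its table are
      all made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NUM_PORTS 4

struct sketch_fdb_table {
	int id;
	bool in_use;
};

struct sketch_sw_port {
	struct sketch_fdb_table *fdb;
	int bridge; /* 0 means standalone (assumed encoding) */
};

struct sketch_switch {
	/* One FDB table per port, so every port can be standalone at once. */
	struct sketch_fdb_table fdbs[SKETCH_NUM_PORTS];
	struct sketch_sw_port ports[SKETCH_NUM_PORTS];
};

/* Probe time: every port gets a private FDB table. */
static void sketch_switch_init(struct sketch_switch *sw)
{
	for (size_t i = 0; i < SKETCH_NUM_PORTS; i++) {
		sw->fdbs[i].id = (int)i;
		sw->fdbs[i].in_use = true;
		sw->ports[i].fdb = &sw->fdbs[i];
		sw->ports[i].bridge = 0;
	}
}

/* Find another port of this switch that is already under @bridge. */
static struct sketch_sw_port *
sketch_other_port_in_bridge(struct sketch_switch *sw,
			    const struct sketch_sw_port *port, int bridge)
{
	for (size_t i = 0; i < SKETCH_NUM_PORTS; i++) {
		struct sketch_sw_port *p = &sw->ports[i];

		if (p != port && p->bridge == bridge)
			return p;
	}
	return NULL;
}

/* Bridge join: keep the FDB if first in the bridge, else share it. */
static void sketch_port_bridge_join(struct sketch_switch *sw,
				    struct sketch_sw_port *port, int bridge)
{
	struct sketch_sw_port *other;

	assert(bridge != 0);
	other = sketch_other_port_in_bridge(sw, port, bridge);
	port->bridge = bridge;
	if (!other)
		return;

	port->fdb->in_use = false; /* return the private table to the pool */
	port->fdb = other->fdb;
}

/* Bridge leave: take the first unused FDB; false if none is free. */
static bool sketch_port_bridge_leave(struct sketch_switch *sw,
				     struct sketch_sw_port *port)
{
	struct sketch_sw_port *other;

	assert(port->bridge != 0);
	other = sketch_other_port_in_bridge(sw, port, port->bridge);
	port->bridge = 0;
	if (!other)
		return true; /* assumption: the last port out keeps its table */

	for (size_t i = 0; i < SKETCH_NUM_PORTS; i++) {
		if (sw->fdbs[i].in_use)
			continue;
		sw->fdbs[i].in_use = true;
		port->fdb = &sw->fdbs[i];
		return true;
	}
	return false;
}
```

      Because the pool holds as many tables as there are ports, a leaving
      port always finds a free table in this model.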
      Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
      Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      539dda3c