1. 09 Feb, 2021 24 commits
    • net: hns3: remove redundant client_setup_tc handle · ae9e492a
      Jian Shen authored
      Since the real tx queue number and real rx queue number
      are always updated when the netdev opens, it's redundant
      to call hclge_client_setup_tc to do the same thing.
      So remove it.
      Signed-off-by: Jian Shen <shenjian15@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ae9e492a
    • net: hns3: clean up some incorrect variable types in hclge_dbg_dump_tm_map() · 0256844d
      Yonglong Liu authored
      queue_id, qset_id and other IDs are unsigned type, so modify
      the corresponding local variables' type in hclge_dbg_dump_tm_map()
      from signed to unsigned. kstrtouint() and the print format should
      be updated as well.
      Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
      Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0256844d
    • Merge branch 'implement-kthread-based-napi-poll' · adbb4fb0
      David S. Miller authored
      Wei Wang says:
      
      ====================
      implement kthread based napi poll
      
      The idea of moving the napi poll process out of softirq context to a
      kernel thread based context is not new.
      Paolo Abeni and Hannes Frederic Sowa have proposed patches to move napi
      poll to kthread back in 2016. And Felix Fietkau has also proposed
      patches of similar ideas to use workqueue to process napi poll just a
      few weeks ago.
      
      The main reason we'd like to push forward with this idea is that the
      scheduler has poor visibility into cpu cycles spent in softirq context,
      and is not able to make optimal scheduling decisions for the user threads.
      For example, in one application benchmark where network load is high,
      the CPUs handling network softirqs have ~80% cpu util. User threads
      are still scheduled on those CPUs, despite other more idle cpus being
      available in the system, and we see very high tail latencies. In this
      case, we have to explicitly pin user threads away from the CPUs handling
      network softirqs to ensure good performance.
      With napi poll moved to kthread, the scheduler is in charge of scheduling
      both the kthreads handling network load and the user threads, and is able
      to make better decisions. In the previous benchmark, if we do this and
      pin the kthreads processing napi poll to specific CPUs, the scheduler is
      able to schedule user threads away from these CPUs automatically.
      
      And the reason we prefer 1 kthread per napi, instead of 1 workqueue
      entity per host, is that kthread is more configurable than workqueue,
      and we could leverage existing tuning tools for threads, like taskset,
      chrt, etc to tune scheduling class and cpu set, etc. Another reason is
      if we eventually want to provide busy poll feature using kernel threads
      for napi poll, kthread seems to be more suitable than workqueue.
      Furthermore, for large platforms with 2 NICs attached to 2 sockets,
      kthread is more flexible to be pinned to different sets of CPUs.
      
      In this patch series, I revived Paolo and Hannes's patch in 2016 and
      made modifications. Then there are changes proposed by Felix, Jakub,
      Paolo and myself on top of those, with suggestions from Eric Dumazet.
      
      In terms of performance, I ran tcp_rr tests with 1000 flows with
      various request/response sizes, with RFS/RPS disabled, and compared
      performance between softirq vs kthread vs workqueue (patchset proposed
      by Felix Fietkau).
      Host has 56 hyper threads and 100Gbps nic, 8 rx queues and only 1 numa
      node. All threads are unpinned.
      
               req/resp  QPS     50%tile  90%tile  99%tile  99.9%tile
      softirq  1B/1B     2.75M   337us    376us    1.04ms   3.69ms
      kthread  1B/1B     2.67M   371us    408us    455us    550us
      workq    1B/1B     2.56M   384us    435us    673us    822us

      softirq  5KB/5KB   1.46M   678us    750us    969us    2.78ms
      kthread  5KB/5KB   1.44M   695us    789us    891us    1.06ms
      workq    5KB/5KB   1.34M   720us    905us    1.06ms   1.57ms

      softirq  1MB/1MB   11.0K   79ms     166ms    306ms    630ms
      kthread  1MB/1MB   11.0K   75ms     177ms    303ms    596ms
      workq    1MB/1MB   11.0K   79ms     180ms    303ms    587ms
      
      When running the workqueue implementation, I found the number of threads
      used is usually twice as many as in the kthread implementation. This
      probably introduces higher scheduling cost, which results in higher tail
      latencies in most cases.
      
      I also ran an application benchmark, which performs fixed qps remote SSD
      read/write operations, with various sizes. Again, both with RFS/RPS
      disabled.
      The result is as follows:
               op_size  QPS      50%tile  95%tile  99%tile  99.9%tile
      softirq  4K       572.6K   385us    1.5ms    3.16ms   6.41ms
      kthread  4K       572.6K   390us    803us    2.21ms   6.83ms
      workq    4K       572.6K   384us    763us    3.12ms   6.87ms

      softirq  64K      157.9K   736us    1.17ms   3.40ms   13.75ms
      kthread  64K      157.9K   745us    1.23ms   2.76ms   9.87ms
      workq    64K      157.9K   746us    1.23ms   2.76ms   9.96ms

      softirq  1M       10.98K   2.03ms   3.10ms   3.7ms    11.56ms
      kthread  1M       10.98K   2.13ms   3.21ms   4.02ms   13.3ms
      workq    1M       10.98K   2.13ms   3.20ms   3.99ms   14.12ms
      
      In this set of tests, the latency is dominated by the SSD operation.
      Also, the user threads are much busier compared to the tcp_rr tests. We
      have to pin the kthreads/workqueue threads to a few CPUs, so as not to
      disturb user threads, and to provide some isolation.
      
      Changes since v9:
      Small change in napi_poll() in patch 1.
      Split napi_kthread_stop() functionality to add separately in
      napi_disable() and netif_napi_del() in patch 2.
      Add description for napi_set_threaded() and return dev->threaded when
      dev->napi_list is empty for threaded sysfs in patch 3.
      
      Changes since v8:
      Added description for threaded param in struct net_device in patch 2.
      
      Changes since v7:
      Break napi_set_threaded() into 2 parts, one to create kthread called
      from netif_napi_add(), the other to set threaded bit in napi_enable(),
      to get rid of inconsistency through all napi in 1 dev.
      Added documentation for /sys/class/net/<dev>/threaded.
      
      Changes since v6:
      Added memory barrier in napi_set_threaded().
      Changed /sys/class/net/<dev>/threaded to a ternary value.
      Change dev->threaded to a bit instead of bool.
      
      Changes since v5:
      Removed ASSERT_RTNL() from napi_set_threaded() and removed rtnl_lock()
      operation from napi_enable().
      
      Changes since v4:
      Recorded the threaded setting in dev and restore it in napi_enable().
      
      Changes since v3:
      Merged and rearranged patches in a logical order for easier review.
      Changed sysfs control to be per device.
      
      Changes since v2:
      Corrected typo in patch 1, and updated the cover letter with more
      detailed and updated test results.
      
      Changes since v1:
      Replaced kthread_create() with kthread_run() in patch 5 as suggested by
      Felix Fietkau.
      
      Changes since RFC:
      Renamed the kthreads to be napi/<dev>-<napi_id> in patch 5 as suggested
      by Hannes Frederic Sowa.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      adbb4fb0
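The cover letter above argues that one kthread per napi can be tuned with ordinary thread tools such as taskset and chrt, and notes that the kthreads are named napi/<dev>-<napi_id>. A minimal Python sketch of locating such threads by name and pinning them could look like the following. The helper names (`find_napi_kthreads`, `pin_threads`) and the `proc_root` parameter are hypothetical and exist only for illustration; only the thread naming scheme comes from the text above. Thread names in /proc are length-limited, so very long device names may not match this pattern.

```python
import os
import re

# Thread name format described in the cover letter: napi/<dev>-<napi_id>
NAPI_COMM = re.compile(r"^napi/(?P<dev>.+)-(?P<napi_id>\d+)$")


def find_napi_kthreads(dev, proc_root="/proc"):
    """Return sorted PIDs whose comm matches napi/<dev>-<napi_id>.

    proc_root is a hypothetical parameter so the scan can be pointed
    at a test directory instead of the real /proc.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        m = NAPI_COMM.match(comm)
        if m and m.group("dev") == dev:
            pids.append(int(entry))
    return sorted(pids)


def pin_threads(pids, cpus):
    """Restrict each PID to the given CPU set (similar in effect to taskset)."""
    for pid in pids:
        os.sched_setaffinity(pid, cpus)
```

A call such as `pin_threads(find_napi_kthreads("eth0"), {2, 3})` would need sufficient privileges; the device name and CPU numbers here are placeholders, not values from the benchmark.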
    • net: add sysfs attribute to control napi threaded mode · 5fdd2f0e
      Wei Wang authored
      This patch adds a new sysfs attribute to the network device class.
      Said attribute provides a per-device control to enable/disable the
      threaded mode for all the napi instances of the given network device,
      without the need for a device up/down.
      User sets it to 1 or 0 to enable or disable threaded mode.
      Note: when switching between threaded and the current softirq based mode
      for a napi instance, it will not immediately take effect if the napi is
      currently being polled. The mode switch will take effect the next time
      napi_schedule() is called.
      Co-developed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Co-developed-by: Felix Fietkau <nbd@nbd.name>
      Signed-off-by: Felix Fietkau <nbd@nbd.name>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fdd2f0e
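As a rough illustration of the per-device control described in the commit above, the sketch below writes 1 or 0 to the attribute, assuming it is exposed as /sys/class/net/<dev>/threaded (the path named in the series changelog). The function names and the `sysfs_root` parameter are hypothetical; this is not code from the patch.

```python
import os


def threaded_attr_path(dev, sysfs_root="/sys/class/net"):
    """Path of the per-device attribute: <sysfs_root>/<dev>/threaded."""
    return os.path.join(sysfs_root, dev, "threaded")


def set_napi_threaded(dev, enable, sysfs_root="/sys/class/net"):
    """Write 1 or 0 to enable or disable threaded mode for the device."""
    with open(threaded_attr_path(dev, sysfs_root), "w") as f:
        f.write("1" if enable else "0")


def get_napi_threaded(dev, sysfs_root="/sys/class/net"):
    """Read the attribute back and parse it as an integer."""
    with open(threaded_attr_path(dev, sysfs_root)) as f:
        return int(f.read().strip())
```

Writing to the real sysfs file normally requires root. As the commit notes, a napi instance that is being polled at the time of the write only switches modes on a later napi_schedule() call, so a read-back of the attribute does not by itself prove that every instance has already switched.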
    • net: implement threaded-able napi poll loop support · 29863d41
      Wei Wang authored
      This patch allows running each napi poll loop inside its own
      kernel thread.
      The kthread is created during netif_napi_add() if dev->threaded
      is set. And threaded mode is enabled in napi_enable(). We will
      provide a way to set dev->threaded and enable threaded mode
      without a device up/down in the following patch.
      
      Once threaded mode is enabled and the kthread is
      started, napi_schedule() will wake up that thread instead
      of scheduling the softirq.

      The threaded poll loop behaves much like net_rx_action(),
      but it does not have to manipulate local irqs and uses
      an explicit scheduling point based on netdev_budget.
      Co-developed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Co-developed-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      29863d41
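The poll loop described in the commit above can be pictured with a small user-space toy model. This is only a sketch of the general shape (poll repeatedly, stop when a poll does less than a full batch, and reach a scheduling point once a budget is used up); it is not the kernel implementation. The class `ToyNapi`, the function `threaded_poll`, the `yield_fn` callback and the default values 64 and 300 are all assumptions made for illustration.

```python
from collections import deque


class ToyNapi:
    """Stand-in for a napi instance: a queue of pending packets."""

    def __init__(self, packets, weight=64):
        self.pending = deque(packets)
        self.weight = weight  # max packets handled per poll call (assumed)

    def poll(self):
        """Process up to `weight` packets; return how many were handled."""
        done = 0
        while self.pending and done < self.weight:
            self.pending.popleft()
            done += 1
        return done


def threaded_poll(napi, budget=300, yield_fn=lambda: None):
    """Poll until a poll call does less than a full batch.

    Each time `budget` packets have been processed since the last yield,
    call yield_fn() as the explicit scheduling point.
    Returns (packets_processed, number_of_yields).
    """
    total = 0
    used = 0
    yields = 0
    while True:
        work = napi.poll()
        total += work
        used += work
        if work < napi.weight:
            break  # queue drained: the thread would go back to sleep
        if used >= budget:
            yield_fn()  # let the scheduler run other tasks
            yields += 1
            used = 0
    return total, yields
```

In this model, `threaded_poll(ToyNapi(range(1000)))` drains the queue in batches and passes through the scheduling point a few times along the way, which is the property that lets the scheduler account for and interleave the network work with user threads.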
    • net: extract napi poll functionality to __napi_poll() · 898f8015
      Felix Fietkau authored
      This commit introduces a new function __napi_poll() which does the main
      logic of the existing napi_poll() function, and will be called by other
      functions in later commits.
      The idea and implementation are by Felix Fietkau <nbd@nbd.name> and
      were proposed as part of the patch to move napi work to workqueue
      context.
      This commit by itself is a code restructure.
      Signed-off-by: Felix Fietkau <nbd@nbd.name>
      Signed-off-by: Wei Wang <weiwan@google.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      898f8015
    • net: broadcom: bcm4908enet: add BCM4908 controller driver · 4feffead
      Rafał Miłecki authored
      The BCM4908 SoC family uses an Ethernet controller that includes UniMAC
      but uses a different DMA engine (than other controllers) and requires
      different programming.
      Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4feffead
    • dt-bindings: net: document BCM4908 Ethernet controller · 387d1c18
      Rafał Miłecki authored
      BCM4908 is a family of SoCs with an integrated Ethernet controller.
      Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      387d1c18
    • Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next · fc1a8db3
      David S. Miller authored
      Steffen Klassert says:
      
      ====================
      pull request (net-next): ipsec-next 2021-02-09
      
      1) Support TSO on xfrm interfaces.
         From Eyal Birger.
      
      2) Variable calculation simplifications in esp4/esp6.
         From Jiapeng Chong / Jiapeng Zhong.
      
      3) Fix a return code in xfrm_do_migrate.
         From Zheng Yongjun.
      
      Please pull or let me know if there are problems.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc1a8db3
    • Documentation: networking: ip-sysctl: Document src_valid_mark sysctl · 8cf5d8cc
      Jay Vosburgh authored
      Provide documentation for src_valid_mark sysctl, which was added
      in commit 28f6aeea ("net: restore ip source validation").
      Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8cf5d8cc
    • net: phy: broadcom: remove BCM5482 1000Base-BX support · 1e2e61af
      Michael Walle authored
      It is not used anywhere in the kernel. It also seems to be lacking the
      proper fiber advertise flags. Remove it.
      Signed-off-by: Michael Walle <michael@walle.cc>
      Acked-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1e2e61af
    • net: phy: drop explicit genphy_read_status() op · f15008fb
      Michael Walle authored
      genphy_read_status() is already the default for the .read_status() op.
      Drop the unnecessary references.
      Signed-off-by: Michael Walle <michael@walle.cc>
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f15008fb
    • Merge branch 'route-offload-failure' · 5ea3c72c
      David S. Miller authored
      net: Add support for route offload failure notifications
      
      Ido Schimmel says:
      
      ====================
      This is a complementary series to the one merged in commit 389cb1ec
      ("Merge branch 'add-notifications-when-route-hardware-flags-change'").
      
      The previous series added RTM_NEWROUTE notifications to user space
      whenever a route was successfully installed in hardware or when its
      state in hardware changed. This allows routing daemons to delay
      advertisement of routes until they are installed in hardware.
      
      However, if route installation failed, a routing daemon will wait
      indefinitely for a notification that will never come. The aim of this
      series is to provide a failure notification via a new flag
      (RTM_F_OFFLOAD_FAILED) in the RTM_NEWROUTE message. Upon such a
      notification a routing daemon may decide to withdraw the route from the
      FIB.
      
      Series overview:
      
      Patch #1 adds the new RTM_F_OFFLOAD_FAILED flag
      
      Patches #2-#3 and #4-#5 add failure notifications to IPv4 and IPv6,
      respectively
      
      Patches #6-#8 teach netdevsim to fail route installation via a new knob
      in debugfs
      
      Patch #9 extends mlxsw to mark routes with the new flag
      
      Patch #10 adds test cases for the new notification over netdevsim
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ea3c72c
    • selftests: netdevsim: Test route offload failure notifications · 9ee53e37
      Amit Cohen authored
      Add cases to verify that when debugfs variable "fail_route_offload" is
      set, notification with "rt_offload_failed" flag is received.
      
      Extend the existing cases to verify that when sysctl
      "fib_notify_on_flag_change" is set to 2, the kernel emits notifications
      only for failed route installation.
      
      $ ./fib_notifications.sh
      TEST: IPv4 route addition				[ OK ]
      TEST: IPv4 route deletion				[ OK ]
      TEST: IPv4 route replacement				[ OK ]
      TEST: IPv4 route offload failed				[ OK ]
      TEST: IPv6 route addition				[ OK ]
      TEST: IPv6 route deletion				[ OK ]
      TEST: IPv6 route replacement				[ OK ]
      TEST: IPv6 route offload failed				[ OK ]
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9ee53e37
    • mlxsw: spectrum_router: Set offload_failed flag · a4cb1c02
      Amit Cohen authored
      When FIB_EVENT_ENTRY_{REPLACE, APPEND} are triggered and route insertion
      fails, FIB abort is triggered.
      
      After aborting, set the appropriate hardware flag to make the kernel emit
      RTM_NEWROUTE notification with RTM_F_OFFLOAD_FAILED flag.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a4cb1c02
    • netdevsim: fib: Add debugfs to debug route offload failure · 134c7532
      Amit Cohen authored
      Add "fail_route_offload" flag to disallow offloading routes.
      It is needed to test "offload failed" notifications.
      
      Create the flag as part of nsim_fib_create() under fib directory and set
      it to false by default.
      
      When FIB_EVENT_ENTRY_{REPLACE, APPEND} are triggered and
      "fail_route_offload" value is true, set the appropriate hardware flag to
      make the kernel emit RTM_NEWROUTE notification with RTM_F_OFFLOAD_FAILED
      flag.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      134c7532
    • netdevsim: dev: Initialize FIB module after debugfs · f57ab5b7
      Ido Schimmel authored
      Initialize the dummy FIB offload module after debugfs, so that the FIB
      module could create its own directory there.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f57ab5b7
    • netdevsim: fib: Do not warn if route was not found for several events · 484a4dfb
      Amit Cohen authored
      The next patch will add the ability to fail route offload controlled by
      debugfs variable called "fail_route_offload".
      
      If we vetoed the addition, we might get a delete or append notification
      for a route we do not have. Therefore, do not warn if route was not found.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      484a4dfb
    • IPv6: Extend 'fib_notify_on_flag_change' sysctl · 6fad361a
      Amit Cohen authored
      Add the value '2' to 'fib_notify_on_flag_change' to allow sending
      notifications only for failed route installation.
      
      A separate value is added for such notifications because there are fewer
      of them, so they do not impact performance, and some users will find them
      more important.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fad361a
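The effect of the sysctl values can be summarized as a small decision function. The sketch below is a model, not kernel code: the meaning of value 2 comes from the commit above, while the meanings given to 0 (no notifications) and 1 (notify on any hardware flag change) are assumptions about the pre-existing behavior. The function and parameter names are hypothetical.

```python
def should_notify(sysctl_value, offload_failed_changed):
    """Model whether a route flag change produces an RTM_NEWROUTE notification.

    sysctl_value: value of fib_notify_on_flag_change (0, 1 or 2).
    offload_failed_changed: True if the change reports an offload failure.
    """
    if sysctl_value == 0:
        return False  # assumed: notifications on flag change are disabled
    if sysctl_value == 1:
        return True  # assumed: notify on any hardware flag change
    if sysctl_value == 2:
        return offload_failed_changed  # from the commit: failures only
    raise ValueError("unsupported fib_notify_on_flag_change value")
```

With value 2, a daemon that only cares about failures avoids the larger volume of notifications for successful installs, which is the performance argument made in the commit message.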
    • IPv6: Add "offload failed" indication to routes · 0c5fcf9e
      Amit Cohen authored
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel, but not
      necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      To avoid such cases, the previous patch set added the ability to emit
      RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed; this behavior is controlled by a sysctl.
      
      With the above mentioned behavior, it is possible to know from user-space
      if the route was offloaded, but if the offload fails there is no indication
      to user-space. Following a failure, a routing daemon will wait indefinitely
      for a notification that will never come.
      
      This patch adds an "offload_failed" indication to IPv6 routes, so that
      users will have better visibility into the offload process.
      
      'struct fib6_info' is extended with a new field that indicates if route
      offload failed. Note that the new field is added using an unused bit and
      therefore there is no need to increase the struct size.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c5fcf9e
    • IPv4: Extend 'fib_notify_on_flag_change' sysctl · 648106c3
      Amit Cohen authored
      Add the value '2' to 'fib_notify_on_flag_change' to allow sending
      notifications only for failed route installation.
      
      A separate value is added for such notifications because there are fewer
      of them, so they do not impact performance, and some users will find them
      more important.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      648106c3
    • IPv4: Add "offload failed" indication to routes · 36c5100e
      Amit Cohen authored
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel, but not
      necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      To avoid such cases, the previous patch set added the ability to emit
      RTM_NEWROUTE notifications whenever RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed; this behavior is controlled by a sysctl.
      
      With the above mentioned behavior, it is possible to know from user-space
      if the route was offloaded, but if the offload fails there is no indication
      to user-space. Following a failure, a routing daemon will wait indefinitely
      for a notification that will never come.
      
      This patch adds an "offload_failed" indication to IPv4 routes, so that
      users will have better visibility into the offload process.
      
      'struct fib_alias' and 'struct fib_rt_info' are extended with a new field
      that indicates if route offload failed. Note that the new field is added
      using an unused bit and therefore there is no need to increase the struct
      sizes.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36c5100e
    • rtnetlink: Add RTM_F_OFFLOAD_FAILED flag · 49fc2513
      Amit Cohen authored
      The flag indicates to user space that route offload failed.
      
      Previous patch set added the ability to emit RTM_NEWROUTE notifications
      whenever RTM_F_OFFLOAD/RTM_F_TRAP flags are changed, but if the offload
      fails there is no indication to user-space.
      
      The flag will be used in subsequent patches by netdevsim and mlxsw to
      indicate to user space that route offload failed, so that users will
      have better visibility into the offload process.
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      49fc2513
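A routing daemon could act on the new flag roughly as sketched below: withdraw on failure, advertise once the route is in hardware, and otherwise keep waiting. This is a hedged illustration, not daemon or kernel code. The numeric flag values are believed to match include/uapi/linux/rtnetlink.h but should be verified against the headers in use, and treating RTM_F_TRAP as "installed" is an assumption. The function name `route_action` and its return strings are hypothetical.

```python
# Flag values believed to match include/uapi/linux/rtnetlink.h;
# verify against your kernel headers before relying on them.
RTM_F_OFFLOAD = 0x4000
RTM_F_TRAP = 0x8000
RTM_F_OFFLOAD_FAILED = 0x20000000


def route_action(rtm_flags):
    """Map the rtm_flags of an RTM_NEWROUTE notification to a daemon action."""
    if rtm_flags & RTM_F_OFFLOAD_FAILED:
        return "withdraw"  # offload failed: stop waiting, remove the route
    if rtm_flags & (RTM_F_OFFLOAD | RTM_F_TRAP):
        return "advertise"  # assumed: route is now present in hardware
    return "wait"  # no hardware state reported yet
```

Checking the failure flag first means a notification that carries it is never mistaken for a successful install, which is the indefinite-wait case the series sets out to remove.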
    • Merge tag 'mlx5-updates-2021-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 08cbabb7
      David S. Miller authored
      mlx5-updates-2021-02-04
      
      Vlad Buslov says:
      =================
      
      Implement support for VF tunneling
      
      Abstract
      
      Currently, mlx5 only supports configuration with tunnel endpoint IP address on
      uplink representor. Remove implicit and explicit assumptions of tunnel always
      being terminated on uplink and implement necessary infrastructure for
      configuring tunnels on VF representors and updating rules on such tunnels
      according to routing changes.
      
      SW TC model
      
      From TC perspective VF tunnel configuration requires two rules in both
      directions:
      
      TX rules
      
      1. Rule that redirects packets from UL to VF rep that has the tunnel
      endpoint IP address:
      
      $ tc -s filter show dev enp8s0f0 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac 16:c9:a0:2d:69:2c
        src_mac 0c:42:a1:58:ab:e4
        eth_type ipv4
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen
              index 3 ref 1 bind 1 installed 377 sec used 0 sec
              Action statistics:
              Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 114096 bytes 952 pkt
              backlog 0b 0p requeues 0
              cookie 878fa48d8c423fc08c3b6ca599b50a97
              no_percpu
              used_hw_stats delayed
      
      2. Rule that decapsulates the tunneled flow and redirects to destination VF
      representor:
      
      $ tc -s filter show dev vxlan_sys_4789 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac ca:2e:a7:3f:f5:0f
        src_mac 0a:40:bd:30:89:99
        eth_type ipv4
        enc_dst_ip 7.7.7.5
        enc_src_ip 7.7.7.1
        enc_key_id 98
        enc_dst_port 4789
        enc_tos 0
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: tunnel_key  unset pipe
               index 2 ref 1 bind 1 installed 434 sec used 434 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats delayed
      
              action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
              index 4 ref 1 bind 1 installed 434 sec used 0 sec
              Action statistics:
              Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 129936 bytes 1082 pkt
              backlog 0b 0p requeues 0
              cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
              no_percpu
              used_hw_stats delayed
      
      RX rules
      
      1. Rule that encapsulates the tunneled flow and redirects packets from
      source VF rep to tunnel device:
      
      $ tc -s filter show dev enp8s0f0_1 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac 0a:40:bd:30:89:99
        src_mac ca:2e:a7:3f:f5:0f
        eth_type ipv4
        ip_tos 0/0x3
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: tunnel_key  set
              src_ip 7.7.7.5
              dst_ip 7.7.7.1
              key_id 98
              dst_port 4789
              nocsum
              ttl 64 pipe
               index 1 ref 1 bind 1 installed 411 sec used 411 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              no_percpu
              used_hw_stats delayed
      
              action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen
              index 1 ref 1 bind 1 installed 411 sec used 0 sec
              Action statistics:
              Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 5615833 bytes 4028 pkt
              backlog 0b 0p requeues 0
              cookie bb406d45d343bf7ade9690ae80c7cba4
              no_percpu
              used_hw_stats delayed
      
      2. Rule that redirects from tunnel device to UL rep:
      
      $ tc -s filter show dev vxlan_sys_4789 ingress
      filter protocol ip pref 4 flower chain 0
      filter protocol ip pref 4 flower chain 0 handle 0x1
        dst_mac ca:2e:a7:3f:f5:0f
        src_mac 0a:40:bd:30:89:99
        eth_type ipv4
        enc_dst_ip 7.7.7.5
        enc_src_ip 7.7.7.1
        enc_key_id 98
        enc_dst_port 4789
        enc_tos 0
        ip_flags nofrag
        in_hw in_hw_count 1
              action order 1: tunnel_key  unset pipe
               index 2 ref 1 bind 1 installed 434 sec used 434 sec
              Action statistics:
              Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
              backlog 0b 0p requeues 0
              used_hw_stats delayed
      
              action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen
              index 4 ref 1 bind 1 installed 434 sec used 0 sec
              Action statistics:
              Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0)
              Sent software 0 bytes 0 pkt
              Sent hardware 129936 bytes 1082 pkt
              backlog 0b 0p requeues 0
              cookie ac17cf398c4c69e4a5b2f7aabd1b88ff
              no_percpu
              used_hw_stats delayed
      
      HW offloads model
      
      For hardware offload the goal is to match the packet on both rules without
      exposing it to software on the tunnel endpoint VF. In order to achieve this
      for tx, the TC implementation marks encap rules with the tunnel endpoint on
      an mlx5 VF of the same eswitch with the
      MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds a header modification
      rule to overwrite the packet source port to the value of the tunnel VF.
      Eswitch code is modified to recirculate such packets after the source port
      value is changed, which allows the second tx rule to match.
      
      For the rx path, indirect table infrastructure is used to allow fully
      processing VF tunnel traffic in hardware. To implement such a pipeline, the
      driver needs to program the hardware, after matching on the UL rule, to
      overwrite the source vport from UL to tunnel VF and recirculate the packet to
      the root table to allow matching on the rule installed on the tunnel VF. For
      this, the indirect table matches all encapsulated traffic by tunnel parameters,
      and all other IP traffic is sent to the tunnel VF by the miss rule. Such a
      configuration will cause a packet to appear on the VF representor instead of
      the VF itself if the packet has been matched by the indirect table rule based
      on tunnel parameters but missed on the second rule (after recirculation).
      Handle this case by marking packets processed by the indirect table with the
      special 0xFFF value in reg_c1 and extending the slow table with an additional
      flow group that matches on reg_c0 (source port value set by indirect tables)
      and reg_c1 (special 0xFFF mark). When creating offloads fdb tables, install one
      rule per VF vport to match on recirculated miss packets and redirect them to
      the appropriate VF vport.
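
The rx miss handling described above can be modeled in plain C. This is a userspace sketch under simplifying assumptions, not mlx5 driver code: `struct pkt_meta`, `INDIR_TBL_MARK`, `DEST_REPRESENTOR`, and `slow_path_dest()` are hypothetical names, and reg_c0 is treated as holding only the source vport. Only the 0xFFF mark in reg_c1 and the "one rule per VF vport" idea come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Mark written to reg_c1 by the indirect table (value taken from the text). */
#define INDIR_TBL_MARK 0xFFFu
/* Hypothetical sentinel: packet takes the regular slow path to the representor. */
#define DEST_REPRESENTOR (-1)

/* Hypothetical model of the per-packet metadata registers. */
struct pkt_meta {
	uint32_t reg_c0; /* source vport, rewritten to the tunnel VF by the indirect table */
	uint32_t reg_c1; /* INDIR_TBL_MARK if the indirect table processed the packet */
};

/*
 * Model of the additional slow-table flow group: one rule per VF vport that
 * matches (reg_c0 == vport, reg_c1 == mark) and redirects to that vport.
 * Returns the destination vport, or DEST_REPRESENTOR if no rule matches.
 */
static int slow_path_dest(const struct pkt_meta *m,
			  const uint32_t *vf_vports, int n_vfs)
{
	int i;

	assert(m && vf_vports);
	if (m->reg_c1 != INDIR_TBL_MARK)
		return DEST_REPRESENTOR;
	for (i = 0; i < n_vfs; i++)
		if (vf_vports[i] == m->reg_c0)
			return (int)vf_vports[i];
	return DEST_REPRESENTOR;
}
```

In this model, a packet that carries the mark and a known VF source port is steered back to that VF, while unmarked packets keep going to the representor.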
      
      Routing events
      
      In order to support routing changes and migration of tunnel device between
      different endpoint VFs, implement routing infrastructure and update it with FIB
      events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel
      rule is attached to a routing entry, which is shared for rules of same tunnel.
      On FIB event the work is scheduled to delete/recreate all rules of affected
      tunnel.
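
The relationship between routing entries and tunnel rules can be sketched as a small data structure. This is an illustrative userspace model only: `struct route_entry`, `struct tun_rule`, `route_entry_attach()`, and `route_entry_fib_event()` are hypothetical names, and the real series schedules the delete/recreate step as deferred work rather than running it inline.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RULES 8 /* arbitrary limit for this sketch */

/* Hypothetical model of one rx or tx VF tunnel rule. */
struct tun_rule {
	int offloaded;  /* 1 while the rule is installed in (modeled) hardware */
	int generation; /* bumped each time the rule is recreated */
};

/* Hypothetical routing entry, shared by all rules of the same tunnel. */
struct route_entry {
	int tunnel_id;
	struct tun_rule *rules[MAX_RULES];
	size_t n_rules;
};

/* Attach a rule to the shared entry; returns 0 on success, -1 if full. */
static int route_entry_attach(struct route_entry *e, struct tun_rule *r)
{
	assert(e && r);
	if (e->n_rules == MAX_RULES)
		return -1;
	e->rules[e->n_rules++] = r;
	return 0;
}

/* Model of the work run on a FIB event: delete, then recreate, every rule. */
static void route_entry_fib_event(struct route_entry *e)
{
	size_t i;

	assert(e);
	for (i = 0; i < e->n_rules; i++)
		e->rules[i]->offloaded = 0; /* delete */
	for (i = 0; i < e->n_rules; i++) {
		e->rules[i]->offloaded = 1; /* recreate against the new route */
		e->rules[i]->generation++;
	}
}
```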
      
      Note: only vxlan tunnel type is supported by this series.
      
      =================
      08cbabb7
  2. 08 Feb, 2021 9 commits
  3. 07 Feb, 2021 1 commit
    • Jakub Kicinski's avatar
      Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · badc6ac3
      Jakub Kicinski authored
      Tony Nguyen says:
      
      ====================
      100GbE Intel Wired LAN Driver Updates 2021-02-05
      
      This series contains updates to ice driver only.
      
      Jake adds reporting of timeout length during devlink flash and
      implements support to report devlink info regarding the version of
      firmware that is stored (downloaded) to the device, but is not yet active.
      ice_devlink_info_get will report "stored" versions when there is no
      pending flash update. Version info includes the UNDI Option ROM, the
      Netlist module, and the fw.bundle_id.
      
      Gustavo A. R. Silva replaces a one-element array with a flexible-array
      member.
      
      Bruce utilizes flex_array_size() helper and removes dead code on a check
      for a condition that can't occur.
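
As a rough illustration of the flexible-array change, here is a userspace sketch. `struct fw_msg`, `FLEX_BYTES()`, `fw_msg_alloc()`, and `fw_msg_payload_bytes()` are hypothetical and are not taken from the ice driver; `FLEX_BYTES()` only mimics the size arithmetic of the kernel's `flex_array_size()` and, unlike the kernel helper, does not saturate on overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message: a header followed by a variable number of entries. */
struct fw_msg {
	uint16_t count;
	uint32_t entries[]; /* flexible-array member (old style: entries[1]) */
};

/* Userspace stand-in for the size math: n elements of p->member. */
#define FLEX_BYTES(p, member, n) ((size_t)(n) * sizeof((p)->member[0]))

/* Allocate a zeroed message with room for 'count' entries. */
static struct fw_msg *fw_msg_alloc(uint16_t count)
{
	struct fw_msg *m = NULL;

	/* sizeof(*m) no longer includes a phantom first element. */
	m = malloc(sizeof(*m) + FLEX_BYTES(m, entries, count));
	if (!m)
		return NULL;
	m->count = count;
	memset(m->entries, 0, FLEX_BYTES(m, entries, count));
	return m;
}

/* Size in bytes of the variable part of an existing message. */
static size_t fw_msg_payload_bytes(const struct fw_msg *m)
{
	assert(m);
	return FLEX_BYTES(m, entries, m->count);
}
```

With a one-element array, the allocation would have needed a `count - 1` adjustment; the flexible-array form removes that source of off-by-one errors.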
      
      v2:
      * removed security revision implementation, and re-ordered patches to
      account for this removal
      * squashed patches implementing ice_read_flash_module to avoid patches
      refactoring the implementation of a previous patch in the series
      * modify ice_devlink_info_get to always report "stored" versions instead
      of only reporting them when a pending flash update is ready.
      
      * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
        ice: remove dead code
        ice: use flex_array_size where possible
        ice: Replace one-element array with flexible-array member
        ice: display stored UNDI firmware version via devlink info
        ice: display stored netlist versions via devlink info
        ice: display some stored NVM versions via devlink info
        ice: introduce function for reading from flash modules
        ice: cache NVM module bank information
        ice: introduce context struct for info report
        ice: create flash_info structure and separate NVM version
        ice: report timeout length for erasing during devlink flash
      ====================
      
      Link: https://lore.kernel.org/r/20210206044101.636242-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      badc6ac3
  4. 06 Feb, 2021 6 commits
    • Jakub Kicinski's avatar
      Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next · c273a20c
      Jakub Kicinski authored
      Pablo Neira Ayuso says:
      
      ====================
      Netfilter/IPVS updates for net-next
      
      1) Remove indirection and use nf_ct_get() instead from nfnetlink_log
         and nfnetlink_queue, from Florian Westphal.
      
      2) Add weighted random twos choice least-connection scheduling for IPVS,
         from Darby Payne.
      
      3) Add a __hash placeholder in the flow tuple structure to identify
         the field to be included in the rhashtable key hash calculation.
      
      4) Add a new nft_parse_register_load() and nft_parse_register_store()
         to consolidate register load and store in the core.
      
      5) Statify nft_parse_register() since it has no more module clients.
      
      6) Remove redundant assignment in nft_cmp, from Colin Ian King.
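
Item 2 refers to a "two random choices" style scheduler. The sketch below shows the general idea only and is not the IPVS implementation: `struct server`, `pick_weighted()`, and `twos_choice()` are hypothetical, and the random numbers are passed in by the caller so the behavior is deterministic. Two servers are drawn with probability proportional to weight, and the one with the lower connections-to-weight ratio wins.

```c
#include <assert.h>

/* Hypothetical server descriptor for this sketch. */
struct server {
	unsigned int weight; /* relative capacity; 0 means never picked */
	unsigned int conns;  /* current number of connections */
};

/*
 * Map r (expected in [0, sum of weights)) to a server index by walking the
 * cumulative weights. Out-of-range values are clamped to the last server.
 */
static int pick_weighted(const struct server *s, int n, unsigned int r)
{
	int i;

	assert(s && n > 0);
	for (i = 0; i < n; i++) {
		if (r < s[i].weight)
			return i;
		r -= s[i].weight;
	}
	return n - 1;
}

/* Draw two candidates and keep the one with the lower conns/weight ratio. */
static int twos_choice(const struct server *s, int n,
		       unsigned int r1, unsigned int r2)
{
	int a = pick_weighted(s, n, r1);
	int b = pick_weighted(s, n, r2);
	/* Cross-multiply to compare conns/weight without dividing. */
	unsigned long long load_a = (unsigned long long)s[a].conns * s[b].weight;
	unsigned long long load_b = (unsigned long long)s[b].conns * s[a].weight;

	return load_a <= load_b ? a : b;
}
```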
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next:
        netfilter: nftables: remove redundant assignment of variable err
        netfilter: nftables: statify nft_parse_register()
        netfilter: nftables: add nft_parse_register_store() and use it
        netfilter: nftables: add nft_parse_register_load() and use it
        netfilter: flowtable: add hash offset field to tuple
        ipvs: add weighted random twos choice algorithm
        netfilter: ctnetlink: remove get_ct indirection
      ====================
      
      Link: https://lore.kernel.org/r/20210206015005.23037-1-pablo@netfilter.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c273a20c
    • Heiner Kallweit's avatar
      r8169: don't try to disable interrupts if NAPI is scheduled already · 7274c414
      Heiner Kallweit authored
      There's no benefit in trying to disable interrupts if NAPI is
      scheduled already. This allows us to save a PCI write in this case.
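
The effect of this change can be modeled in userspace. The names below (`struct nic_model`, `model_napi_schedule_prep()`, `model_irq_disable()`, `model_interrupt()`) are hypothetical stand-ins, not the r8169 code; the point is only that the interrupt-mask write is skipped when NAPI is already scheduled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model for this sketch. */
struct nic_model {
	bool napi_scheduled; /* models the NAPI "scheduled" state */
	bool irq_enabled;    /* models the device interrupt mask */
	int pci_writes;      /* counts modeled writes to the mask register */
};

/* Returns true only if NAPI was not scheduled yet and is now claimed. */
static bool model_napi_schedule_prep(struct nic_model *n)
{
	if (n->napi_scheduled)
		return false;
	n->napi_scheduled = true;
	return true;
}

/* Masking interrupts costs one (modeled) PCI write. */
static void model_irq_disable(struct nic_model *n)
{
	n->irq_enabled = false;
	n->pci_writes++;
}

/* Interrupt handler: only touch the mask register if we scheduled NAPI. */
static void model_interrupt(struct nic_model *n)
{
	assert(n);
	if (model_napi_schedule_prep(n))
		model_irq_disable(n);
}
```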
      Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
      Link: https://lore.kernel.org/r/78c7f2fb-9772-1015-8c1d-632cbdff253f@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7274c414
    • Xie He's avatar
      net/packet: Improve the comment about LL header visibility criteria · 21c85974
      Xie He authored
      The "dev_has_header" function, recently added in
      commit d5496990 ("net/packet: fix packet receive on L3 devices
      without visible hard header"),
      is more accurate as criteria for determining whether a device exposes
      the LL header to upper layers, because in addition to dev->header_ops,
      it also checks for dev->header_ops->create.
      
      When transmitting an skb on a device, dev_hard_header can be called to
      generate an LL header. dev_hard_header will only generate a header if
      dev->header_ops->create is present.
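
The criterion described above amounts to a two-part check. The following is a userspace model with hypothetical types (`struct model_header_ops`, `struct model_net_device`) and a hypothetical `example_create()` callback, not the kernel definitions; it only mirrors the "header_ops is set and has a create callback" logic stated in the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct model_header_ops {
	int (*create)(void *buf, unsigned int len);
};

struct model_net_device {
	const struct model_header_ops *header_ops;
};

/* Example create callback; pretends to build a header of 'len' bytes. */
static int example_create(void *buf, unsigned int len)
{
	(void)buf;
	return (int)len;
}

/* A device exposes an LL header only if it can also create one. */
static bool model_dev_has_header(const struct model_net_device *dev)
{
	assert(dev);
	return dev->header_ops && dev->header_ops->create;
}
```

A device that sets header_ops without a create callback is therefore treated the same as one with no header_ops at all.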
      Signed-off-by: Xie He <xie.he.0141@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Link: https://lore.kernel.org/r/20210205224124.21345-1-xie.he.0141@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      21c85974
    • Jakub Kicinski's avatar
      Merge branch 'net-ipa-a-mix-of-small-improvements' · 163a1802
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: a mix of small improvements
      
      Version 2 of this series restructures a couple of the changed
      functions (in patches 1 and 2) to avoid blocks of indented code
      by returning early when possible, as suggested by Jakub.  The
      description of the first patch was changed as a result, to better
      reflect what the updated patch does.  It also fixes one spot I
      identified when updating the code, where gsi_channel_stop() was
      doing the wrong thing on error.
      
      The original description for this series is below.
      
      This series contains a sort of unrelated set of code cleanups.
      
      The first two are things I wanted to do in a series that updated
      some NAPI code recently.  I didn't want to change things in a way
      that affected existing testing so I set these aside for later
      (i.e., now).
      
      The third makes a change to event ring handling that's similar to
      what was done a while back for channels.  There's little benefit to
      caching the current state of an event ring, so with this we'll just
      fetch the state from hardware whenever we need it.
      
      The fourth patch removes the definitions of two unused symbols.
      
      The fifth replaces a count that is always 0 or 1 with a Boolean.
      
      The sixth removes a build-time validation check that doesn't really
      provide benefit.
      
      And the last one fixes a problem (in two spots) that could cause a
      build-time check to fail "bogusly".
      ====================
      
      Link: https://lore.kernel.org/r/20210205221100.1738-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      163a1802
    • Alex Elder's avatar
      net: ipa: avoid field overflow · cd115009
      Alex Elder authored
      It's possible that the length passed to ipa_header_size_encoded()
      is larger than what can be represented by the HDR_LEN field alone
      (starting with IPA v4.5).  If we attempted that, u32_encode_bits()
      would trigger a build-time error.
      
      Avoid this problem by masking off high-order bits of the value
      encoded as the lower portion of the header length.
      
      The same sort of problem exists in ipa_metadata_offset_encoded(),
      so implement the same fix there.
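
The masking fix can be illustrated with a split field. The mask values and the `encode_len()` helper below are assumptions chosen for the example, not the IPA register layout: a low part occupies bits 5..0 and a high part occupies bits 29..28. Masking the value before placing it in the low field keeps high-order bits from spilling outside that field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field layout for this sketch. */
#define LEN_LO_MASK  0x0000003Fu /* bits 5..0: low part of the length */
#define LEN_LO_BITS  6
#define LEN_HI_MASK  0x30000000u /* bits 29..28: high part of the length */
#define LEN_HI_SHIFT 28

/* Encode a length that may be wider than the low field alone. */
static uint32_t encode_len(uint32_t len)
{
	uint32_t lo = len & LEN_LO_MASK; /* mask off high-order bits first */
	uint32_t hi = len >> LEN_LO_BITS; /* remaining bits go to the high field */

	assert(hi <= (LEN_HI_MASK >> LEN_HI_SHIFT));
	return lo | (hi << LEN_HI_SHIFT);
}
```

For instance, a length of 69 (0x45) does not fit in six bits; the low field receives 5 and the high field receives 1.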
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      cd115009
    • Alex Elder's avatar
      net: ipa: get rid of status size constraint · 48735374
      Alex Elder authored
      There is a build-time check that the packet status structure is a
      multiple of 4 bytes in size.  It's not clear where that constraint
      comes from, but the structure defines what hardware provides so its
      definition won't change.  Get rid of the check; it adds no value.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      48735374