1. 26 Jan, 2021 14 commits
  2. 25 Jan, 2021 1 commit
  3. 24 Jan, 2021 6 commits
  4. 23 Jan, 2021 19 commits
    • Merge branch 'remove-unneeded-phy-time-stamping-option' · 692347a9
      Jakub Kicinski authored
      Richard Cochran says:
      
      ====================
      Remove unneeded PHY time stamping option.
      
      The NETWORK_PHY_TIMESTAMPING configuration option adds additional
      checks into the networking hot path, and it is only needed by two
      rather esoteric devices, namely the TI DP83640 PHYTER and the ZHAW
      InES 1588 IP core.  Very few end users have these devices, and those
      that do have them are building specialized embedded systems.
      
      Unfortunately two unrelated drivers depend on this option, and two
      defconfigs enable it.  It is probably my fault for not paying enough
      attention in reviews.
      
      This series corrects the gratuitous use of NETWORK_PHY_TIMESTAMPING.
      ====================
      
      Link: https://lore.kernel.org/r/cover.1611198584.git.richardcochran@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mvpp2: Remove unneeded Kconfig dependency. · 04cbb740
      Richard Cochran authored
      The mvpp2 is an Ethernet driver, and it implements MAC style time
      stamping of PTP frames.  It has no need of the expensive option to
      enable PHY time stamping.  Remove the incorrect dependency.
      Signed-off-by: Richard Cochran <richardcochran@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: mv88e6xxx: Remove bogus Kconfig dependency. · 57ba0077
      Richard Cochran authored
      The mv88e6xxx is a DSA driver, and it implements DSA style time
      stamping of PTP frames.  It has no need of the expensive option to
      enable PHY time stamping.  Remove the bogus dependency.
      Signed-off-by: Richard Cochran <richardcochran@gmail.com>
      Acked-by: Brandon Streiff <brandon.streiff@ni.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'net-ipa-napi-poll-updates' · e7b76db3
      Jakub Kicinski authored
      Alex Elder says:
      
      ====================
      net: ipa: NAPI poll updates
      
      While reviewing the IPA NAPI polling code in detail I found two
      problems.  This series fixes those, and implements a few other
      improvements to this part of the code.
      
      The first two patches are minor bug fixes that avoid extra passes
      through the poll function.  The third simplifies code inside the
      polling loop a bit.
      
      The last two update how interrupts are disabled; previously it was
      possible for another I/O completion condition to be recorded before
      NAPI got scheduled.
      ====================
      
      Link: https://lore.kernel.org/r/20210121114821.26495-1-elder@linaro.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: disable IEOB interrupts before clearing · 7bd9785f
      Alex Elder authored
      Currently in gsi_isr_ieob(), event ring IEOB interrupts are disabled
      one at a time.  The loop disables the IEOB interrupt for all event
      rings represented in the event mask.  Instead, just disable them all
      at once.
      
      Disable them all *before* clearing the interrupt condition.  This
      guarantees we'll schedule NAPI for each event once, before another
      IEOB interrupt could be signaled.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
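The ordering described in the 7bd9785f message (disable every signaled event ring with one mask update, then clear the interrupt condition, then schedule NAPI) can be sketched in user space as follows. This is only a model: `struct fake_gsi` and the `fake_*` functions are hypothetical stand-ins for the driver's memory-mapped registers and for napi_schedule(), not the driver's actual code.

```c
#include <stdint.h>

/* Hypothetical register model; the real driver uses memory-mapped I/O. */
struct fake_gsi {
	uint32_t ieob_pending;		/* interrupt condition bits */
	uint32_t ieob_enabled;		/* interrupt enable mask */
	uint32_t napi_scheduled;	/* one bit per event ring handed to NAPI */
};

/* Disable every event ring named in the mask with a single update. */
static void fake_ieob_disable(struct fake_gsi *gsi, uint32_t event_mask)
{
	gsi->ieob_enabled &= ~event_mask;
}

static void fake_isr_ieob(struct fake_gsi *gsi)
{
	uint32_t event_mask = gsi->ieob_pending;

	fake_ieob_disable(gsi, event_mask);	/* disable first ... */
	gsi->ieob_pending &= ~event_mask;	/* ... then clear the condition */

	while (event_mask) {
		/* Isolate the lowest set bit of the mask. */
		uint32_t bit = event_mask & (~event_mask + 1u);

		event_mask ^= bit;
		gsi->napi_scheduled |= bit;	/* stands in for napi_schedule() */
	}
}
```

Because the rings are already masked when the condition is cleared, no new IEOB interrupt for those rings can be signaled before NAPI has been scheduled for each of them.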
    • net: ipa: repurpose gsi_irq_ieob_disable() · 5725593e
      Alex Elder authored
      Rename gsi_irq_ieob_disable() to be gsi_irq_ieob_disable_one().
      
      Introduce a new function gsi_irq_ieob_disable() that takes a mask of
      events to disable rather than a single event id.  This will be used
      in the next patch.
      
      Rename gsi_irq_ieob_enable() to be gsi_irq_ieob_enable_one() to be
      consistent.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: have gsi_channel_update() return a value · 223f5b34
      Alex Elder authored
      Have gsi_channel_update() return the first transaction in the
      updated completed transaction list, or NULL if no new transactions
      have been added.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: ipa: heed napi_complete() return value · 148604e7
      Alex Elder authored
      Pay attention to the return value of napi_complete(), completing
      polling only if it returns true.
      
      Just use napi rather than &channel->napi as the argument passed to
      napi_complete().
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
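The idea in the 148604e7 message, treating polling as finished only when napi_complete() returns true, can be illustrated with a small user-space model. `struct fake_napi`, `fake_napi_complete()` and `fake_poll()` are hypothetical names invented for this sketch; the model assumes that a false return means NAPI was rescheduled while the poll was running.

```c
#include <stdbool.h>

/* Hypothetical model of the NAPI state relevant here. */
struct fake_napi {
	bool rescheduled;	/* more work arrived while polling */
	bool irq_enabled;	/* device interrupt re-armed */
};

/* Stand-in for napi_complete(): false means polling is not really done. */
static bool fake_napi_complete(struct fake_napi *napi)
{
	if (napi->rescheduled) {
		napi->rescheduled = false;
		return false;
	}
	return true;
}

/* done is the number of transactions processed in this poll pass. */
static int fake_poll(struct fake_napi *napi, int done, int budget)
{
	/* Re-arm the interrupt only when polling truly completed. */
	if (done < budget && fake_napi_complete(napi))
		napi->irq_enabled = true;

	return done;
}
```

If the return value were ignored, the interrupt would be re-enabled even though another poll pass is already pending.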
    • net: ipa: count actual work done in gsi_channel_poll() · c80c4a1e
      Alex Elder authored
      There is an off-by-one problem in gsi_channel_poll().  The count of
      transactions completed is incremented each time through the loop
      *before* determining whether there is any more work to do.  As a
      result, if we exit the loop early, the counter's value is one more
      than the number of transactions actually processed.
      
      Instead, increment the count after processing, to ensure it reflects
      the number of processed transactions.  The result is more naturally
      described as a for loop rather than a while loop, so change that.
      Signed-off-by: Alex Elder <elder@linaro.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
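The off-by-one described in the c80c4a1e message can be reproduced with a minimal sketch. The `struct trans_queue` type and the `next_trans()` helper are hypothetical stand-ins for the channel's completed-transaction list; only the shape of the two loops follows the commit message.

```c
/* Hypothetical stand-in for a list of completed transactions. */
struct trans_queue {
	int pending;	/* completed transactions not yet processed */
};

/* Consume one transaction; return 0 when nothing is left. */
static int next_trans(struct trans_queue *q)
{
	if (q->pending == 0)
		return 0;
	q->pending--;
	return 1;
}

/* Old shape: the count is bumped before we know there is work to do. */
static int poll_count_before(struct trans_queue *q, int budget)
{
	int count = 0;

	while (count < budget) {
		count++;
		if (!next_trans(q))
			break;
	}
	return count;
}

/* New shape: the count is bumped only after a transaction is processed. */
static int poll_count_after(struct trans_queue *q, int budget)
{
	int count;

	for (count = 0; count < budget; count++) {
		if (!next_trans(q))
			break;
	}
	return count;
}
```

With three pending transactions and a budget of eight, the old loop reports four while the new loop reports three. Both report eight when the budget is exhausted, so the difference only shows when the loop exits early.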
    • Merge branch 'mlxsw-expose-number-of-physical-ports' · 59a49d96
      Jakub Kicinski authored
      Ido Schimmel says:
      
      ====================
      mlxsw: Expose number of physical ports
      
      The switch ASIC has a limited capacity of physical ports that it can
      support. While each system is brought up with a different number of
      ports, this number can be increased via splitting up to the ASIC's
      limit.
      
      Expose physical ports as a devlink resource so that user space will have
      visibility into the maximum number of ports that can be supported and
      the current occupancy. With this resource it is possible, for example,
      to write generic (i.e., not platform dependent) tests for port
      splitting.
      
      Patch #1 adds the new resource and patch #2 adds a selftest.
      
      v2:
      * Add the physical ports resource as a generic devlink resource so that
        it could be re-used by other device drivers
      ====================
      
      Link: https://lore.kernel.org/r/20210121131024.2656154-1-idosch@idosch.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • selftests: mlxsw: Add a scale test for physical ports · 5154b1b8
      Danielle Ratson authored
      Query the maximum number of supported physical ports using devlink-resource
      and test that this number can be reached by splitting each of the
      splittable ports to its width. Test that an error is returned in case
      the maximum number is exceeded.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: Register physical ports as a devlink resource · 321f7ab0
      Danielle Ratson authored
      The switch ASIC has a limited capacity of physical ('flavour physical'
      in devlink terminology) ports that it can support. While each system is
      brought up with a different number of ports, this number can be
      increased via splitting up to the ASIC's limit.
      
      Expose physical ports as a devlink resource so that user space will have
      visibility into the maximum number of ports that can be supported and the
      current occupancy.
      
      In addition, add a "Generic Resources" section in devlink-resource
      documentation so the different drivers will be aligned by the same resource
      name when exposing to user space.
      Signed-off-by: Danielle Ratson <danieller@nvidia.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge branch 'htb-offload' · 35187642
      Jakub Kicinski authored
      Maxim Mikityanskiy says:
      
      ====================
      HTB offload
      
      This series adds support for HTB offload to the HTB qdisc and uses it
      in the mlx5 driver.
      
      The previous RFCs are available at [1], [2].
      
      The feature is intended to solve the performance bottleneck caused by
      the single lock of the HTB qdisc, which prevents it from scaling well.
      The HTB algorithm itself is offloaded to the device, eliminating the
      need to take the root lock of HTB on every packet. Classification part
      is done in clsact (still in software) to avoid acquiring the lock, which
      imposes a limitation that filters can target only leaf classes.
      
      The speedup on Mellanox ConnectX-6 Dx was 14.2 times in the UDP
      multi-stream test, compared to software HTB implementation (more details
      in the mlx5 patch).
      
      [1]: https://www.spinics.net/lists/netdev/msg628422.html
      [2]: https://www.spinics.net/lists/netdev/msg663548.html
      
      v2 changes:
      
      Fixed sparse and smatch warnings. Formatted HTB patches to 80 chars per
      line.
      
      v3 changes:
      
      Fixed the CI failure on parisc with 16-bit xchg by replacing it with
      WRITE_ONCE. Fixed the capability bits in mlx5_ifc.h and the value of
      MLX5E_QOS_MAX_LEAF_NODES.
      
      v4 changes:
      
      Check if HTB is root when offloading. Add extack for hardware errors.
      Rephrase explanations of how it works in the commit message. Remove %hu
      from format strings. Add resiliency when leaf_del_last fails to create a
      new leaf node.
      ====================
      
      Link: https://lore.kernel.org/r/20210119120815.463334-1-maximmi@mellanox.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/mlx5e: Support HTB offload · 214baf22
      Maxim Mikityanskiy authored
      This commit adds support for HTB offload in the mlx5e driver.
      
      Performance:
      
        NIC: Mellanox ConnectX-6 Dx
        CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with HT)
      
        100 Gbit/s line rate, 500 UDP streams @ ~200 Mbit/s each
        48 traffic classes, flower used for steering
        No shaping (rate limits set to 4 Gbit/s per TC) - checking for max
        throughput.
      
        Baseline: 98.7 Gbps, 8.25 Mpps
        HTB: 6.7 Gbps, 0.56 Mpps
        HTB offload: 95.6 Gbps, 8.00 Mpps
      
      Limitations:
      
      1. 256 leaf nodes, 3 levels of depth.
      
      2. Granularity for ceil is 1 Mbit/s. Rates are converted to weights, and
      the bandwidth is split among the siblings according to these weights.
      Other parameters for classes are not supported.
      
      Ethtool statistics support for QoS SQs is also added. The counters are
      called qos_txN_*, where N is the QoS queue number (starting from 0, the
      numeration is separate from the normal SQs), and * is the counter name
      (the counters are the same as for the normal SQs).
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
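Limitation 2 in the 214baf22 message says that rates are converted to weights and that bandwidth is split among siblings according to those weights. A plain proportional split, which is one simple way to read that statement, is sketched below. The function name `split_by_weight()` and the integer arithmetic are assumptions made for illustration; the driver's real conversion is not shown in the source and may differ.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Split parent_mbps among n sibling classes in proportion to their
 * configured rates. Illustrative only, not the mlx5e implementation.
 */
static void split_by_weight(uint64_t parent_mbps, const uint64_t *rate_mbps,
			    size_t n, uint64_t *share_mbps)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += rate_mbps[i];

	/* Each sibling's weight is its rate relative to the sum of rates. */
	for (i = 0; i < n; i++)
		share_mbps[i] = total ? parent_mbps * rate_mbps[i] / total : 0;
}
```

For example, two siblings configured with rates of 100 and 300 Mbit/s under a 1000 Mbit/s parent would receive 250 and 750 Mbit/s under this model.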
    • sch_htb: Stats for offloaded HTB · 83271586
      Maxim Mikityanskiy authored
      This commit adds support for statistics of offloaded HTB. Byte and
      packet counters for leaf and inner nodes are supported. The values are
      taken from the per-queue qdiscs, and the numbers that the user sees should
      behave the same as with software (non-offloaded) HTB.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • sch_htb: Hierarchical QoS hardware offload · d03b195b
      Maxim Mikityanskiy authored
      HTB doesn't scale well because of contention on a single lock, and it
      also consumes CPU. This patch adds support for offloading HTB to
      hardware that supports hierarchical rate limiting.
      
      In the offload mode, HTB passes control commands to the driver using
      ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
      and their settings (rate, ceil) in the NIC. Every modification of the
      HTB tree caused by the admin results in ndo_setup_tc being called.
      
      After this setup, the HTB algorithm is done completely in the NIC. An SQ
      (send queue) is created for every leaf class and attached to the
      hierarchy, so that the NIC can calculate and obey aggregated rate
      limits, too. In the future, it can be changed, so that multiple SQs will
      back a single leaf class.
      
      ndo_select_queue is responsible for selecting the right queue that
      serves the traffic class of each packet.
      
      The data path works as follows: a packet is classified by clsact, the
      driver selects a hardware queue according to its class, and the packet
      is enqueued into this queue's qdisc.
      
      This solution addresses two main problems of scaling HTB:
      
      1. Contention by flow classification. Currently the filters are attached
      to the HTB instance as follows:
      
          # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80 \
              classid 1:10
      
      It's possible to move classification to clsact egress hook, which is
      thread-safe and lock-free:
      
          # tc filter add dev eth0 egress protocol ip flower dst_port 80 \
              action skbedit priority 1:10
      
      This way classification still happens in software, but the lock
      contention is eliminated, and it happens before selecting the TX queue,
      allowing the driver to translate the class to the corresponding hardware
      queue in ndo_select_queue.
      
      Note that this is already compatible with non-offloaded HTB and requires
      no changes to the kernel or iproute2.
      
      2. Contention by handling packets. HTB is not multi-queue, it attaches
      to a whole net device, and handling of all packets takes the same lock.
      When HTB is offloaded, it registers itself as a multi-queue qdisc,
      similarly to mq: HTB is attached to the netdev, and each queue has its
      own qdisc.
      
      Some features of HTB may not be supported by particular hardware: for
      example, the maximum number of classes may be limited, or the
      granularity of the rate and ceil parameters may differ. The offload is
      therefore not enabled by default; a new parameter enables it:
      
          # tc qdisc replace dev eth0 root handle 1: htb offload
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
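The d03b195b message says that clsact stores the class in the packet priority (for example `skbedit priority 1:10`) and that ndo_select_queue then translates the class to a hardware queue. A minimal model of that lookup is sketched below. The `H_*` macros mirror the layout of tc handles (major number in the upper 16 bits, minor in the lower 16), while `struct leaf_map`, `select_queue()` and the fallback to a default queue are hypothetical and not taken from any driver.

```c
#include <stdint.h>

/* tc handle layout: major in the high 16 bits, minor in the low 16 bits. */
#define H_MAKE(maj, min) (((uint32_t)(maj) << 16) | ((uint32_t)(min) & 0xffffu))
#define H_MAJ(h)         ((uint32_t)(h) >> 16)

#define MAX_LEAVES 4

/* Hypothetical driver table: which hardware queue backs each leaf class. */
struct leaf_map {
	uint32_t classid[MAX_LEAVES];
	int txq[MAX_LEAVES];
	int n;
};

/*
 * Return the queue backing the class carried in the packet priority,
 * or default_txq when the priority does not name a known leaf class.
 */
static int select_queue(const struct leaf_map *map, uint32_t htb_major,
			uint32_t priority, int default_txq)
{
	int i;

	if (H_MAJ(priority) != htb_major)
		return default_txq;

	for (i = 0; i < map->n; i++)
		if (map->classid[i] == priority)
			return map->txq[i];

	return default_txq;
}
```

Note that tc parses class IDs as hexadecimal, so `1:10` corresponds to `H_MAKE(1, 0x10)` in this sketch.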
    • net: sched: Add extack to Qdisc_class_ops.delete · 4dd78a73
      Maxim Mikityanskiy authored
      In a following commit, sch_htb will start using extack in the delete
      class operation to pass hardware errors in offload mode. This commit
      prepares for that by adding the extack parameter to this callback and
      converting usage of the existing qdiscs.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: sched: Add multi-queue support to sch_tree_lock · ca1e4ab1
      Maxim Mikityanskiy authored
      The existing qdiscs that set TCQ_F_MQROOT don't use sch_tree_lock.
      However, hardware-offloaded HTB will start setting this flag while also
      using sch_tree_lock.
      
      The current implementation of sch_tree_lock basically locks on
      qdisc->dev_queue->qdisc, and it works fine when the tree is attached to
      some queue. However, it's not the case for MQROOT qdiscs: such a qdisc
      is the root itself, and its dev_queue just points to queue 0, while not
      actually being used, because there are real per-queue qdiscs.
      
      This patch changes the logic of sch_tree_lock and sch_tree_unlock to
      lock the qdisc itself if it's the MQROOT.
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
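The lock selection described in the ca1e4ab1 message can be modeled in user space as follows. The structures are heavily simplified and every `fake_*` / `FAKE_*` name is a hypothetical stand-in; in particular, a boolean replaces the real spinlock. Only the decision itself (lock the qdisc itself when it is the MQROOT, otherwise lock the qdisc attached to its queue) follows the commit message.

```c
#include <stdbool.h>

#define FAKE_TCQ_F_MQROOT 0x1u	/* stand-in for the kernel's TCQ_F_MQROOT */

struct fake_qdisc;

struct fake_queue {
	struct fake_qdisc *qdisc;	/* qdisc attached to this TX queue */
};

struct fake_qdisc {
	unsigned int flags;
	struct fake_queue *dev_queue;
	bool locked;			/* stands in for the qdisc spinlock */
};

/* Pick the qdisc whose lock protects the tree that q belongs to. */
static struct fake_qdisc *tree_lock_target(struct fake_qdisc *q)
{
	if (q->flags & FAKE_TCQ_F_MQROOT)
		return q;		/* an MQROOT qdisc is the root itself */

	return q->dev_queue->qdisc;	/* otherwise use the qdisc on its queue */
}

static void fake_sch_tree_lock(struct fake_qdisc *q)
{
	tree_lock_target(q)->locked = true;
}

static void fake_sch_tree_unlock(struct fake_qdisc *q)
{
	tree_lock_target(q)->locked = false;
}
```

Without the MQROOT check, locking the root would take the lock of whichever per-queue qdisc sits on queue 0, which is not the lock that protects the root's class tree.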
    • Merge branch 'tcp-add-cmsg-rx-timestamps-to-rx-zerocopy' · 04a88637
      Jakub Kicinski authored
      Arjun Roy says:
      
      ====================
      tcp: add CMSG+rx timestamps to rx. zerocopy
      
      Provide CMSG and receive timestamp support to TCP
      receive zerocopy. Patch 1 refactors CMSG pending state for
      tcp_recvmsg() to avoid the use of magic numbers; patch 2 implements
      receive timestamp via CMSG support for receive zerocopy, and uses the
      constants added in patch 1.
      ====================
      
      Link: https://lore.kernel.org/r/20210121004148.2340206-1-arjunroy.kdev@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>