1. 06 Feb, 2023 40 commits
    • net/sched: taprio: pass mqprio queue configuration to ndo_setup_tc() · 09c794c0
      Vladimir Oltean authored
      The taprio qdisc does not currently pass the mqprio queue configuration
      down to the offloading device driver. So the driver cannot act upon the
      TXQ counts/offsets per TC, or upon the prio->tc map. It was probably
      assumed that the driver only wants to offload num_tc (see
      TC_MQPRIO_HW_OFFLOAD_TCS), which it can get from netdev_get_num_tc(),
      but there's clearly more to the mqprio configuration than that.
      
      I've considered 2 mechanisms to remedy that. First is to pass a struct
      tc_mqprio_qopt_offload as part of the tc_taprio_qopt_offload. The second
      is to make taprio actually call TC_SETUP_QDISC_MQPRIO, *in addition to*
      TC_SETUP_QDISC_TAPRIO.
      
      The difference is that in the first case, existing drivers (offloading
      or not) all ignore taprio's mqprio portion currently, whereas in the
      second case, we could control whether to call TC_SETUP_QDISC_MQPRIO,
      based on a new capability. The question is which approach would be
      better.
      
      I'm afraid that calling TC_SETUP_QDISC_MQPRIO unconditionally (not based
      on a taprio capability bit) would risk introducing regressions. For
      example, taprio doesn't populate (or validate) qopt->hw, as well as
      mqprio.flags, mqprio.shaper, mqprio.min_rate, mqprio.max_rate.
      
      In comparison, adding a capability is functionally equivalent to just
      passing the mqprio in a way that drivers can ignore it, except it's
      slightly more complicated to use it (need to set the capability).
      
      Ultimately, what made me go for the "mqprio in taprio" variant was that
      it's easier for offloading drivers to interpret the mqprio qopt slightly
      differently when it comes from taprio vs when it comes from mqprio,
      should that ever become necessary.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: refactor mqprio qopt reconstruction to a library function · 9dd6ad67
      Vladimir Oltean authored
      The taprio qdisc will need to reconstruct a struct tc_mqprio_qopt from
      netdev settings once more in a future patch, but this code was already
      written twice, once in taprio and once in mqprio.
      
      Refactor the code to a helper in the common mqprio library.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: taprio: centralize mqprio qopt validation · 1dfe086d
      Vladimir Oltean authored
      There is a lot of code in taprio which is "borrowed" from mqprio.
      It makes sense to put a stop to the "borrowing" and start actually
      reusing code.
      
      Because taprio and mqprio are built as part of different kernel modules,
      code reuse can only take place either by writing it as static inline
      (limiting), putting it in sch_generic.o (not generic enough), or
      creating a third auto-selectable kernel module which only holds library
      code. I opted for the third variant.
      
      In a previous change, mqprio gained support for reverse TC:TXQ mappings,
      something which taprio still denies. Make taprio use the same validation
      logic so that it supports this configuration as well.
      
      The taprio code didn't reject overlapping TXQ ranges in txtime-assist
      mode, and that looks intentional, even if I've no idea why that might be.
      Preserve that, but add a comment.
      
      There isn't any dedicated MAINTAINERS entry for mqprio, so nothing to
      update there.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: mqprio: add extack messages for queue count validation · d404959f
      Vladimir Oltean authored
      To make mqprio more user-friendly, create netlink extended ack messages
      which say exactly what is wrong about the queue counts. This uses the
      new support for printf-formatted extack messages.
      
      Example:
      
      $ tc qdisc add dev eno0 root handle 1: mqprio num_tc 8 \
      	map 0 1 2 3 4 5 6 7 queues 3@0 1@1 1@2 1@3 1@4 1@5 1@6 1@7 hw 0
      Error: sch_mqprio: TC 0 queues 3@0 overlap with TC 1 queues 1@1.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: mqprio: allow offloading drivers to request queue count validation · 19278d76
      Vladimir Oltean authored
      mqprio_parse_opt() proudly has a comment:
      
      	/* If hardware offload is requested we will leave it to the device
      	 * to either populate the queue counts itself or to validate the
      	 * provided queue counts.
      	 */
      
      Unfortunately some device drivers did not get this memo, and don't
      validate the queue counts, or populate them.
      
      For drivers that don't want to populate the queue counts themselves, but
      just act upon the requested configuration, it makes sense to introduce a
      tc capability and make mqprio query it, so that they don't have to do the
      validation themselves.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: mqprio: allow reverse TC:TXQ mappings · d7045f52
      Vladimir Oltean authored
      By imposing that the last TXQ of TC i is smaller than the first TXQ of
      any TC j (j := i+1 .. n), mqprio imposes a strict ordering condition for
      the TXQ indices (they must increase as TCs increase).
      
      Claudiu points out that the complexity of the TXQ count validation is
      too high for this logic, i.e. instead of iterating over j, it is
      sufficient that the TXQ indices of TC i and i + 1 are ordered, and that
      will eventually ensure global ordering.
      
      This is true, however it doesn't appear to me that is what the code
      really intended to do. Instead, based on the comments, it just wanted to
      check for overlaps (and this isn't how one does that).
      
      So the following mqprio configuration, which I had recommended to
      Vinicius more than once for igb/igc (to account for the fact that on
      this hardware, lower numbered TXQs have higher dequeue priority than
      higher ones):
      
      num_tc 4 map 0 1 2 3 queues 1@3 1@2 1@1 1@0
      
      is in fact denied today by mqprio.
      
      The full story is that in fact, it's only denied with "hw 0"; if
      hardware offloading is requested, mqprio defers TXQ range overlap
      validation to the device driver (a strange decision in itself).
      
      This is most certainly a bug, but it's not one that has any merit for
      being fixed on "stable" as far as I can tell. This is because mqprio
      always rejected a configuration which was in fact valid, and this has
      shaped the way in which mqprio configuration scripts got built for
      various hardware (see igb/igc in the link below). Therefore, one could
      consider it to be merely an improvement for mqprio to allow reverse
      TC:TXQ mappings.
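
      The difference between an ordering check and an overlap check can be
      sketched in C as follows. This is a minimal userspace illustration, not
      the kernel's validation code: struct txq_range and both function names
      are hypothetical, and each TC is assumed to be described by one
      "count@offset" pair as in the tc command line.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one TC's "count@offset" queue range. */
struct txq_range {
	unsigned int offset;	/* first TXQ index of this TC */
	unsigned int count;	/* number of TXQs in this TC */
};

/* Two half-open intervals [a, a + n) and [b, b + m) intersect iff each
 * one starts before the other one ends. Empty ranges never overlap.
 */
static bool txq_ranges_overlap(const struct txq_range *a,
			       const struct txq_range *b)
{
	if (!a->count || !b->count)
		return false;

	return a->offset < b->offset + b->count &&
	       b->offset < a->offset + a->count;
}

/* Returns true if no pair of TCs shares a TXQ, regardless of ordering. */
static bool txq_ranges_valid(const struct txq_range *tc, size_t num_tc)
{
	for (size_t i = 0; i < num_tc; i++)
		for (size_t j = i + 1; j < num_tc; j++)
			if (txq_ranges_overlap(&tc[i], &tc[j]))
				return false;

	return true;
}
```

      With this kind of pairwise check, the reverse mapping
      "queues 1@3 1@2 1@1 1@0" passes because no two ranges share a queue,
      while "queues 3@0 1@1" fails because TXQ 1 belongs to both TCs.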
      
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230130173145.475943-9-vladimir.oltean@nxp.com/#25188310
      Link: https://patchwork.kernel.org/project/netdevbpf/patch/20230128010719.2182346-6-vladimir.oltean@nxp.com/#25186442
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: move struct tc_mqprio_qopt_offload from pkt_cls.h to pkt_sched.h · 9adafe2b
      Vladimir Oltean authored
      Since mqprio is a scheduler and not a classifier, move its offload
      structure to pkt_sched.h, where struct tc_taprio_qopt_offload also lies.
      
      Also update some header inclusions in drivers that access this
      structure, to the best of my abilities.
      
      Cc: Igor Russkikh <irusskikh@marvell.com>
      Cc: Yisen Zhuang <yisen.zhuang@huawei.com>
      Cc: Salil Mehta <salil.mehta@huawei.com>
      Cc: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Cc: Tony Nguyen <anthony.l.nguyen@intel.com>
      Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
      Cc: Saeed Mahameed <saeedm@nvidia.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: Horatiu Vultur <horatiu.vultur@microchip.com>
      Cc: Lars Povlsen <lars.povlsen@microchip.com>
      Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
      Cc: Daniel Machon <daniel.machon@microchip.com>
      Cc: UNGLinuxDriver@microchip.com
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: mqprio: refactor offloading and unoffloading to dedicated functions · 5cfb45e2
      Vladimir Oltean authored
      Some more logic will be added to mqprio offloading, so split that code
      up from mqprio_init(), which is already large, and create a new
      function, mqprio_enable_offload(), similar to taprio_enable_offload().
      Also create the opposite function mqprio_disable_offload().
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/sched: mqprio: refactor nlattr parsing to a separate function · feb2cf3d
      Vladimir Oltean authored
      mqprio_init() is quite large and unwieldy to add more code to.
      Split the netlink attribute parsing to a dedicated function.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • gve: Fix gve interrupt names · 84371145
      Praveen Kaligineedi authored
      IRQs are currently requested before the netdevice is registered
      and a proper name is assigned to the device. Change the interrupt
      name to avoid embedding the unexpanded format string in it.
      
      Interrupt name before change: eth%d-ntfy-block.<blk_id>
      Interrupt name after change: gve-ntfy-blk<blk_id>@pci:<pci_name>
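
      As a rough illustration of the "after" format, the name can be built
      from values that are known before the netdevice is registered. This is
      only a sketch: ex_gve_irq_name() is a hypothetical helper, not the
      driver's code, and the PCI name is assumed to be available as a plain
      string.

```c
#include <stdio.h>

/* Build a name in the "after" format quoted above. Returns the value of
 * snprintf(): the length the full name needs, excluding the NUL.
 */
static int ex_gve_irq_name(char *buf, size_t len, int blk_id,
			   const char *pci_name)
{
	return snprintf(buf, len, "gve-ntfy-blk%d@pci:%s", blk_id, pci_name);
}
```

      Because neither input depends on the final netdev name, no literal
      "eth%d" can end up in the interrupt name.
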
      Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com>
      Reviewed-by: Jeroen de Borst <jeroendb@google.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue · d78f8d83
      David S. Miller authored
      Tony Nguyen says:
      
      ====================
      net: implement devlink reload in ice
      
      Michal Swiatkowski says:
      
      This is part of the changes done in patchset [0]. Resource management is
      a somewhat controversial part, so I split it into two patchsets.
      
      This is the first one, covering the refactoring and the implementation
      of the reload API call. The refactoring will unblock some of the patches
      needed by SIOV or subfunction.
      
      Most of this patchset is about implementing the driver reload mechanism.
      Part of the code from probe and rebuild is reused to avoid duplication.
      To allow this reuse, the probe and rebuild paths are split into smaller
      functions.
      
      Patch "ice: split ice_vsi_setup into smaller functions" changes a
      boolean variable in a function call to an integer and adds defines
      for it. Instead of being called with true/false, the function can now
      be called with the readable defines ICE_VSI_FLAG_INIT or
      ICE_VSI_FLAG_NO_INIT. This was suggested by Jacob Keller, and this
      mechanism will probably be implemented across the ice driver in a
      follow-up patchset.
      
      Previously the code was reviewed here [0].
      
      [0] https://lore.kernel.org/netdev/Y3ckRWtAtZU1BdXm@unreal/T/#m3bb8feba0a62f9b4cd54cd94917b7e2143fc2ecd
      
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: introduce skb_poison_list and use in kfree_skb_list · 9dde0cd3
      Jesper Dangaard Brouer authored
      The first user of skb_poison_list is kfree_skb_list_reason, to catch
      bugs earlier, like the one introduced in commit eedade12 ("net:
      kfree_skb_list use kmem_cache_free_bulk"). For completeness, the
      mentioned bug has been fixed in commit f72ff8b8 ("net: fix
      kfree_skb_list use of skb_mark_not_on_list").
      
      In case of a bug like the one in the mentioned commit, we would have
      seen an OOPS with:
       general protection fault, probably for non-canonical address 0xdead000000000870
      and the content of one of the registers, e.g. R13: dead000000000800
      
      In this case skb->len is at offset 112 bytes (0x70), which is why the
      fault happens at 0x800+0x70 = 0x870
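
      The address arithmetic above can be reproduced with a small sketch. The
      constants are assumptions read off the register dump in this message
      (R13: dead000000000800), and the names are hypothetical; the kernel's
      real definitions may differ by architecture and configuration.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, taken from the register dump quoted in the message. */
#define EXAMPLE_POISON_DELTA	UINT64_C(0xdead000000000000)
#define EXAMPLE_SKB_LIST_POISON	(EXAMPLE_POISON_DELTA + 0x800)

/* Address touched when code dereferences a member located
 * member_offset bytes into an object reached through the poisoned
 * pointer.
 */
static uint64_t poisoned_access_addr(size_t member_offset)
{
	return EXAMPLE_SKB_LIST_POISON + member_offset;
}
```

      Calling poisoned_access_addr(112) yields 0xdead000000000870, the
      non-canonical address reported in the fault.
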
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'wangxun-interrupts' · 149e8fb0
      David S. Miller authored
      Jiawen Wu says:
      
      ====================
      Wangxun interrupt and RxTx support
      
      Configure interrupts, set up the Rx/Tx rings, and add support for
      receiving and transmitting packets.
      
      change log:
      v3:
      - Use upper_32_bits() to avoid compile warning.
      - Remove useless code.
      v2:
      - Andrew Lunn: https://lore.kernel.org/netdev/Y86kDphvyHj21IxK@lunn.ch/
      - Add a check when allocating DMA memory for descriptors.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ngbe: Support Rx and Tx process path · b97f955e
      Mengyuan Lou authored
      Add enable and disable operation process for ngbe open/close.
      Clean Rx and Tx ring interrupts, process packets in the data path.
      Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: txgbe: Support Rx and Tx process path · 0d22be52
      Jiawen Wu authored
      Clean Rx and Tx ring interrupts, process packets in the data path.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: libwx: Add tx path to process packets · 09a50880
      Mengyuan Lou authored
      Add support for transmitting packets without hardware features.
      Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: libwx: Support to receive packets in NAPI · 3c47e8ae
      Jiawen Wu authored
      Clean all queues associated with a q_vector, to simply receive packets
      without hardware features.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: txgbe: Setup Rx and Tx ring · 0ef7e159
      Jiawen Wu authored
      Improve the configuration of Rx and Tx ring, set Rx flags and implement
      ndo_set_rx_mode ops.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: libwx: Allocate Rx and Tx resources · 850b9711
      Jiawen Wu authored
      Set up Rx and Tx descriptors for specific rings.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: libwx: Configure Rx and Tx unit on hardware · 18b5b8a9
      Jiawen Wu authored
      Configure the hardware in preparation for processing packets. This
      includes configuring the receive and transmit units of the MAC layer and
      setting up the specific rings.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: txgbe: Add interrupt support · 5d3ac705
      Jiawen Wu authored
      Determine the proper interrupt scheme to enable and handle interrupts.
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ngbe: Add irqs request flow · e7956139
      Mengyuan Lou authored
      Add request_irq for tx/rx rings and misc other events.
      If the request is successful, configure vectors for the interrupts.
      Enable some base interrupt masks in ngbe_irq_enable.
      Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: libwx: Add irq flow functions · 3f703186
      Mengyuan Lou authored
      Add irq flow functions for ngbe and txgbe.
      Allocate PCIe MSI-X IRQs for drivers, otherwise fall back to MSI/legacy.
      Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com>
      Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: page_pool: use in_softirq() instead · 542bcea4
      Qingfang DENG authored
      We use BH context only for synchronization, so we don't care if it's
      actually serving softirq or not.
      
      As a side note, in case of threaded NAPI, in_serving_softirq() will
      return false because it's in process context with BH off, making
      page_pool_recycle_in_cache() unreachable.
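
      The distinction between the two predicates can be modelled in userspace
      with a simplified view of the softirq byte of preempt_count. This is a
      sketch under assumptions: the EX_ constants and ex_ functions are
      hypothetical names modelled on a non-PREEMPT_RT kernel, where disabling
      BH is assumed to add twice the softirq offset while serving a softirq
      adds it once.

```c
#include <stdbool.h>

/* Simplified model of the softirq byte of preempt_count. Names are
 * prefixed with EX_/ex_ because this is an illustration, not kernel code.
 */
#define EX_SOFTIRQ_SHIFT		8
#define EX_SOFTIRQ_OFFSET		(1U << EX_SOFTIRQ_SHIFT)
#define EX_SOFTIRQ_MASK			(0xffU << EX_SOFTIRQ_SHIFT)
#define EX_SOFTIRQ_DISABLE_OFFSET	(2 * EX_SOFTIRQ_OFFSET)

/* True when a softirq handler is running OR bottom halves are disabled. */
static bool ex_in_softirq(unsigned int preempt_count)
{
	return (preempt_count & EX_SOFTIRQ_MASK) != 0;
}

/* True only while a softirq handler is actually running. */
static bool ex_in_serving_softirq(unsigned int preempt_count)
{
	return (preempt_count & EX_SOFTIRQ_OFFSET) != 0;
}
```

      In this model, a threaded NAPI poll running in process context with BH
      off has a count of EX_SOFTIRQ_DISABLE_OFFSET, so ex_in_softirq() is true
      while ex_in_serving_softirq() is false.
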
      Signed-off-by: Qingfang DENG <qingfang.deng@siflower.com.cn>
      Tested-by: Felix Fietkau <nbd@nbd.name>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge tag 'mlx5-updates-2023-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux · 637bc8f0
      David S. Miller authored
      Saeed Mahameed says:
      
      ====================
      mlx5-updates-2023-02-04
      
      This series provides misc updates to mlx5 driver:
      
      1) Trivial LAG code cleanup patches from Roi
      
      2) Rahul improves mlx5's documentation structure
      Separates the documentation into multiple pages related to different
      components in the device driver. Adds Kconfig parameters, devlink
      parameters, and tracepoints that were previously introduced but not added
      to the documentation. Introduces a new page on ethtool statistics counters
      with information about counters previously implemented in the mlx5_core
      driver but not documented in the kernel tree.
      
      3) From Raed, policy/state selector support for IPSec.
      
      4) From Fragos, add support for XDR speed in IPoIB mlx5 netdev
      
      5) Few more misc cleanups and trivial changes
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • virtio-net: Maintain reverse cleanup order · 27369c9c
      Parav Pandit authored
      To make the code easier to audit, keep the device stop()
      sequence a mirror of the device open() sequence.
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Jiri Pirko <jiri@nvidia.com>
      Signed-off-by: Parav Pandit <parav@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'bridge-mdb-limit' · cb3086ce
      David S. Miller authored
      Petr Machata says:
      
      ====================
      bridge: Limit number of MDB entries per port, port-vlan
      
      The MDB maintained by the bridge is limited. When the bridge is configured
      for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
      capacity. In SW datapath, the capacity is configurable through the
      IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a
      similar limit exists in the HW datapath for purposes of offloading.
      
      In order to prevent the issue of unilateral exhaustion of MDB resources,
      introduce two parameters in each of two contexts:
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN number of MDB entries that the port is a member of.
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN maximum permitted number of MDB entries, or 0 for
        no limit.
      
      Per-port number of entries keeps track of the total number of MDB entries
      configured on a given port. The per-port-VLAN value then keeps track of the
      subset of MDB entries configured specifically for the given VLAN, on that
      port. The number is adjusted as port_groups are created and deleted, and
      therefore under multicast lock.
      
      A maximum value, if non-zero, then places a limit on the number of entries
      that can be configured in a given context. Attempts to add entries above
      the maximum are rejected.
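
      The accounting rule described here (0 means no limit, and an addition is
      bounced when n >= max, as the v2 notes below state) can be sketched as
      follows. The struct and function are hypothetical simplifications; the
      real fields are named later in this letter, and locking and
      READ_ONCE/WRITE_ONCE handling are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical counters for one context (port or port-VLAN). */
struct ex_mdb_ctx {
	unsigned int n_entries;		/* current number of MDB entries */
	unsigned int max_entries;	/* 0 means "no limit" */
};

/* Try to account for one more entry. Returns false (and leaves the
 * counter untouched) when the context is already at or above its limit.
 */
static bool ex_mdb_ngroups_inc_one(struct ex_mdb_ctx *ctx)
{
	if (ctx->max_entries && ctx->n_entries >= ctx->max_entries)
		return false;

	ctx->n_entries++;
	return true;
}
```

      Comparing with >= rather than == also covers the case where the maximum
      was lowered below the current count: further additions are still
      rejected.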
      
      Rejection reason of netlink-based requests to add MDB entries is
      communicated through extack. This channel is unavailable for rejections
      triggered from the control path. To address this lack of visibility, the
      patchset adds a tracepoint, bridge:br_mdb_full:
      
      	# perf record -e bridge:br_mdb_full &
      	# [...]
      	# perf script | cut -d: -f4-
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
      	 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
      	 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
      	 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
      
      Another option to consume the tracepoint is e.g. through the bpftrace tool:
      
      	# bpftrace -e ' tracepoint:bridge:br_mdb_full /args->af != 0/ {
      			    printf("dev %s src %s grp %s vid %u\n",
      				   str(args->dev), ntop(args->src),
      				   ntop(args->grp), args->vid);
      			}
      			tracepoint:bridge:br_mdb_full /args->af == 0/ {
      			    printf("dev %s grp %s vid %u\n",
      				   str(args->dev),
      				   macaddr(args->grpmac), args->vid);
      			}'
      
      This tracepoint is triggered for mcast_hash_max exhaustions as well.
      
      The following is an example of how the feature is used. A more extensive
      example is available in patch #8:
      
      	# bridge vlan set dev v1 vid 1 mcast_max_groups 1
      	# bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
      	# bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
      	Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
      
      The patchset progresses as follows:
      
      - In patch #1, set strict_start_type at two bridge-related policies. The
        reason is we are adding a new attribute to one of these, and want the new
        attribute to be parsed strictly. The other was adjusted for completeness'
        sake.
      
      - In patches #2 to #5, br_mdb and br_multicast code is adjusted to make the
        following additions smoother.
      
      - In patch #6, add the tracepoint.
      
      - In patch #7, the code to maintain number of MDB entries is added as
        struct net_bridge_mcast_port::mdb_n_entries. The maximum is added, too,
        as struct net_bridge_mcast_port::mdb_max_entries, however at this point
        there is no way to set the value yet, and since 0 is treated as "no
        limit", the functionality doesn't change at this point. Note however,
        that mcast_hash_max violations already do trigger at this point.
      
      - In patch #8, netlink plumbing is added: reading of number of entries, and
        reading and writing of maximum.
      
        The per-port values are passed through RTM_NEWLINK / RTM_GETLINK messages
        in IFLA_BRPORT_MCAST_N_GROUPS and _MAX_GROUPS, inside IFLA_PROTINFO nest.
      
        The per-port-vlan values are passed through RTM_GETVLAN / RTM_NEWVLAN
        messages in BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS, _MAX_GROUPS, inside
        BRIDGE_VLANDB_ENTRY.
      
      The following patches deal with the selftest:
      
      - Patches #9 and #10 clean up and move around some selftest code.
      
      - Patches #11 to #14 add helpers and generalize the existing IGMP / MLD
        support to allow generating packets with configurable group addresses and
        varying source lists for (S,G) memberships.
      
      - Patch #15 adds code to generate IGMP leave and MLD done packets.
      
      - Patch #16 finally adds the selftest itself.
      
      v3:
      - Patch #7:
          - Access mdb_max_/_n_entries through READ_/WRITE_ONCE
          - Move extack setting to br_multicast_port_ngroups_inc_one().
            Since we use NL_SET_ERR_MSG_FMT_MOD, the correct context
            (port / port-vlan) can be passed through an argument.
            This also removes the need for more READ/WRITE_ONCE's
            at the extack-setting site.
      - Patch #8:
          - Move the br_multicast_port_ctx_vlan_disabled() check
            out to the _vlan_ helpers callers. Thus these helpers
            cannot fail, which makes them very similar to the
            _port_ helpers. Have them take the MC context directly
            and unify them.
      
      v2:
      - Cover letter:
          - Add an example of a bpftrace-based probe script
      - Patch #6:
          - Report IPv4 as an IPv6-mapped address through the IPv6 buffer
            as well, to save ring buffer space.
      - Patch #7:
          - In br_multicast_port_ngroups_inc_one(), bounce
            if n>=max, not if n==max
          - Adjust extack messages to mention ngroups, now
            that the bounces appear when n>=max, not n==max
          - In __br_multicast_enable_port_ctx(), do not reset
            max to 0. Also do not count number of entries by
            going through _inc, as that would end up incorrectly
            bouncing the entries.
      - Patch #8:
          - Drop locks around accesses in
            br_multicast_{port,vlan}_ngroups_{get,set_max}(),
          - Drop bounces due to max<n in
            br_multicast_{port,vlan}_ngroups_set_max().
      - Patch #12:
          - In the comment at payload_template_calc_checksum(),
            s/%#02x/%02x/, that's the mausezahn payload format.
      - Patch #16:
          - Adjust the tests that check setting max below n and
            reset of max on VLAN snooping enablement
          - Make test naming uniform
          - Enable testing of control path (IGMP/MLD) in
            mcast_vlan_snooping bridge
          - Reorganize the code so that test instances (per bridge
            type and configuration type) always come right after
            the test, in order of {d,q,qvs}{4,6}{cfg,ctl}.
            Then groups of selftests are at the end of the file.
            Similarly adjust invocation order of the tests.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: bridge_mdb_max: Add a new selftest · 3446dcd7
      Petr Machata authored
      Add a suite covering mcast_n_groups and mcast_max_groups bridge features.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Add helpers to build IGMP/MLD leave packets · 9ae85469
      Petr Machata authored
      The testsuite that checks for mcast_max_groups functionality will need to
      wipe the added groups as well. Add helpers to build IGMP or MLD packets
      announcing that a host is leaving a given group.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Allow list of IPs for IGMPv3/MLDv2 · 705d4bc7
      Petr Machata authored
      The testsuite that checks for mcast_max_groups functionality will need
      to generate IGMP and MLD packets with configurable number of (S,G)
      addresses. To that end, further extend igmpv3_is_in_get() and
      mldv2_is_in_get() to allow a list of IP addresses instead of one
      address.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • selftests: forwarding: lib: Parameterize IGMPv3/MLDv2 generation · 506a1ac9
      Petr Machata authored
      In order to generate IGMPv3 and MLDv2 packets on the fly, the
      functions that generate these packets need to be able to generate
      packets for different groups and different sources. Generating MLDv2
      packets further needs the source address of the packet for purposes of
      checksum calculation. Add the necessary parameters, and generate the
      payload accordingly by dispatching to helpers added in the previous
      patches.
      
      Adjust the sole client, bridge_mdb.sh, as well.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      506a1ac9
    • Petr Machata's avatar
      selftests: forwarding: lib: Add helpers for checksum handling · 952e0ee3
      Petr Machata authored
      In order to generate IGMPv3 and MLDv2 packets on the fly, we will need
      helpers to calculate the packet checksum.
      
      The approach presented in this patch revolves around payload templates
      for mausezahn. These are mausezahn-like payload strings (01:23:45:...)
      with possibly one 2-byte sequence replaced with the word PAYLOAD. The
      main function is payload_template_calc_checksum(), which calculates
      the RFC 1071 checksum of the message. There are further helpers to then
      convert the checksum to the payload format, and to expand it.
      
      For IPv6, the MLDv2 message checksum is computed using a pseudoheader
      that differs from the header used in the payload itself. The fact that
      the two messages are different means that the checksum needs to be
      returned as a separate quantity, instead of being expanded in-place in
      the payload itself. Furthermore, the pseudoheader includes the length of
      the message. Much like the checksum, this needs to be expanded in
      mausezahn format. And likewise for the number of addresses for (S,G)
      entries. Thus we have several places where a computed quantity needs
      to be presented in the payload format. Add a helper u16_to_bytes(),
      which will be used in all these cases.
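
      The two building blocks described above can be sketched roughly as
      follows. This is an illustration under assumptions, not the code from
      the patch: the function names are hypothetical, and the input is
      assumed to be a plain colon-separated hex string whose checksum field
      has already been zeroed (template substitution is left out).

```shell
#!/bin/bash
# Illustrative sketch; names are hypothetical, not the helpers from the patch.

# Format a 16-bit value as two colon-separated hex bytes, e.g. 258 -> 01:02.
u16_to_bytes_sketch()
{
	local u16=$1

	printf "%02x:%02x\n" $((u16 >> 8)) $((u16 & 0xff))
}

# Compute the RFC 1071 checksum of a colon-separated hex byte string.
# The checksum field inside the message is expected to be zero already.
rfc1071_checksum_sketch()
{
	local payload=$1
	local -a bytes
	local sum=0
	local i

	IFS=: read -r -a bytes <<< "$payload"

	# Pad to an even number of bytes, as RFC 1071 prescribes.
	if (( ${#bytes[@]} % 2 )); then
		bytes+=(00)
	fi

	# Sum the message as big-endian 16-bit words.
	for ((i = 0; i < ${#bytes[@]}; i += 2)); do
		sum=$(( sum + (16#${bytes[i]} << 8) + 16#${bytes[i + 1]} ))
	done

	# Fold the carries back in, then take the one's complement.
	while (( sum >> 16 )); do
		sum=$(( (sum & 0xffff) + (sum >> 16) ))
	done

	echo $(( ~sum & 0xffff ))
}
```

      With the worked example from RFC 1071 as input,
      u16_to_bytes_sketch "$(rfc1071_checksum_sketch 00:01:f2:03:f4:f5:f6:f7)"
      prints 22:0d.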
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      952e0ee3
    • Petr Machata's avatar
      selftests: forwarding: lib: Add helpers for IP address handling · fcf49276
      Petr Machata authored
      In order to generate IGMPv3 and MLDv2 packets on the fly, we will need
      helpers to expand IPv4 and IPv6 addresses given as parameters in
      mausezahn payload notation. Add helpers that do it.
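
      A minimal sketch of such an expansion is shown below. The function
      names are hypothetical and need not match the helpers in the patch.
      The IPv6 variant assumes a well-formed address with at most one "::"
      and no embedded IPv4 part, and the IPv4 variant assumes plain decimal
      octets without leading zeros.

```shell
#!/bin/bash
# Illustrative sketch; names are hypothetical, not the helpers from the patch.

# 192.0.2.1 -> c0:00:02:01
ipv4_to_bytes_sketch()
{
	local -a octets

	IFS=. read -r -a octets <<< "$1"
	printf "%02x:%02x:%02x:%02x\n" "${octets[@]}"
}

# ff0e::112 -> ff:0e:00:00:...:01:12 (16 bytes)
ipv6_to_bytes_sketch()
{
	local IP=$1
	local head=${IP%%::*}
	local tail=
	local -a hgroups=() tgroups=() groups=()
	local missing i g out=

	# Text after "::", if the address is abbreviated at all.
	if [[ $IP == *::* ]]; then
		tail=${IP#*::}
	fi

	IFS=: read -r -a hgroups <<< "$head"
	IFS=: read -r -a tgroups <<< "$tail"

	# "::" stands for however many zero groups are needed to reach eight.
	missing=$(( 8 - ${#hgroups[@]} - ${#tgroups[@]} ))
	groups=("${hgroups[@]}")
	for ((i = 0; i < missing; i++)); do
		groups+=(0)
	done
	groups+=("${tgroups[@]}")

	# Emit each 16-bit group as two hex bytes.
	for g in "${groups[@]}"; do
		out+=$(printf ":%02x:%02x" $((16#$g >> 8)) $((16#$g & 0xff)))
	done
	echo "${out#:}"
}
```

      For example, ipv4_to_bytes_sketch 192.0.2.1 prints c0:00:02:01.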
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fcf49276
    • Petr Machata's avatar
      selftests: forwarding: bridge_mdb: Fix a typo · f7ccf60c
      Petr Machata authored
      Add the letter missing from the word "INCLUDE".
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f7ccf60c
    • Petr Machata's avatar
      selftests: forwarding: Move IGMP- and MLD-related functions to lib · 344dd2c9
      Petr Machata authored
      These functions will be helpful for other testsuites as well. Extract them
      to a common place.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      344dd2c9
    • Petr Machata's avatar
      net: bridge: Add netlink knobs for number / maximum MDB entries · a1aee20d
      Petr Machata authored
      The previous patch added accounting for the number of MDB entries per
      port and per port-VLAN, and the logic to verify that these values stay
      within configured bounds. However, it did not provide the means to
      actually configure those bounds or read the occupancy. This patch does
      that.
      
      Two new netlink attributes are added for the MDB occupancy:
      IFLA_BRPORT_MCAST_N_GROUPS for the per-port occupancy and
      BRIDGE_VLANDB_ENTRY_MCAST_N_GROUPS for the per-port-VLAN occupancy.
      And another two for the maximum number of MDB entries:
      IFLA_BRPORT_MCAST_MAX_GROUPS for the per-port maximum, and
      BRIDGE_VLANDB_ENTRY_MCAST_MAX_GROUPS for the per-port-VLAN one.
      
      Note that the two new IFLA_BRPORT_ attributes prompt bumping of
      RTNL_SLAVE_MAX_TYPE to size the slave attribute tables large enough.
      
      The new attributes are used like this:
      
       # ip link add name br up type bridge vlan_filtering 1 mcast_snooping 1 \
                                            mcast_vlan_snooping 1 mcast_querier 1
       # ip link set dev v1 master br
       # bridge vlan add dev v1 vid 2
      
       # bridge vlan set dev v1 vid 1 mcast_max_groups 1
       # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 1
       # bridge mdb add dev br port v1 grp 230.1.2.4 temp vid 1
       Error: bridge: Port-VLAN is already in 1 groups, and mcast_max_groups=1.
      
       # bridge link set dev v1 mcast_max_groups 1
       # bridge mdb add dev br port v1 grp 230.1.2.3 temp vid 2
       Error: bridge: Port is already in 1 groups, and mcast_max_groups=1.
      
       # bridge -d link show
       5: v1@v2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master br [...]
           [...] mcast_n_groups 1 mcast_max_groups 1
      
       # bridge -d vlan show
       port              vlan-id
       br                1 PVID Egress Untagged
                           state forwarding mcast_router 1
       v1                1 PVID Egress Untagged
                           [...] mcast_n_groups 1 mcast_max_groups 1
                         2
                           [...] mcast_n_groups 0 mcast_max_groups 0
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a1aee20d
    • Petr Machata's avatar
      net: bridge: Maintain number of MDB entries in net_bridge_mcast_port · b57e8d87
      Petr Machata authored
      The MDB maintained by the bridge is limited. When the bridge is configured
      for IGMP / MLD snooping, a buggy or malicious client can easily exhaust its
      capacity. In SW datapath, the capacity is configurable through the
      IFLA_BR_MCAST_HASH_MAX parameter, but ultimately is finite. Obviously a
      similar limit exists in the HW datapath for purposes of offloading.
      
      In order to prevent the issue of unilateral exhaustion of MDB resources,
      introduce two parameters in each of two contexts:
      
      - Per-port and per-port-VLAN number of MDB entries that the port
        is a member of.
      
      - Per-port and (when BROPT_MCAST_VLAN_SNOOPING_ENABLED is enabled)
        per-port-VLAN maximum permitted number of MDB entries, or 0 for
        no limit.
      
      The per-port multicast context is used for tracking of MDB entries for the
      port as a whole. This is available for all bridges.
      
      The per-port-VLAN multicast context is then only available on
      VLAN-filtering bridges on VLANs that have multicast snooping on.
      
      With these changes in place, it will be possible to configure an MDB
      limit for the bridge as a whole, for any one port as a whole, or for any
      single port-VLAN.
      
      Note that unlike the global limit, exhaustion of the per-port and
      per-port-VLAN maximums does not cause disablement of multicast snooping.
      It is also permitted to configure the local limit larger than hash_max,
      even though that is not useful.
      
      In this patch, introduce only the accounting for the number of entries,
      and the max field itself, but not the means to toggle the max. The next
      patch introduces the netlink APIs to toggle and read the values.
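
      The limit semantics described above can be modelled with a short shell
      sketch. This is only a model of the rule, not the kernel code: the
      function names are hypothetical, and the order in which the two
      contexts are checked is an assumption.

```shell
#!/bin/bash
# Illustrative model of the limit rule; names are hypothetical.

# Succeeds if one more entry fits under the limit. A max of 0 means no limit.
mdb_limit_allows_sketch()
{
	local n_groups=$1
	local max_groups=$2

	(( max_groups == 0 || n_groups < max_groups ))
}

# A new entry has to fit in the per-port context and, where one exists,
# in the per-port-VLAN context as well.
mdb_entry_allowed_sketch()
{
	local port_n=$1 port_max=$2 vlan_n=$3 vlan_max=$4

	mdb_limit_allows_sketch "$port_n" "$port_max" &&
		mdb_limit_allows_sketch "$vlan_n" "$vlan_max"
}
```

      For instance, mdb_entry_allowed_sketch 1 1 0 0 fails, because the port
      is already in one group and its maximum is 1, even though the
      port-VLAN context has no limit configured.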
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b57e8d87
    • Petr Machata's avatar
      net: bridge: Add a tracepoint for MDB overflows · d47230a3
      Petr Machata authored
      The following patch will add two more maximum MDB allowances to the global
      one, mcast_hash_max, that exists today. In all these cases, attempts to add
      MDB entries above the configured maximums through netlink fail noisily and
      obviously. Such visibility is missing when entries are added through
      control plane traffic, by IGMP or MLD packets.
      
      To improve visibility in those cases, add a trace point that reports the
      violation, including the relevant netdevice (be it a slave or the bridge
      itself), and the MDB entry parameters:
      
      	# perf record -e bridge:br_mdb_full &
      	# [...]
      	# perf script | cut -d: -f4-
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 0
      	 dev v2 af 10 src :: grp ff0e::112/00:00:00:00:00:00 vid 0
      	 dev v2 af 2 src ::ffff:0.0.0.0 grp ::ffff:239.1.1.112/00:00:00:00:00:00 vid 10
      	 dev v2 af 10 src 2001:db8:1::1 grp ff0e::1/00:00:00:00:00:00 vid 10
      	 dev v2 af 2 src ::ffff:192.0.2.1 grp ::ffff:239.1.1.1/00:00:00:00:00:00 vid 10
      
      CC: Steven Rostedt <rostedt@goodmis.org>
      CC: linux-trace-kernel@vger.kernel.org
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d47230a3
    • Petr Machata's avatar
      net: bridge: Change a cleanup in br_multicast_new_port_group() to goto · eceb3085
      Petr Machata authored
      This function will have more to clean up in the following patches.
      Structuring the cleanups in one labeled block will allow reusing the same
      cleanup from several places.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eceb3085
    • Petr Machata's avatar
      net: bridge: Add br_multicast_del_port_group() · 976b3858
      Petr Machata authored
      Since cleaning up the effects of br_multicast_new_port_group() just
      consists of delisting and freeing the memory, the function
      br_mdb_add_group_star_g() inlines the corresponding code. In the following
      patches, the number of per-port and per-port-VLAN MDB entries is going to
      be maintained, and that counter will have to be updated. Because that
      logic is going to be hidden in the br_multicast module, introduce a new
      hook intended to remove a newly created group again.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      976b3858