1. 03 Mar, 2022 40 commits
    • bpf: Keep the (rcv) timestamp behavior for the existing tc-bpf@ingress · 7449197d
      Martin KaFai Lau authored
      The current tc-bpf@ingress reads and writes the __sk_buff->tstamp
      as a (rcv) timestamp which currently could either be 0 (not available)
      or ktime_get_real().  This patch stays backward compatible with the
      (rcv) timestamp expectation at ingress.  If the skb->tstamp has
      the delivery_time, the bpf insn rewrite will read 0 for tc-bpf
      running at ingress as it is not available.  When writing at ingress,
      it will also clear the skb->mono_delivery_time bit.
      
      /* BPF_READ: a = __sk_buff->tstamp */
      if (!skb->tc_at_ingress || !skb->mono_delivery_time)
      	a = skb->tstamp;
      else
      	a = 0;
      
      /* BPF_WRITE: __sk_buff->tstamp = a */
      if (skb->tc_at_ingress)
      	skb->mono_delivery_time = 0;
      skb->tstamp = a;
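
The two rewrites above can be modeled in plain userspace C. This is only a sketch: `struct skb_model` and the `model_*` functions are hypothetical stand-ins for the three sk_buff fields involved, not the instructions that convert_ctx_access() actually emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the sk_buff fields used by the rewrite. */
struct skb_model {
	uint64_t tstamp;
	unsigned int tc_at_ingress:1;
	unsigned int mono_delivery_time:1;
};

/* BPF_READ: the value a tc-bpf program sees in __sk_buff->tstamp. */
static uint64_t model_read_tstamp(const struct skb_model *skb)
{
	assert(skb);
	if (!skb->tc_at_ingress || !skb->mono_delivery_time)
		return skb->tstamp;
	return 0;	/* the delivery_time is hidden at ingress */
}

/* BPF_WRITE: the effect of a tc-bpf program storing to __sk_buff->tstamp. */
static void model_write_tstamp(struct skb_model *skb, uint64_t a)
{
	assert(skb);
	if (skb->tc_at_ingress)
		skb->mono_delivery_time = 0;
	skb->tstamp = a;
}
```

In this model an ingress program never observes a delivery_time, and anything it writes is treated as a (rcv) timestamp.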
      
      [ A note on the BPF_CGROUP_INET_INGRESS which can also access
        skb->tstamp.  At that point, the skb is delivered locally
        and skb_clear_delivery_time() has already been done,
        so the skb->tstamp will only have the (rcv) timestamp. ]
      
      If the tc-bpf@egress writes 0 to skb->tstamp, the skb->mono_delivery_time
      bit has to be cleared as well.  It could be done together during
      convert_ctx_access().  However, a later patch will also expose
      the skb->mono_delivery_time bit as __sk_buff->delivery_time_type.
      Changing the delivery_time_type in the background may surprise
      the user, e.g. the 2nd read on __sk_buff->delivery_time_type
      may need a READ_ONCE() to avoid compiler optimization.  Thus,
      in anticipation of that later patch, this patch checks
      !skb->tstamp after running the tc-bpf and clears the
      skb->mono_delivery_time bit if needed.  See the earlier discussion
      on v4 [0].
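
The post-run check can be sketched as follows. The struct and the function name `model_after_tc_egress()` are hypothetical; the sketch only illustrates the rule that a zero skb->tstamp must not keep the mono_delivery_time bit set.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct skb_model {
	uint64_t tstamp;
	unsigned int mono_delivery_time:1;
};

/* Called once after the tc-bpf program has run at egress (model only). */
static void model_after_tc_egress(struct skb_model *skb)
{
	assert(skb);
	if (!skb->tstamp)
		skb->mono_delivery_time = 0;
}
```

Doing the check once after the program, rather than inside every rewritten store, keeps the bit stable while the program is running.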
      
      The bpf insn rewrite requires the skb's mono_delivery_time bit and
      tc_at_ingress bit.  They are moved up in sk_buff so that bpf rewrite
      can be done at a fixed offset.  tc_skip_classify is moved together with
      tc_at_ingress.  To get one bit for mono_delivery_time, csum_not_inet is
      moved down and this bit is currently used by sctp.
      
      [0]: https://lore.kernel.org/bpf/20220217015043.khqwqklx45c4m4se@kafai-mbp.dhcp.thefacebook.com/
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Postpone skb_clear_delivery_time() until knowing the skb is delivered locally · cd14e9b7
      Martin KaFai Lau authored
      The previous patches handled the delivery_time in the ingress path
      before the routing decision is made.  This patch can postpone clearing
      delivery_time in a skb until knowing it is delivered locally and also
      set the (rcv) timestamp if needed.  This patch moves the
      skb_clear_delivery_time() from dev.c to ip_local_deliver_finish()
      and ip6_input_finish().
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Get rcv tstamp if needed in nfnetlink_{log, queue}.c · 80fcec67
      Martin KaFai Lau authored
      If skb has the (rcv) timestamp available, nfnetlink_{log, queue}.c
      logs/outputs it to the userspace.  When the locally generated skb is
      looping from egress to ingress over a virtual interface (e.g. veth,
      loopback...), skb->tstamp may have the delivery time before it is
      known that it will be delivered locally and received by another sk.  Like
      handling the delivery time in network tapping, use ktime_get_real() to
      get the (rcv) timestamp.  The earlier added helper skb_tstamp_cond() is
      used to do this.  false is passed to the second 'cond' arg such
      that doing ktime_get_real() or not only depends on the
      netstamp_needed_key static key.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: Get rcv timestamp if needed when handling hop-by-hop IOAM option · b6561f84
      Martin KaFai Lau authored
      IOAM is a hop-by-hop option with a temporary IANA allocation (49).
      Since it is hop-by-hop, it is done before the input routing decision.
      One of the traced data fields is the (rcv) timestamp.
      
      When the locally generated skb is looping from egress to ingress over
      a virtual interface (e.g. veth, loopback...), skb->tstamp may have the
      delivery time before it is known that it will be delivered locally
      and received by another sk.
      
      Like handling the network tapping (tcpdump) in the earlier patch,
      this patch gets the timestamp if needed without over-writing the
      delivery_time in the skb->tstamp.  skb_tstamp_cond() is added to do the
      ktime_get_real() with an extra cond arg to check on top of the
      netstamp_needed_key static key.  skb_tstamp_cond() will also be used in
      a later patch and it needs the netstamp_needed_key check.
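
A userspace sketch of the behavior described for skb_tstamp_cond(). The `model_*` names are hypothetical, `model_netstamp_needed` stands in for the netstamp_needed_key static key, and time() stands in for ktime_get_real(). The order of the checks is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct skb_model {
	uint64_t tstamp;
	unsigned int mono_delivery_time:1;
};

/* Stand-in for the netstamp_needed_key static key. */
static bool model_netstamp_needed;

/* Stand-in for ktime_get_real(): wall-clock time in nanoseconds. */
static uint64_t model_ktime_get_real(void)
{
	return (uint64_t)time(NULL) * 1000000000ULL;
}

/* Return a (rcv) timestamp without overwriting a delivery_time in the skb. */
static uint64_t model_skb_tstamp_cond(const struct skb_model *skb, bool cond)
{
	assert(skb);
	/* A (rcv) timestamp is already stored: report it as is. */
	if (!skb->mono_delivery_time && skb->tstamp)
		return skb->tstamp;
	/* Otherwise take a fresh one only if somebody asked for it. */
	if (model_netstamp_needed || cond)
		return model_ktime_get_real();
	return 0;
}
```

The skb is taken by const pointer, so a delivery_time stored in tstamp survives the call.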
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: Handle delivery_time in ipv6 defrag · 335c8cf3
      Martin KaFai Lau authored
      A later patch will postpone the delivery_time clearing until the stack
      knows the skb is being delivered locally (i.e. calling
      skb_clear_delivery_time() at ip_local_deliver_finish() for IPv4
      and at ip6_input_finish() for IPv6).  That will allow other kernel
      forwarding paths (e.g. ip[6]_forward) to keep the delivery_time also.
      
      Very similar IPv6 defrag code has been duplicated in
      multiple places: regular IPv6, nf_conntrack, and 6lowpan.
      
      Unlike the IPv4 defrag which is done before ip_local_deliver_finish(),
      the regular IPv6 defrag is done after ip6_input_finish().
      Thus, no change should be needed in the regular IPv6 defrag
      logic because skb_clear_delivery_time() should have been called.
      
      6lowpan also does not need special handling on delivery_time
      because it is a non-inet packet_type.
      
      However, nf_conntrack has a case in NF_INET_PRE_ROUTING that needs
      to do the IPv6 defrag earlier.  Thus, it needs to save the
      mono_delivery_time bit in the inet_frag_queue which is similar
      to how it is handled in the previous patch for the IPv4 defrag.
      
      This patch chooses to do it consistently and stores the mono_delivery_time
      in the inet_frag_queue for all cases such that it will be easier
      for the future refactoring effort on the IPv6 reasm code.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ip: Handle delivery_time in ip defrag · 8672406e
      Martin KaFai Lau authored
      A later patch will postpone the delivery_time clearing until the stack
      knows the skb is being delivered locally.  That will allow other kernel
      forwarding paths (e.g. ip[6]_forward) to keep the delivery_time also.
      
      An earlier attempt was to do skb_clear_delivery_time() in
      ip_local_deliver() and ip6_input().  The discussion [0] requested
      to move it one step later into ip_local_deliver_finish()
      and ip6_input_finish() so that the delivery_time can be kept
      for the ip_vs forwarding path also.
      
      To do that, this patch also needs to take care of the (rcv) timestamp
      usecase in ip_is_fragment().  It needs to expect delivery_time in
      the skb->tstamp, so it needs to save the mono_delivery_time bit in
      inet_frag_queue such that the delivery_time (if any) can be restored
      in the final defragmented skb.
      
      [Note that it will only happen when the locally generated skb is looping
       from egress to ingress over a virtual interface (e.g. veth, loopback...),
       skb->tstamp may have the delivery time before it is known that it will
       be delivered locally and received by another sk.]
      
      [0]: https://lore.kernel.org/netdev/ca728d81-80e8-3767-d5e-d44f6ad96e43@ssi.bg/
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Set skb->mono_delivery_time and clear it after sch_handle_ingress() · d98d58a0
      Martin KaFai Lau authored
      The previous patches handled the delivery_time before sch_handle_ingress().
      
      This patch can now set the skb->mono_delivery_time to flag that skb->tstamp
      is used as the mono delivery_time (EDT) instead of the (rcv) timestamp
      and also clear it with skb_clear_delivery_time() after
      sch_handle_ingress().  This lets bpf_redirect_*() keep
      the mono delivery_time so that it can be used by a qdisc (fq) of
      the egressing interface.
      
      A later patch will postpone the skb_clear_delivery_time() until the
      stack learns that the skb is being delivered locally and that will
      make other kernel forwarding paths (ip[6]_forward) able to keep
      the delivery_time also.  Thus, like the previous patches on using
      the skb->mono_delivery_time bit, calling skb_clear_delivery_time()
      is not limited within the CONFIG_NET_INGRESS to avoid too much code
      churn within this set.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Clear mono_delivery_time bit in __skb_tstamp_tx() · d93376f5
      Martin KaFai Lau authored
      __skb_tstamp_tx() may clone the egress skb and queue the clone to
      the sk_error_queue.  The outgoing skb may have the mono delivery_time
      while the (rcv) timestamp is expected for the clone, so the
      skb->mono_delivery_time bit needs to be cleared from the clone.
      
      This patch adds the skb->mono_delivery_time clearing to the existing
      __net_timestamp() and uses it in __skb_tstamp_tx().
      The __net_timestamp() fast path usage in dev.c is changed to directly
      call ktime_get_real() since the mono_delivery_time bit is not set at
      that point.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Handle delivery_time in skb->tstamp during network tapping with af_packet · 27942a15
      Martin KaFai Lau authored
      A later patch will set the skb->mono_delivery_time to flag that skb->tstamp
      is used as the mono delivery_time (EDT) instead of the (rcv) timestamp.
      skb_clear_tstamp() will then keep this delivery_time during forwarding.
      
      This patch makes network tapping (with af_packet) handle
      the delivery_time stored in skb->tstamp.
      
      Regardless of tapping at the ingress or egress, the tapped skb is
      received by the af_packet socket, so it is ingress to the af_packet
      socket and it expects the (rcv) timestamp.
      
      When tapping at egress, dev_queue_xmit_nit() is used.  It already
      expects that skb->tstamp may have the delivery_time, so it does
      skb_clone()+net_timestamp_set() to ensure the cloned skb has
      the (rcv) timestamp before passing to the af_packet sk.
      This patch only adds clearing of the skb->mono_delivery_time
      bit in net_timestamp_set().
      
      When tapping at ingress, it currently expects the skb->tstamp is either 0
      or the (rcv) timestamp.  Meaning, the tapping at ingress path
      has already expected the skb->tstamp could be 0 and it will get
      the (rcv) timestamp by ktime_get_real() when needed.
      
      There are two cases for tapping at ingress:
      
      In one case, af_packet queues the skb to its sk_receive_queue.
      The skb is either not shared or is a newly created clone.  The newly
      added skb_clear_delivery_time() is called to clear the
      delivery_time (if any) and set the (rcv) timestamp if
      needed before the skb is queued to the sk_receive_queue.
      
      In the other case, the ingress skb is directly copied to the rx_ring
      and tpacket_get_timestamp() is used to get the (rcv) timestamp.
      The newly added skb_tstamp() is used in tpacket_get_timestamp()
      to check the skb->mono_delivery_time bit before returning skb->tstamp.
      As mentioned earlier, the tapping@ingress has already expected
      the skb may not have the (rcv) timestamp (because no sk has asked
      for it) and has handled this case by directly calling ktime_get_real().
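
The two helpers named above can be sketched in userspace C. The `model_*` names are hypothetical, and reading "if needed" as "the netstamp_needed_key static key is enabled" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct skb_model {
	uint64_t tstamp;
	unsigned int mono_delivery_time:1;
};

/* Stand-in for the netstamp_needed_key static key. */
static bool model_netstamp_needed;

/* Stand-in for ktime_get_real(): wall-clock time in nanoseconds. */
static uint64_t model_ktime_get_real(void)
{
	return (uint64_t)time(NULL) * 1000000000ULL;
}

/* Queueing case: drop the delivery_time and leave a (rcv) timestamp or 0. */
static void model_skb_clear_delivery_time(struct skb_model *skb)
{
	assert(skb);
	if (!skb->mono_delivery_time)
		return;
	skb->mono_delivery_time = 0;
	skb->tstamp = model_netstamp_needed ? model_ktime_get_real() : 0;
}

/* rx_ring case: never report a delivery_time as the (rcv) timestamp. */
static uint64_t model_skb_tstamp(const struct skb_model *skb)
{
	assert(skb);
	return skb->mono_delivery_time ? 0 : skb->tstamp;
}
```

A zero result from model_skb_tstamp() corresponds to the existing "no (rcv) timestamp available" case, which the message says the tapping path already handles by calling ktime_get_real() itself.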
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add skb_clear_tstamp() to keep the mono delivery_time · de799101
      Martin KaFai Lau authored
      Right now, skb->tstamp is reset to 0 whenever the skb is forwarded.
      
      If skb->tstamp has the mono delivery_time, clearing it can hurt
      the performance when it finally transmits out to fq@phy-dev.
      
      The earlier patch added a skb->mono_delivery_time bit to
      flag the skb->tstamp carrying the mono delivery_time.
      
      This patch adds skb_clear_tstamp() helper which keeps
      the mono delivery_time and clears everything else.
      
      The delivery_time clearing will be postponed until the stack knows the
      skb will be delivered locally.  It will be done in a later patch.
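
A minimal userspace sketch of the helper's rule, with a hypothetical `struct skb_model` in place of sk_buff: a tstamp flagged as mono delivery_time survives, any other tstamp is zeroed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct skb_model {
	uint64_t tstamp;
	unsigned int mono_delivery_time:1;
};

/* Forwarding-time reset: keep a mono delivery_time, clear everything else. */
static void model_skb_clear_tstamp(struct skb_model *skb)
{
	assert(skb);
	if (skb->mono_delivery_time)
		return;
	skb->tstamp = 0;
}
```

Calling it at each forwarding hop therefore leaves the EDT intact for the fq qdisc on the final device, while a stale (rcv) timestamp is still dropped.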
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau authored
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp was also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to decide when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell whether skb->tstamp holds the (rcv) timestamp
      or the delivery_time, so it is always reset to 0 whenever the skb is forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egresses to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                           ^              ^
                                           reset          reset
      
      This patch adds one bit skb->mono_delivery_time to flag that skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) and
      to be used with sch_fq.  A later patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      skb_set_delivery_time() function is added.  It is used by the tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
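
A sketch of the setter in userspace C, using a hypothetical `struct skb_model`. It shows the eventual behavior, with the bit set only for a non-zero mono time; that detail is an assumption of this sketch, and in this prep patch the bit is still left at 0 as a placeholder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sk_buff fields involved. */
struct skb_model {
	uint64_t tstamp;
	unsigned int mono_delivery_time:1;
};

/* Store the delivery time and flag it as mono when it is non-zero. */
static void model_skb_set_delivery_time(struct skb_model *skb, uint64_t kt,
					bool mono)
{
	assert(skb);
	skb->tstamp = kt;
	skb->mono_delivery_time = kt && mono;
}
```

Tying the bit to a non-zero time means a zero tstamp can never be mistaken for a valid delivery_time.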
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a
      placeholder.  It will be enabled in a later patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdf
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'dsa-unicast-filtering' · 6fb8661c
      David S. Miller authored
      Vladimir Oltean says:
      
      ====================
      DSA unicast filtering
      
      This series doesn't attempt anything extremely brave, it just changes
      the way in which standalone ports which support FDB isolation work.
      
      Up until now, DSA has recommended that switch drivers configure
      standalone ports in a separate VID/FID with learning disabled, and with
      the CPU port as the only destination, reached trivially via flooding.
      That works, except that standalone ports will deliver all packets to the
      CPU. We can leverage the hardware FDB as a MAC DA filter, and disable
      flooding towards the CPU port, to force the dropping of packets with
      unknown MAC DA.
      
      We handle port promiscuity by re-enabling flooding towards the CPU port.
      This is relevant because the bridge puts its automatic (learning +
      flooding) ports in promiscuous mode, and this makes some things work
      automagically, like for example bridging with a foreign interface.
      We don't delve yet into the territory of managing CPU flooding more
      aggressively while under a bridge.
      
      The only switch driver that benefits from this work right now is the
      NXP LS1028A switch (felix). The others need to implement FDB isolation
      first, before DSA is going to install entries to the port's standalone
      database. Otherwise, these entries might collide with bridge FDB/MDB
      entries.
      
      This work was done mainly to have all the required features in place
      before somebody starts seriously architecting DSA support for multiple
      CPU ports. Otherwise it is much more difficult to bolt these features on
      top of multiple CPU ports.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: mscc: ocelot: accept configuring bridge port flags on the NPI port · ac455209
      Vladimir Oltean authored
      In order for the Felix DSA driver to be able to turn on/off flooding
      towards its CPU port, we need to redirect calls on the NPI port to
      actually act upon the index in the analyzer block that corresponds to
      the CPU port module. This was never necessary until now because DSA
      (or the bridge) never called ocelot_port_bridge_flags() for the NPI
      port.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: stop clearing CPU flooding in felix_setup_tag_8021q · 0cc36980
      Vladimir Oltean authored
      felix_migrate_flood_to_tag_8021q_port() takes care of clearing the
      flooding bits on the old CPU port (which was the CPU port module), so
      manually clearing this bit from PGID_UC, PGID_MC, PGID_BC is redundant.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: start off with flooding disabled on the CPU port · 90897569
      Vladimir Oltean authored
      The driver probes with all ports as standalone, and it supports unicast
      filtering. So DSA will call port_fdb_add() for all necessary addresses
      on the current CPU port. We also handle migrations when the CPU port
      hardware resource changes (on tagging protocol change), so there should
      not be any unknown address that we have to receive while not promiscuous.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: migrate flood settings from NPI to tag_8021q CPU port · b903a6bd
      Vladimir Oltean authored
      When the tagging protocol changes from "ocelot" to "ocelot-8021q" or in
      reverse, the DSA promiscuity setting that was applied for the old CPU
      port must be transferred to the new one.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: felix: migrate host FDB and MDB entries when changing tag proto · f9cef64f
      Vladimir Oltean authored
      The "ocelot" and "ocelot-8021q" tagging protocols make use of different
      hardware resources, and host FDB entries have different destination
      ports in the switch analyzer module, practically speaking.
      
      So when the user requests a tagging protocol change, the driver must
      migrate all host FDB and MDB entries from the NPI port (in fact CPU port
      module) towards the same physical port, but this time used as a regular
      port.
      
      It is pointless for the felix driver to keep a copy of the host
      addresses, when we can create and export DSA helpers for walking through
      the addresses that it already needs to keep on the CPU port, for
      refcounting purposes.
      
      felix_classify_db() is moved up to avoid a forward declaration.
      
      We pass "bool change" because dp->fdbs and dp->mdbs are uninitialized
      lists when felix_setup() first calls felix_set_tag_protocol(), so we
      need to avoid calling dsa_port_walk_fdbs() during probe time.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: manage flooding on the CPU ports · 7569459a
      Vladimir Oltean authored
      DSA can treat IFF_PROMISC and IFF_ALLMULTI on standalone user ports as
      signifying whether packets with an unknown MAC DA will be received or
      not. Since known MAC DAs are handled by FDB/MDB entries, this means that
      promiscuity is analogous to including/excluding the CPU port from the
      flood domain of those packets.
      
      There are two ways to signal CPU flooding to drivers.
      
      The first (chosen here) is to synthesize a call to
      ds->ops->port_bridge_flags() for the CPU port, with a mask of
      BR_FLOOD | BR_MCAST_FLOOD. This has the effect of turning on egress
      flooding on the CPU port regardless of source.
      
      The alternative would be to create a new ds->ops->port_host_flood()
      which is called per user port. Some switches (sja1105) have a flood
      domain that is managed per {ingress port, egress port} pair, so it would
      make more sense for this kind of switch to not flood the CPU from port A
      if just port B requires it. Nonetheless, the sja1105 has other quirks
      that prevent it from making use of unicast filtering, and without a
      concrete user making use of this feature, I chose not to implement it.
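
The first approach can be illustrated with a small userspace model. All names and bit values below are model-local and hypothetical; the real IFF_* and BR_* constants and the flags structure live in kernel headers. Splitting the result so that IFF_ALLMULTI alone enables only multicast flooding is an assumption of this sketch.

```c
#include <assert.h>

/* Model-local bit values, not the kernel's. */
enum {
	M_IFF_PROMISC  = 1 << 0,
	M_IFF_ALLMULTI = 1 << 1,
};

enum {
	M_BR_FLOOD       = 1 << 0,
	M_BR_MCAST_FLOOD = 1 << 1,
};

/* Hypothetical stand-in for a user port's net_device. */
struct model_netdev {
	unsigned int flags;
};

/* Shape of a "port flags" request: which bits change, and their values. */
struct model_brport_flags {
	unsigned long val;
	unsigned long mask;
};

/* Translate a user port's rx flags into a CPU-port flooding request. */
static struct model_brport_flags model_host_flood(const struct model_netdev *dev)
{
	struct model_brport_flags flags = {
		.val = 0,
		.mask = M_BR_FLOOD | M_BR_MCAST_FLOOD,
	};

	assert(dev);
	if (dev->flags & M_IFF_PROMISC)
		flags.val = M_BR_FLOOD | M_BR_MCAST_FLOOD;
	else if (dev->flags & M_IFF_ALLMULTI)
		flags.val = M_BR_MCAST_FLOOD;
	return flags;
}
```

The mask always covers both flood bits, so a port leaving promiscuous mode produces a request that turns flooding back off.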
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: install the primary unicast MAC address as standalone port host FDB · 499aa9e1
      Vladimir Oltean authored
      To be able to safely turn off CPU flooding for standalone ports, we need
      to ensure that the dev_addr of each DSA slave interface is installed as
      a standalone host FDB entry for compatible switches.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: install secondary unicast and multicast addresses as host FDB/MDB · 5e8a1e03
      Vladimir Oltean authored
      In preparation of disabling flooding towards the CPU in standalone ports
      mode, identify the addresses requested by upper interfaces and use the
      new API for DSA FDB isolation to request the hardware driver to offload
      these as FDB or MDB objects. The objects belong to the user port's
      database, and are installed pointing towards the CPU port.
      
      Because dev_uc_add()/dev_mc_add() is VLAN-unaware, we offload to the
      port standalone database addresses with VID 0 (also VLAN-unaware).
      So this excludes switches with global VLAN filtering from supporting
      unicast filtering, because there, it is possible for a port of a switch
      to join a VLAN-aware bridge, and this changes the VLAN awareness of
      standalone ports, requiring VLAN-aware standalone host FDB entries.
      For the same reason, hellcreek, which requires VLAN awareness in
      standalone mode, is also exempted from unicast filtering.
      
      We create "standalone" variants of dsa_port_host_fdb_add() and
      dsa_port_host_mdb_add() (and the corresponding _del functions).
      
      We also create a separate work item type for handling deferred
      standalone host FDB/MDB entries compared to the switchdev one.
      This is done for the purpose of clarity - the procedure for offloading a
      bridge FDB entry is different than offloading a standalone one, and
      the switchdev event work handles only FDBs anyway, not MDBs.
      Deferral is needed for standalone entries because ndo_set_rx_mode runs
      in atomic context. We could probably optimize things a little by first
      queuing up all entries that need to be offloaded, and scheduling the
      work item just once, but the data structures that we can pass through
      __dev_uc_sync() and __dev_mc_sync() are limiting (there is nothing like
      a void *priv), so we'd have to keep the list of queued events somewhere
      in struct dsa_switch, and possibly a lock for it. Too complicated for
      now.
      
      Adding the address to the master is handled by dev_uc_sync(), adding it
      to the hardware is handled by __dev_uc_sync(). So this is the reason why
      dsa_port_standalone_host_fdb_add() does not call dev_uc_add(). Not that
      it had the rtnl_mutex anyway - ndo_set_rx_mode has it, but is atomic.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: rename the host FDB and MDB methods to contain the "bridge" namespace · 68d6d71e
      Vladimir Oltean authored
      We are preparing to add API in port.c that adds FDB and MDB entries that
      correspond to the port's standalone database. Rename the existing
      methods to make it clear that the FDB and MDB entries offloaded come
      from the bridge database.
      
      Since the function names lengthen in dsa_slave_switchdev_event_work(),
      we place "addr" and "vid" in temporary variables, to shorten those.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: dsa: remove workarounds for changing master promisc/allmulti only while up · 35aae5ab
      Vladimir Oltean authored
      Lennert Buytenhek explains in commit df02c6ff ("dsa: fix master
      interface allmulti/promisc handling"), dated Nov 2008, that changing the
      promiscuity of interfaces that are down (here the master) is broken.
      
      This fact regarding promisc/allmulti has changed since commit
      b6c40d68 ("net: only invoke dev->change_rx_flags when device is UP")
      by Vlad Yasevich, dated Nov 2013.
      
      Therefore, DSA now has unnecessary complexity to handle master state
      transitions from down to up. In fact, syncing the unicast and multicast
      addresses can happen completely asynchronously to the administrative
      state changes.
      
      This change reduces that complexity by effectively fully reverting
      commit df02c6ff ("dsa: fix master interface allmulti/promisc
      handling").
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ice: add TTY for GNSS module for E810T device · 43113ff7
      Karol Kolacinski authored
      Add a new ice_gnss.c file for holding the basic GNSS module functions.
      If the device supports GNSS module, call the new ice_gnss_init and
      ice_gnss_release functions where appropriate.
      
      Implement basic functionality for reading the data from GNSS module
      using TTY device.
      
      Add I2C read AQ command. It is now required for controlling the external
      physical connectors via external I2C port expander on E810-T adapters.
      
      Future changes will introduce write functionality.
      Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
      Signed-off-by: Sudhansu Sekhar Mishra <sudhansu.mishra@intel.com>
      Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com>
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • Merge branch 'nfc-llcp-cleanups' · ef132dc4
      David S. Miller authored
      Krzysztof Kozlowski says:
      
      ====================
      nfc: llcp: few cleanups/improvements
      
      These are improvements, not fixing any experienced issue, just looking correct
      to me from the code point of view.
      
      Changes since v1
      ================
      1. Split from the fix.
      
      Testing
      =======
      Under QEMU only. The NFC/LLCP code was not really tested on a device.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfc: llcp: Revert "NFC: Keep socket alive until the DISC PDU is actually sent" · 44cd5765
      Krzysztof Kozlowski authored
      This reverts commit 17f7ae16.
      
      The commit brought a new socket state LLCP_DISCONNECTING, which was
      never set, only read, so the socket could never be set to such a state.
      
      Remove the dead code.
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nfc: llcp: protect nfc_llcp_sock_unlink() calls · a06b8044
      Krzysztof Kozlowski authored
      nfc_llcp_sock_link() is called in all paths (bind/connect) as a last
      action, still protected with lock_sock().  When cleaning up in
      llcp_sock_release(), call nfc_llcp_sock_unlink() in a mirrored way:
      earlier and still under the lock_sock().
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a06b8044
    • Krzysztof Kozlowski's avatar
      nfc: llcp: use test_bit() · a7364912
      Krzysztof Kozlowski authored
      Use test_bit() instead of open-coding it, just like in other places
      touching the bitmap.
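
      As a generic illustration (not the llcp code itself), the difference
      between an open-coded check and a test_bit()-style helper can be sketched
      in userspace C.  sketch_test_bit() is a hypothetical stand-in for the
      kernel helper and ignores its atomicity and instrumentation aspects:

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Open-coded variant: every caller repeats the word/mask arithmetic. */
static bool sketch_bit_open_coded(const unsigned long *map, unsigned int nr)
{
	return (map[nr / SKETCH_BITS_PER_LONG] &
		(1UL << (nr % SKETCH_BITS_PER_LONG))) != 0;
}

/* Helper variant: the arithmetic lives in one place. */
static bool sketch_test_bit(unsigned int nr, const unsigned long *map)
{
	return (map[nr / SKETCH_BITS_PER_LONG] >>
		(nr % SKETCH_BITS_PER_LONG)) & 1UL;
}
```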
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a7364912
    • Krzysztof Kozlowski's avatar
      nfc: llcp: use centralized exiting of bind on errors · 4dbbf673
      Krzysztof Kozlowski authored
      Coding style encourages centralized exiting of functions, so rewrite
      llcp_sock_bind() error paths to use such a pattern.  This reduces the
      duplicated cleanup code, makes the success path visually shorter, and
      also cleans up on errors in the proper order (the reverse of
      initialization).
      
      No functional impact expected.
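
      The pattern can be sketched as follows.  This is a minimal userspace
      model, not llcp_sock_bind() itself: the three steps, the trace buffer
      and all sketch_* names are hypothetical, and exist only to show that the
      labels release resources in the reverse order of acquisition:

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_MAX_EVENTS 8

enum { SKETCH_STEP_DEV = 1, SKETCH_STEP_LOCAL, SKETCH_STEP_NAME };

/* Records acquisitions as +step and releases as -step. */
struct sketch_trace {
	int events[SKETCH_MAX_EVENTS];
	size_t n;
};

static void sketch_record(struct sketch_trace *t, int ev)
{
	if (t->n < SKETCH_MAX_EVENTS)
		t->events[t->n++] = ev;
}

/* Pretend to acquire a resource; fail when step == fail_at. */
static int sketch_acquire(struct sketch_trace *t, int step, int fail_at)
{
	if (step == fail_at)
		return -ENOMEM;
	sketch_record(t, step);
	return 0;
}

static void sketch_release(struct sketch_trace *t, int step)
{
	sketch_record(t, -step);
}

static int sketch_bind(struct sketch_trace *t, int fail_at)
{
	int err;

	err = sketch_acquire(t, SKETCH_STEP_DEV, fail_at);
	if (err)
		goto out;

	err = sketch_acquire(t, SKETCH_STEP_LOCAL, fail_at);
	if (err)
		goto put_dev;

	err = sketch_acquire(t, SKETCH_STEP_NAME, fail_at);
	if (err)
		goto put_local;

	return 0;	/* success path: nothing is released */

put_local:
	sketch_release(t, SKETCH_STEP_LOCAL);
put_dev:
	sketch_release(t, SKETCH_STEP_DEV);
out:
	return err;
}
```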
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4dbbf673
    • Krzysztof Kozlowski's avatar
      nfc: llcp: simplify llcp_sock_connect() error paths · ec10fd15
      Krzysztof Kozlowski authored
      The llcp_sock_connect() error paths were using a mixed way of central
      exit (goto) and cleanup
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec10fd15
    • Krzysztof Kozlowski's avatar
      nfc: llcp: nullify llcp_sock->dev on connect() error paths · 13a3585b
      Krzysztof Kozlowski authored
      Nullify the llcp_sock->dev on llcp_sock_connect() error paths,
      symmetrically to the code in llcp_sock_bind().  The non-NULL value of
      llcp_sock->dev is used in a few places to check whether the socket is
      still valid.
      
      There was no particular issue observed with missing NULL assignment in
      connect() error path, however a similar case - in the bind() error path
      - was triggerable.  That one was fixed in commit 4ac06a1e ("nfc:
      fix NULL ptr dereference in llcp_sock_getname() after failed connect"),
      so the change here seems logical as well.
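
      A minimal sketch of why the NULL assignment matters, assuming a later
      call uses the pointer as a validity check.  The structures, the
      sketch_* functions and the -1 error value are hypothetical and only
      model the idea, not the real llcp socket code:

```c
#include <stddef.h>

struct sketch_dev {
	int refcnt;
};

struct sketch_sock {
	struct sketch_dev *dev;	/* non-NULL means "socket is usable" */
};

static int sketch_connect(struct sketch_sock *sk, struct sketch_dev *dev,
			  int fail)
{
	int err;

	dev->refcnt++;
	sk->dev = dev;

	if (fail) {
		err = -1;
		goto put_dev;
	}

	return 0;

put_dev:
	dev->refcnt--;
	sk->dev = NULL;	/* without this, sk->dev would dangle */
	return err;
}

/* A later call that relies on sk->dev to decide whether the socket is valid. */
static int sketch_getname(const struct sketch_sock *sk)
{
	if (sk->dev == NULL)
		return -1;
	return 0;
}
```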
      Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      13a3585b
    • David S. Miller's avatar
      Merge branch 'net-hw-counters-for-soft-devices' · ca0a53dc
      David S. Miller authored
      Ido Schimmel says:
      
      ====================
      HW counters for soft devices
      
      Petr says:
      
      Offloading switch device drivers may be able to collect statistics of the
      traffic taking place in the HW datapath that pertains to a certain soft
      netdevice, such as a VLAN. In this patch set, add the necessary
      infrastructure to allow exposing these statistics to the offloaded
      netdevice in question, and add mlxsw offload.
      
      Across HW platforms, the counter itself very likely constitutes a limited
      resource, and the act of counting may have a performance impact. Therefore
      this patch set makes the HW statistics collection opt-in and togglable from
      userspace on a per-netdevice basis.
      
      Additionally, HW devices may have various limiting conditions under which
      they can realize the counter. Therefore it is also possible to query
      whether the requested counter is realized by any driver. In TC parlance,
      which is to a degree reused in this patch set, two values are recognized:
      "request" tracks whether the user enabled collecting HW statistics, and
      "used" tracks whether any HW statistics are actually collected.
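
      The relationship between the two values can be sketched as below.  This
      is only a model: the state structure, the driver callback type and the
      sketch_* names are hypothetical, and the real core asks drivers through
      notifiers rather than a plain array of callbacks:

```c
#include <stdbool.h>
#include <stddef.h>

/* "request": did the user ask for HW statistics on this netdevice? */
struct sketch_l3_stats_state {
	bool request;
};

/* Hypothetical driver hook: does this driver currently back the counter? */
typedef bool (*sketch_used_fn)(void);

static bool sketch_driver_yes(void) { return true; }
static bool sketch_driver_no(void) { return false; }

/* "used": requested, and at least one driver actually realizes the counter. */
static bool sketch_used(const struct sketch_l3_stats_state *st,
			const sketch_used_fn *drivers, size_t n)
{
	size_t i;

	if (!st->request)
		return false;

	for (i = 0; i < n; i++)
		if (drivers[i]())
			return true;

	return false;
}
```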
      
      In the past, this author has expressed the opinion that `a typical user
      doing "ip -s l sh", including various scripts, wants to see the full
      picture and not worry what's going on where'. While that would be nice,
      unfortunately it cannot work:
      
      - Packets that trap from the HW datapath to the SW datapath would be
        double counted.
      
        For a given netdevice, some traffic can be purely a SW artifact, and some
        may flow through the HW object corresponding to the netdevice. But some
        traffic can also get trapped to the SW datapath after bumping the HW
        counter. It is not clear how to make sure double-counting does not occur
        in the SW datapath in that case, while still making sure that a possibly
        divergent SW forwarding path gets bumped as appropriate.
      
        So simply adding HW and SW stats may work roughly, most of the time, but
        there are scenarios where the result is nonsensical.
      
      - HW devices will have limitations as to what type of traffic they can
        count.
      
        In case of mlxsw, which is part of this patch set, there is no reasonable
        way to count all traffic going through a certain netdevice, such as a
        VLAN netdevice enslaved to a bridge. It is however very simple to count
        traffic flowing through an L3 object, such as a VLAN netdevice with an IP
        address.
      
        Similarly for physical netdevices, the L3 object at which the counter is
        installed is the subport carrying untagged traffic.
      
        These are not "just counters". It is important that the user understands
        what is being counted. It would be incorrect to conflate these statistics
        with another existing statistics suite.
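
      The double-counting problem from the first point above can be made
      concrete with a small arithmetic sketch.  The traffic split and all
      sketch_* names are made up for illustration:

```c
#include <stdint.h>

struct sketch_traffic {
	uint64_t hw_forwarded;	/* counted in HW, never seen by SW */
	uint64_t hw_trapped;	/* counted in HW, then trapped to SW */
	uint64_t sw_only;	/* pure SW traffic, never hits the HW counter */
};

static uint64_t sketch_hw_counter(const struct sketch_traffic *t)
{
	return t->hw_forwarded + t->hw_trapped;
}

static uint64_t sketch_sw_counter(const struct sketch_traffic *t)
{
	return t->hw_trapped + t->sw_only;
}

/* What a naive "HW + SW" sum would report. */
static uint64_t sketch_naive_total(const struct sketch_traffic *t)
{
	return sketch_hw_counter(t) + sketch_sw_counter(t);
}

/* What actually went through the netdevice. */
static uint64_t sketch_actual_total(const struct sketch_traffic *t)
{
	return t->hw_forwarded + t->hw_trapped + t->sw_only;
}
```

      With 100 packets forwarded in HW, 10 trapped and 5 SW-only, the naive
      sum is 125 while only 115 packets were seen; the difference is exactly
      the trapped traffic.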
      
      To that end, this patch set introduces a statistics suite called "L3
      stats". This label should make it easy to understand what is being counted,
      and to decide whether a given device can or cannot implement this suite for
      some type of netdevice. At the same time, the code is written to make
      future extensions easy, should a device pop up that can implement a
      different flavor of statistics suite (say L2, or an address-family-specific
      suite).
      
      For example, using a work-in-progress iproute2[1], to turn on and then list
      the counters on a VLAN netdevice:
      
          # ip stats set dev swp1.200 l3_stats on
          # ip stats show dev swp1.200 group offload subgroup l3_stats
          56: swp1.200: group offload subgroup l3_stats on used on
      	RX:  bytes packets errors dropped  missed   mcast
      		0       0      0       0       0       0
      	TX:  bytes packets errors dropped carrier collsns
      		0       0      0       0       0       0
      
      The patchset progresses as follows:
      
      - Patch #1 is a cleanup.
      
      - In patch #2, remove the assumption that all LINK_OFFLOAD_XSTATS are
        dev-backed.
      
        The only attribute defined under the nest is currently
        IFLA_OFFLOAD_XSTATS_CPU_HIT. L3_STATS differs from CPU_HIT in that the
        driver that supplies the statistics is not the same as the driver that
        implements the netdevice. Make the code compatible with this in patch #2.
      
      - In patch #3, add the possibility to filter inside nests.
      
        The filter_mask field of RTM_GETSTATS header determines which
        top-level attributes should be included in the netlink response. This
        saves processing time by only including the bits that the user cares
        about instead of always dumping everything. This is doubly important
        for HW-backed statistics that would typically require a trip to the
        device to fetch the stats. In this patch, the UAPI is extended to
        allow filtering inside IFLA_STATS_LINK_OFFLOAD_XSTATS in particular,
        but the scheme is easily extensible to other nests as well.
      
      - In patch #4, propagate extack where we need it.
        In patch #5, make it possible to propagate errors from drivers to the
        user.
      
      - In patch #6, add the in-kernel APIs for keeping track of the new stats
        suite, and the notifiers that the core uses to communicate with the
        drivers.
      
      - In patch #7, add UAPI for obtaining the new stats suite.
      
      - In patch #8, add a new UAPI message, RTM_SETSTATS, which will carry
        the message to toggle the newly-added stats suite.
        In patch #9, add the toggle itself.
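
      The two-level filtering described for patch #3 can be sketched as
      follows.  The enum values are local mirrors of the UAPI attribute
      numbers as I understand them and should be treated as assumptions; the
      sketch_filters structure is hypothetical and is not how the kernel
      stores the masks:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the top-level IFLA_STATS_* attributes. */
enum {
	SKETCH_STATS_LINK_64 = 1,
	SKETCH_STATS_LINK_XSTATS,
	SKETCH_STATS_LINK_XSTATS_SLAVE,
	SKETCH_STATS_LINK_OFFLOAD_XSTATS,
	SKETCH_STATS_AF_SPEC,
};

/* Local stand-ins for children of the offload xstats nest. */
enum {
	SKETCH_XSTATS_CPU_HIT = 1,
	SKETCH_XSTATS_HW_S_INFO,
	SKETCH_XSTATS_L3_STATS,
};

/* Same shape as the UAPI filter-bit macro: attribute N maps to bit N-1. */
#define SKETCH_FILTER_BIT(attr) (1u << ((attr) - 1))

struct sketch_filters {
	uint32_t top_level;	/* which top-level attributes to dump */
	uint32_t offload_xstats;	/* which children of the nest to dump */
};

/* Emit a child only if both its nest and the child itself were selected. */
static bool sketch_emit_child(const struct sketch_filters *f,
			      unsigned int child)
{
	if (!(f->top_level &
	      SKETCH_FILTER_BIT(SKETCH_STATS_LINK_OFFLOAD_XSTATS)))
		return false;

	return (f->offload_xstats & SKETCH_FILTER_BIT(child)) != 0;
}
```

      Selecting only the metadata child then skips the counters themselves,
      which is what avoids the trip to the device.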
      
      At this point the core is ready for drivers to add support for the new
      stats suite.
      
      - In patches #10, #11 and #12, apply small tweaks to mlxsw code.
      
      - In patch #13, add support for L3 stats, which are realized as RIF
        counters.
      
      - Finally in patch #14, a selftest is added to the net/forwarding
        directory. Technically this is a HW-specific test, in that without a HW
        implementing the counters, it just will not pass. But devices that
        support L3 statistics at all are likely to be able to reuse this
        selftest, so it seems appropriate to put it in the general forwarding
        directory.
      
      We also have a netdevsim implementation, and a corresponding selftest that
      verifies specifically some of the core code. We intend to contribute these
      later. Interested parties can take a look at the raw code at [2].
      
      [1] https://github.com/pmachata/iproute2/commits/soft_counters
      [2] https://github.com/pmachata/linux_mlxsw/commits/petrm_soft_counters_2
      
      v2:
      - Patch #3:
          - Do not declare strict_start_type at the new policies, since they are
            used with nla_parse_nested() (sans _deprecated).
          - Use NLA_POLICY_NESTED to declare what the nest contents should be
          - Use NLA_POLICY_MASK instead of BITFIELD32 for the filtering
            attribute.
      - Patch #6:
          - s/monotonous/monotonic/ in commit message
          - Use a newly-added struct rtnl_hw_stats64 for stats transfer
      - Patch #7:
          - Use a newly-added struct rtnl_hw_stats64 for stats transfer
      - Patch #8:
          - Do not declare strict_start_type at the new policies, since they are
            used with nla_parse_nested() (sans _deprecated).
      - Patch #13:
          - Use a newly-added struct rtnl_hw_stats64 for stats transfer
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca0a53dc
    • Petr Machata's avatar
      selftests: forwarding: hw_stats_l3: Add a new test · ba95e793
      Petr Machata authored
      Add a test that verifies operation of L3 HW statistics.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba95e793
    • Petr Machata's avatar
      mlxsw: Add support for IFLA_OFFLOAD_XSTATS_L3_STATS · 8d0f7d3a
      Petr Machata authored
      Spectrum machines support L3 stats by binding a counter to a RIF, a
      hardware object representing a router interface. Recognize the netdevice
      notifier events, NETDEV_OFFLOAD_XSTATS_*, to support enablement,
      disablement, and reporting back to core.
      
      As a netdevice gains a RIF, if L3 stats are enabled, install the counters,
      and ping the core so that a userspace notification can be emitted.
      
      Similarly, as a netdevice loses a RIF, push the as-yet-unreported
      statistics to the core, so that they are not lost, and ping the core to
      emit userspace notification.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8d0f7d3a
    • Petr Machata's avatar
      mlxsw: Extract classification of router-related events to a helper · c1de13f9
      Petr Machata authored
      Several more events are coming in the following patches, and extending the
      if statement is getting awkward. Instead, convert it to a switch.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c1de13f9
    • Petr Machata's avatar
      mlxsw: spectrum_router: Drop mlxsw_sp arg from counter alloc/free functions · 9834e246
      Petr Machata authored
      The mlxsw_sp reference is carried by the mlxsw_sp_rif object that is passed
      to these functions as well. Just deduce the former from the latter,
      and drop the explicit mlxsw_sp parameter. Adapt callers.
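
      The shape of the change can be sketched like this.  The structures and
      sketch_* functions are simplified, hypothetical stand-ins for the mlxsw
      ones; the point is only that the device reference is reached through
      the RIF rather than passed separately:

```c
#include <stdbool.h>

struct sketch_sp {
	unsigned int counters_in_use;
};

struct sketch_rif {
	struct sketch_sp *sp;	/* back-reference carried by the RIF */
	bool counter_valid;
};

/* Previously this would have taken (sp, rif); now rif alone suffices. */
static int sketch_rif_counter_alloc(struct sketch_rif *rif)
{
	struct sketch_sp *sp = rif->sp;

	if (rif->counter_valid)
		return 0;

	sp->counters_in_use++;
	rif->counter_valid = true;
	return 0;
}

static void sketch_rif_counter_free(struct sketch_rif *rif)
{
	if (!rif->counter_valid)
		return;

	rif->sp->counters_in_use--;
	rif->counter_valid = false;
}
```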
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9834e246
    • Petr Machata's avatar
      mlxsw: reg: Fix packing of router interface counters · 8fe96f58
      Petr Machata authored
      The function mlxsw_reg_ritr_counter_pack() formats a register to configure
      a router interface (RIF) counter. The parameter `egress' determines whether
      an ingress or egress counter is to be configured. RITR, the register in
      question, has two sets of counter-related fields: one for ingress, one for
      egress. When setting values of the fields, the function sets the proper
      counter index field, but when setting the counter type, it always sets the
      egress field. Thus configuration of ingress counters is broken, and in fact
      an attempt to configure an ingress counter mangles a previously configured
      egress counter.
      
      This was never discovered, because there is currently no way to enable
      ingress counters on a router interface, only the egress one.
      
      Fix in an obvious way.
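
      The bug and the fix can be modelled in a few lines.  The register is
      reduced to a plain struct and every sketch_* name is hypothetical; the
      real function packs bit fields of the RITR register payload instead:

```c
#include <stdbool.h>
#include <stdint.h>

enum { SKETCH_COUNTER_DISABLED = 0, SKETCH_COUNTER_BASIC = 1 };

struct sketch_ritr {
	uint32_t ingress_index;
	uint8_t ingress_type;
	uint32_t egress_index;
	uint8_t egress_type;
};

/* Buggy shape: index goes to the right side, type always to egress. */
static void sketch_counter_pack_buggy(struct sketch_ritr *reg, uint32_t index,
				      bool enable, bool egress)
{
	uint8_t type = enable ? SKETCH_COUNTER_BASIC : SKETCH_COUNTER_DISABLED;

	reg->egress_type = type;
	if (egress)
		reg->egress_index = index;
	else
		reg->ingress_index = index;
}

/* Fixed shape: both type and index follow the `egress' parameter. */
static void sketch_counter_pack(struct sketch_ritr *reg, uint32_t index,
				bool enable, bool egress)
{
	uint8_t type = enable ? SKETCH_COUNTER_BASIC : SKETCH_COUNTER_DISABLED;

	if (egress) {
		reg->egress_type = type;
		reg->egress_index = index;
	} else {
		reg->ingress_type = type;
		reg->ingress_index = index;
	}
}
```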
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fe96f58
    • Petr Machata's avatar
      net: rtnetlink: Add UAPI toggle for IFLA_OFFLOAD_XSTATS_L3_STATS · 5fd0b838
      Petr Machata authored
      The offloaded HW stats are designed to allow per-netdevice enablement and
      disablement. Add an attribute, IFLA_STATS_SET_OFFLOAD_XSTATS_L3_STATS,
      which should be carried by the RTM_SETSTATS message, and expresses a desire
      to toggle L3 offload xstats on or off.
      
      As part of the above, add an exported function rtnl_offload_xstats_notify()
      that drivers can use when they have installed or deinstalled the counters
      backing the HW stats.
      
      At this point, it is possible to enable, disable and query L3 offload
      xstats on netdevices. (However there is no driver actually implementing
      these.)
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fd0b838
    • Petr Machata's avatar
      net: rtnetlink: Add RTM_SETSTATS · 03ba3566
      Petr Machata authored
      The offloaded HW stats are designed to allow per-netdevice enablement and
      disablement. These stats are only accessible through RTM_GETSTATS, and
      therefore should be toggled by a RTM_SETSTATS message. Add it, and the
      necessary skeleton handler.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      03ba3566
    • Petr Machata's avatar
      net: rtnetlink: Add UAPI for obtaining L3 offload xstats · 0e7788fd
      Petr Machata authored
      Add a new IFLA_STATS_LINK_OFFLOAD_XSTATS child attribute,
      IFLA_OFFLOAD_XSTATS_L3_STATS, to carry statistics for traffic that takes
      place in a HW router.
      
      The offloaded HW stats are designed to allow per-netdevice enablement and
      disablement. Additionally, as a netdevice is configured, it may become or
      cease being suitable for binding of a HW counter. Both of these aspects
      need to be communicated to the userspace. To that end, add another child
      attribute, IFLA_OFFLOAD_XSTATS_HW_S_INFO:
      
          - attr nest IFLA_OFFLOAD_XSTATS_HW_S_INFO
            - attr nest IFLA_OFFLOAD_XSTATS_L3_STATS
              - attr IFLA_OFFLOAD_XSTATS_HW_S_INFO_REQUEST
                - {0,1} as u8
              - attr IFLA_OFFLOAD_XSTATS_HW_S_INFO_USED
                - {0,1} as u8
      
      Thus this one attribute is a nest that can be used to carry information
      about various types of HW statistics, and indexing is very simply done by
      wrapping the information for a given statistics suite into the attribute
      that carries the suite in the RTM_GETSTATS query. At the same time, because
      _HW_S_INFO is nested directly below IFLA_STATS_LINK_OFFLOAD_XSTATS, it is
      possible through filtering to request only the metadata about individual
      statistics suites, without having to hit the HW to get the actual counters.
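
      The nest layout above can be sketched as a hand-rolled encoder of
      netlink-style attributes (a 16-bit length and 16-bit type header, with
      payloads padded to 4 bytes).  The attribute numbers and the sketch_*
      helpers are hypothetical, and details such as the nested flag on the
      type field are left out:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_HDRLEN 4u
#define SKETCH_ALIGN(len) (((len) + 3u) & ~3u)

/* Hypothetical attribute numbers, for illustration only. */
enum { SKETCH_HW_S_INFO = 1, SKETCH_L3_STATS, SKETCH_REQUEST, SKETCH_USED };

/* Write a header with a placeholder length; return where it starts. */
static size_t sketch_nest_start(uint8_t *buf, size_t *off, uint16_t type)
{
	size_t start = *off;
	uint16_t len = 0;

	memcpy(buf + start, &len, sizeof(len));
	memcpy(buf + start + 2, &type, sizeof(type));
	*off += SKETCH_HDRLEN;
	return start;
}

/* Patch the nest length once all children have been written. */
static void sketch_nest_end(uint8_t *buf, size_t off, size_t start)
{
	uint16_t len = (uint16_t)(off - start);

	memcpy(buf + start, &len, sizeof(len));
}

static void sketch_put_u8(uint8_t *buf, size_t *off, uint16_t type,
			  uint8_t val)
{
	uint16_t len = SKETCH_HDRLEN + 1;

	memcpy(buf + *off, &len, sizeof(len));
	memcpy(buf + *off + 2, &type, sizeof(type));
	buf[*off + SKETCH_HDRLEN] = val;
	memset(buf + *off + SKETCH_HDRLEN + 1, 0, 3);	/* padding */
	*off += SKETCH_ALIGN(len);
}

/* HW_S_INFO { L3_STATS { REQUEST u8, USED u8 } }; returns bytes written. */
static size_t sketch_build_hw_s_info(uint8_t *buf, uint8_t request,
				     uint8_t used)
{
	size_t off = 0;
	size_t info = sketch_nest_start(buf, &off, SKETCH_HW_S_INFO);
	size_t l3 = sketch_nest_start(buf, &off, SKETCH_L3_STATS);

	sketch_put_u8(buf, &off, SKETCH_REQUEST, request);
	sketch_put_u8(buf, &off, SKETCH_USED, used);
	sketch_nest_end(buf, off, l3);
	sketch_nest_end(buf, off, info);
	return off;
}
```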
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0e7788fd
    • Petr Machata's avatar
      net: dev: Add hardware stats support · 9309f97a
      Petr Machata authored
      Offloading switch device drivers may be able to collect statistics of the
      traffic taking place in the HW datapath that pertains to a certain soft
      netdevice, such as VLAN. Add the necessary infrastructure to allow exposing
      these statistics to the offloaded netdevice in question. The API was shaped
      by the following considerations:
      
      - Collection of HW statistics is not free: there may be a finite number of
        counters, and the act of counting may have a performance impact. It is
        therefore necessary to allow toggling whether HW counting should be done
        for any particular SW netdevice.
      
      - As the drivers are loaded and removed, a particular device may get
        offloaded and unoffloaded again. At the same time, the statistics values
        need to stay monotonic (modulo the eventual 64-bit wraparound),
        increasing only to reflect traffic measured in the device.
      
        To that end, the netdevice keeps around a lazily-allocated copy of struct
        rtnl_link_stats64. Device drivers then contribute to the values kept
        therein at various points. Even as the driver goes away, the struct stays
        around to maintain the statistics values.
      
      - Different HW devices may be able to count different things. The
        motivation behind this patch in particular is exposure of HW counters on
        Nvidia Spectrum switches, where the only practical approach to counting
        traffic on offloaded soft netdevices currently is to use router interface
        counters, and count L3 traffic. Correspondingly that is the statistics
        suite added in this patch.
      
        Other devices may be able to measure different kinds of traffic, and for
        that reason, the APIs are built to allow uniform access to different
        statistics suites.
      
      - Because soft netdevices and offloading drivers are only loosely bound, a
        netdevice uses a notifier chain to communicate with the drivers. Several
        new notifiers, NETDEV_OFFLOAD_XSTATS_*, have been added to carry messages
        to the offloading drivers.
      
      - Devices can have various conditions for when a particular counter is
        available. As the device is configured and reconfigured, the device
        offload may become or cease being suitable for counter binding. A
        netdevice can use a notifier type NETDEV_OFFLOAD_XSTATS_REPORT_USED to
        ping offloading drivers and determine whether anyone currently implements
        a given statistics suite. This information can then be propagated to user
        space.
      
        When the driver decides to unoffload a netdevice, it can use a
        newly-added function, netdev_offload_xstats_report_delta(), to record
        outstanding collected statistics, before destroying the HW counter.
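
      The monotonicity requirement and the delta reporting described above
      can be sketched as follows.  This is a simplified model with
      hypothetical sketch_* names and only two counters; it does not mirror
      the actual kernel structures or locking:

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_hw_stats {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

struct sketch_netdev {
	struct sketch_hw_stats saved;	/* survives offload/unoffload cycles */
	struct sketch_hw_stats hw;	/* live HW counter, valid while bound */
	bool bound;
};

/* Fold outstanding HW values into the copy kept by the netdevice. */
static void sketch_report_delta(struct sketch_hw_stats *saved,
				const struct sketch_hw_stats *delta)
{
	saved->rx_packets += delta->rx_packets;
	saved->tx_packets += delta->tx_packets;
}

/* A fresh HW counter starts from zero. */
static void sketch_offload(struct sketch_netdev *dev)
{
	dev->hw.rx_packets = 0;
	dev->hw.tx_packets = 0;
	dev->bound = true;
}

/* Record what the counter collected before it is destroyed. */
static void sketch_unoffload(struct sketch_netdev *dev)
{
	if (!dev->bound)
		return;

	sketch_report_delta(&dev->saved, &dev->hw);
	dev->bound = false;
}

/* Reported value: saved history plus whatever the live counter holds. */
static struct sketch_hw_stats sketch_read(const struct sketch_netdev *dev)
{
	struct sketch_hw_stats total = dev->saved;

	if (dev->bound)
		sketch_report_delta(&total, &dev->hw);
	return total;
}
```

      Across an offload, unoffload and re-offload sequence the value returned
      by sketch_read() never decreases, even though each HW counter restarts
      at zero.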
      
      This patch adds a helper, call_netdevice_notifiers_info_robust(), for
      dispatching a notifier with the possibility of unwind when one of the
      consumers bails. Given the wish to eventually get rid of the global
      notifier block altogether, this helper only invokes the per-netns notifier
      block.
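
      The "dispatch with unwind" idea can be sketched as below.  The event
      names, the callback type and the two sample consumers are hypothetical;
      the real helper works on notifier blocks, not on a plain array:

```c
#include <stddef.h>

enum sketch_event { SKETCH_ENABLE, SKETCH_DISABLE };

typedef int (*sketch_notifier_fn)(enum sketch_event ev, void *ctx);

/* Sample consumer: tracks how many consumers are currently enabled. */
static int sketch_ok_consumer(enum sketch_event ev, void *ctx)
{
	int *active = ctx;

	*active += (ev == SKETCH_ENABLE) ? 1 : -1;
	return 0;
}

/* Sample consumer that bails on enablement. */
static int sketch_failing_consumer(enum sketch_event ev, void *ctx)
{
	(void)ctx;
	return ev == SKETCH_ENABLE ? -1 : 0;
}

/*
 * Deliver SKETCH_ENABLE to each consumer in order.  If one fails, deliver
 * SKETCH_DISABLE to those that already succeeded, in reverse order, and
 * return the error.
 */
static int sketch_call_chain_robust(sketch_notifier_fn *chain, size_t n,
				    void *ctx)
{
	size_t i;
	int err = 0;

	for (i = 0; i < n; i++) {
		err = chain[i](SKETCH_ENABLE, ctx);
		if (err)
			break;
	}
	if (!err)
		return 0;

	while (i-- > 0)
		chain[i](SKETCH_DISABLE, ctx);
	return err;
}
```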
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9309f97a