1. 12 May, 2022 32 commits
    • net: dsa: felix: reimplement tagging protocol change with function pointers · 7a29d220
      Vladimir Oltean authored
      The error handling for the current tagging protocol change procedure is
      a bit brittle (we dismantle the previous tagging protocol entirely
      before setting up the new one). By identifying which parts of a tagging
      protocol are unique to itself and which parts are shared with the other,
      we can implement a protocol change procedure where error handling is a
      bit more robust, because we start setting up the new protocol first, and
      tear down the old one only after the setup of the specific and shared
      parts succeeded.
      
      The protocol change is a bit too open-coded too, in the area of
      migrating host flood settings and MDBs. By identifying what differs
      between tagging protocols (the forwarding masks for host flooding) we
      can implement a more straightforward migration procedure which is
      handled in the shared portion of the protocol change, rather than
      individually by each protocol.
      
      Therefore, a more structured approach calls for the introduction of a
      structure of function pointers per tagging protocol. This covers setup,
      teardown and the host forwarding mask. In the future it will also cover
      how to prepare for a new DSA master.
      
      The initial tagging protocol setup (at driver probe time) and the final
      teardown (at driver removal time) are also adapted to call into the
      structured methods of the specific protocol in current use. This is
      especially relevant for teardown, where we previously called
      felix_del_tag_protocol() only for the first CPU port. But by not
      specifying which CPU port this is for, we gain more flexibility to
      support multiple CPU ports in the future.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
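The procedure described in this commit (set up the new protocol first, migrate the shared state, and tear down the old protocol only after that succeeded) can be sketched in user-space C as follows. This is a minimal illustration, not the felix driver code: `struct sw`, `struct tag_proto_ops`, `change_tag_proto()` and the three toy protocols are hypothetical names, and the forwarding mask values are arbitrary.

```c
#include <errno.h>
#include <stddef.h>

struct sw;

/* Per-protocol methods: setup, teardown and the host forwarding mask. */
struct tag_proto_ops {
	int (*setup)(struct sw *sw);
	void (*teardown)(struct sw *sw);
	unsigned long (*get_host_fwd_mask)(struct sw *sw);
};

/* Hypothetical stand-in for the driver state; not the real struct felix. */
struct sw {
	const struct tag_proto_ops *cur; /* protocol currently in use */
	unsigned long host_fwd_mask;     /* shared state, migrated on change */
	int teardown_count;              /* bookkeeping for the demo only */
};

int change_tag_proto(struct sw *sw, const struct tag_proto_ops *new_ops)
{
	const struct tag_proto_ops *old_ops = sw->cur;
	int err;

	/* The protocol-specific part of the new protocol comes first. */
	err = new_ops->setup(sw);
	if (err)
		return err; /* the old protocol is still fully in place */

	/* Shared part: only the forwarding mask differs between protocols. */
	sw->host_fwd_mask = new_ops->get_host_fwd_mask(sw);

	/* Only now is the old protocol dismantled. */
	if (old_ops)
		old_ops->teardown(sw);
	sw->cur = new_ops;
	return 0;
}

/* Toy protocols (assumptions, for illustration only). */
static int setup_ok(struct sw *sw) { (void)sw; return 0; }
static int setup_fail(struct sw *sw) { (void)sw; return -EIO; }
static void teardown_note(struct sw *sw) { sw->teardown_count++; }
static unsigned long mask_a(struct sw *sw) { (void)sw; return 0x20; }
static unsigned long mask_b(struct sw *sw) { (void)sw; return 0x30; }

const struct tag_proto_ops proto_a = { setup_ok, teardown_note, mask_a };
const struct tag_proto_ops proto_b = { setup_ok, teardown_note, mask_b };
const struct tag_proto_ops proto_broken = { setup_fail, teardown_note, mask_b };
```

With this ordering, a failed `change_tag_proto(&s, &proto_broken)` leaves `s.cur` and `s.host_fwd_mask` exactly as they were, which is the robustness property the commit message is after.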
    • net: dsa: felix: dynamically determine tag_8021q CPU port for traps · c352e5e8
      Vladimir Oltean authored
      Ocelot switches support a single active CPU port at a time (at least as
      a trapping destination, i.e. for control traffic). This is true
      regardless of whether we are using the native copy-to-CPU-port-module
      functionality, or a redirect action towards the software-defined
      tag_8021q CPU port.
      
      Currently we assume that the trapping destination in tag_8021q mode is
      the first CPU port, yet in the future we may want to migrate the user
      ports to the second CPU port.
      
      For that to work, we need to make sure that the tag_8021q trapping
      destination is a CPU port that is active, i.e. is used by at least some
      user port on which the trap was added. Otherwise, we may end up
      redirecting the traffic to a CPU port which isn't even up.
      
      Note that due to the current design where we simply choose the CPU port
      of the first port from the trap's ingress port mask, it may be that a
      CPU port absorbs control traffic from user ports which aren't affine to
      it as per user space's request. This isn't ideal, but is the lesser of
      two evils. Following the user-configured affinity for traps would mean
      that we can no longer reuse a single TCAM entry for multiple traps,
      which is what we actually do for e.g. PTP. Either we duplicate and
      deduplicate TCAM entries on the fly when user-to-CPU-port mappings
      change (which is unnecessarily complicated), or we redirect trapped
      traffic to all tag_8021q CPU ports if multiple such ports are in use.
      The latter would have actually been nice, if it actually worked, but it
      doesn't, since an OCELOT_MASK_MODE_REDIRECT action towards multiple ports
      would not take PGID_SRC into consideration, and it would just duplicate
      the packet towards each (CPU) port, leading to duplicates in software.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: remove port argument from ->change_tag_protocol() · bacf93b0
      Vladimir Oltean authored
      DSA has not supported (and probably will not support in the future
      either) independent tagging protocols per CPU port.
      
      Different switch drivers have different requirements, some may need to
      replicate some settings for each CPU port, some may need to apply some
      settings on a single CPU port, while some may have to configure some
      global settings and then some per-CPU-port settings.
      
      In any case, the current model where DSA calls ->change_tag_protocol for
      each CPU port turns out to be impractical for drivers where there are
      global things to be done. For example, felix calls dsa_tag_8021q_register(),
      which makes no sense per CPU port, so it suppresses the second call.
      
      Let drivers deal with replication towards all CPU ports, and remove the
      CPU port argument from the function prototype.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Acked-by: Luiz Angelo Daros de Luca <luizluca@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: felix: manage host flooding using a specific driver callback · 72c3b0c7
      Vladimir Oltean authored
      At the time - commit 7569459a ("net: dsa: manage flooding on the CPU
      ports") - not introducing a dedicated switch callback for host flooding
      made sense, because for the only user, the felix driver, there was
      nothing different to do for the CPU port than set the flood flags on the
      CPU port just like on any other bridge port.
      
      There are 2 reasons why this approach is not good enough, however.
      
      (1) Other drivers, like sja1105, support configuring flooding as a
          function of {ingress port, egress port}, whereas the DSA
          ->port_bridge_flags() function only operates on an egress port.
          So with that driver we'd have useless host flooding from user ports
          which don't need it.
      
      (2) Even with the felix driver, support for multiple CPU ports makes it
          difficult to piggyback on ->port_bridge_flags(). The way in which
          the felix driver is going to support host-filtered addresses with
          multiple CPU ports is that it will direct these addresses towards
          both CPU ports (in a sort of multicast fashion), then restrict the
          forwarding to only one of the two using the forwarding masks.
          Consequently, flooding will also be enabled towards both CPU ports.
          However, ->port_bridge_flags() gets passed the index of a single CPU
          port, and that leaves the flood settings out of sync between the 2
          CPU ports.
      
      This is to say, it's better to have a specific driver method for host
      flooding, which takes the user port as argument. This solves problem (1)
      by allowing the driver to do different things for different user ports,
      and problem (2) by abstracting the operation and letting the driver do
      whatever, rather than explicitly making the DSA core point to the CPU
      port it thinks needs to be touched.
      
      This new method also creates a problem, which is that cross-chip setups
      are not handled. However I don't have hardware right now where I can
      test what is the proper thing to do, and there isn't hardware compatible
      with multi-switch trees that supports host flooding. So it remains a
      problem to be tackled in the future.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: introduce the dsa_cpu_ports() helper · 465c3de4
      Vladimir Oltean authored
      Similar to dsa_user_ports() which retrieves a port mask of all user
      ports, introduce dsa_cpu_ports() which retrieves the mask of all CPU
      ports of a switch.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
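The idea of a helper that returns a port mask for one kind of port can be sketched as below. This is a self-contained illustration and not the kernel implementation: `enum port_type`, `struct sw_port`, `ports_of_type()`, `user_ports()` and `cpu_ports()` are hypothetical names, and the 32-port limit is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical port model; the kernel uses its own DSA port structures. */
enum port_type { PORT_UNUSED, PORT_USER, PORT_CPU, PORT_DSA };

struct sw_port {
	unsigned int index;
	enum port_type type;
};

/* Build a bitmask with bit N set for every port N of the given type. */
uint32_t ports_of_type(const struct sw_port *ports, size_t n,
		       enum port_type type)
{
	uint32_t mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Ports that do not fit in the 32-bit mask are skipped. */
		if (ports[i].type == type && ports[i].index < 32)
			mask |= UINT32_C(1) << ports[i].index;
	}
	return mask;
}

uint32_t user_ports(const struct sw_port *ports, size_t n)
{
	return ports_of_type(ports, n, PORT_USER);
}

uint32_t cpu_ports(const struct sw_port *ports, size_t n)
{
	return ports_of_type(ports, n, PORT_CPU);
}
```

For a switch with user ports 0 to 3 and CPU ports 4 and 5, `cpu_ports()` yields 0x30 and `user_ports()` yields 0x0f.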
    • net: dsa: felix: bring the NPI port indirection for host flooding to surface · 910ee6cc
      Vladimir Oltean authored
      For symmetry with host FDBs and MDBs where the indirection is now
      handled outside the ocelot switch lib, do the same for bridge port
      flags (unicast/multicast/broadcast flooding).
      
      The only caller of the ocelot switch lib which uses the NPI port is the
      Felix DSA driver.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: felix: bring the NPI port indirection for host MDBs to surface · 0ddf83cd
      Vladimir Oltean authored
      For symmetry with host FDBs where the indirection is now handled outside
      the ocelot switch lib, do the same for host MDB entries. The only caller
      of the ocelot switch lib which uses the NPI port is the Felix DSA driver.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: felix: program host FDB entries towards PGID_CPU for tag_8021q too · e9b3ba43
      Vladimir Oltean authored
      I remembered why we had the host FDB migration procedure in place.
      
      It is true that host FDB entry migration can be done by changing the
      value of PGID_CPU, but the problem is that only host FDB entries learned
      while operating in NPI mode go to PGID_CPU. When the CPU port operates
      in tag_8021q mode, the FDB entries are learned towards the unicast PGID
      equal to the physical port number of this CPU port, bypassing the
      PGID_CPU indirection.
      
      So host FDB entries learned in tag_8021q mode are not migrated any
      longer towards the NPI port.
      
      Fix this by extracting the NPI port -> PGID_CPU redirection from the
      ocelot switch lib, moving it to the Felix DSA driver, and applying it
      for any CPU port regardless of its kind (NPI or tag_8021q).
      
      Fixes: a51c1c3f ("net: dsa: felix: stop migrating FDBs back and forth on tag proto change")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: lan966x: Fix use of pointer after being freed · f0a65f81
      Horatiu Vultur authored
      Smatch reported the following warning:
      
      drivers/net/ethernet/microchip/lan966x/lan966x_fdma.c:736 lan966x_fdma_reload()
      warn: 'rx_dcbs' was already freed.
      
      This issue can happen when changing the MTU on one of the ports, if
      the RX buffers are allocated successfully but the TX buffer allocation
      then fails. In that case the RX buffers should not be restored. Fix
      this so that the RX buffers are not restored if the TX buffers fail
      to be allocated.
      
      Fixes: 2ea1cbac ("net: lan966x: Update FDMA to change MTU.")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
      Link: https://lore.kernel.org/r/20220511204059.2689199-1-horatiu.vultur@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: update the register_netdevice() kdoc · fa926bb3
      Jakub Kicinski authored
      The BUGS section looks quite dated, the registration
      is under rtnl lock. Remove some obvious information
      while at it.
      
      Link: https://lore.kernel.org/r/20220511190720.1401356-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • skbuff: replace a BUG_ON() with the new DEBUG_NET_WARN_ON_ONCE() · 0df65743
      Jakub Kicinski authored
      Very few drivers actually have Kconfig knobs for adding
      -DDEBUG. 8 according to a quick grep, while there are
      93 users of skb_checksum_none_assert(). Switch to the
      new DEBUG_NET_WARN_ON_ONCE() to catch bad skbs.
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220511172305.1382810-1-kuba@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxbf_gige: remove driver-managed interrupt counts · f4826443
      David Thompson authored
      The driver currently has three interrupt counters,
      which are incremented every time each interrupt handler
      executes.  These driver-managed counters are not
      necessary as the kernel already has logic that manages
      interrupt counts and exposes them via /proc/interrupts.
      This patch removes the driver-managed counters.
      Signed-off-by: David Thompson <davthompson@nvidia.com>
      Signed-off-by: Asmaa Mnebhi <asmaa@nvidia.com>
      Link: https://lore.kernel.org/r/20220511135251.2989-1-davthompson@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · 9b19e57a
      Jakub Kicinski authored
      No conflicts.
      
      Build issue in drivers/net/ethernet/sfc/ptp.c
        54fccfdd ("sfc: efx_default_channel_type APIs can be static")
        49e6123c ("net: sfc: fix memory leak due to ptp channel")
      https://lore.kernel.org/all/20220510130556.52598fe2@canb.auug.org.au/
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net · f3f19f93
      Linus Torvalds authored
      Pull networking fixes from Jakub Kicinski:
       "Including fixes from wireless, and bluetooth.
      
        No outstanding fires.
      
        Current release - regressions:
      
         - eth: atlantic: always deep reset on pm op, fix null-deref
      
        Current release - new code bugs:
      
         - rds: use maybe_get_net() when acquiring refcount on TCP sockets
           [refinement of a previous fix]
      
         - eth: ocelot: mark traps with a bool instead of guessing type based
           on list membership
      
        Previous releases - regressions:
      
         - net: fix skipping features in for_each_netdev_feature()
      
         - phy: micrel: fix null-derefs on suspend/resume and probe
      
         - bcmgenet: check for Wake-on-LAN interrupt probe deferral
      
        Previous releases - always broken:
      
         - ipv4: drop dst in multicast routing path, prevent leaks
      
         - ping: fix address binding wrt vrf
      
         - net: fix wrong network header length when BPF protocol translation
           is used on skbs with a fraglist
      
         - bluetooth: fix the creation of hdev->name
      
         - rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
      
         - wifi: iwlwifi: iwl-dbg: use del_timer_sync() before freeing
      
         - wifi: ath11k: reduce the wait time of 11d scan and hw scan while
           adding an interface
      
         - mac80211: fix rx reordering with non explicit / psmp ack policy
      
         - mac80211: reset MBSSID parameters upon connection
      
         - nl80211: fix races in nl80211_set_tx_bitrate_mask()
      
         - tls: fix context leak on tls_device_down
      
         - sched: act_pedit: really ensure the skb is writable
      
         - batman-adv: don't skb_split skbuffs with frag_list
      
         - eth: ocelot: fix various issues with TC actions (null-deref; bad
           stats; ineffective drops; ineffective filter removal)"
      
      * tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
        tls: Fix context leak on tls_device_down
        net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
        net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
        net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
        mlxsw: Avoid warning during ip6gre device removal
        net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
        net: ethernet: mediatek: ppe: fix wrong size passed to memset()
        Bluetooth: Fix the creation of hdev->name
        i40e: i40e_main: fix a missing check on list iterator
        net/sched: act_pedit: really ensure the skb is writable
        s390/lcs: fix variable dereferenced before check
        s390/ctcm: fix potential memory leak
        s390/ctcm: fix variable dereferenced before check
        net: atlantic: verify hw_head_ lies within TX buffer ring
        net: atlantic: add check for MAX_SKB_FRAGS
        net: atlantic: reduce scope of is_rsc_complete
        net: atlantic: fix "frag[0] not initialized"
        net: stmmac: fix missing pci_disable_device() on error in stmmac_pci_probe()
        net: phy: micrel: Fix incorrect variable type in micrel
        decnet: Use container_of() for struct dn_neigh casts
        ...
    • Merge branch 'for-5.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup · 0ac824f3
      Linus Torvalds authored
      Pull cgroup fix from Tejun Heo:
       "Waiman's fix for a cgroup2 cpuset bug where it could miss nodes which
        were hot-added"
      
      * 'for-5.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
        cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
    • Merge tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs · c37dba6a
      Linus Torvalds authored
      Pull fs fixes from Jan Kara:
       "Three fixes that I'd still like to get to 5.18:
      
         - add a missing sanity check in the fanotify FAN_RENAME feature
           (added in 5.17, let's fix it before it gets wider usage in
           userspace)
      
         - udf fix for recently introduced filesystem corruption issue
      
         - writeback fix for a race in inode list handling that can lead to
           delayed writeback and possible dirty throttling stalls"
      
      * tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
        udf: Avoid using stale lengthOfImpUse
        writeback: Avoid skipping inode writeback
        fanotify: do not allow setting dirent events in mask of non-dir
    • tls: Fix context leak on tls_device_down · 3740651b
      Maxim Mikityanskiy authored
      The commit cited below claims to fix a use-after-free condition after
      tls_device_down. Apparently, the description wasn't fully accurate. The
      context stayed alive, but ctx->netdev became NULL, and the offload was
      torn down without a proper fallback, so a bug was present, but a
      different kind of bug.
      
      Due to misunderstanding of the issue, the original patch dropped the
      refcount_dec_and_test line for the context to avoid the alleged
      premature deallocation. That line has to be restored, because it matches
      the refcount_inc_not_zero from the same function, otherwise the contexts
      that survived tls_device_down are leaked.
      
      This patch fixes the described issue by restoring refcount_dec_and_test.
      After this change, there is no leak anymore, and the fallback to
      software kTLS still works.
      
      Fixes: c55dcdd4 ("net/tls: Fix use-after-free after the TLS device goes down and up")
      Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
      Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
      Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
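The pairing this commit restores (a temporary reference taken with an "increment unless zero" operation must be dropped again with "decrement and test" in the same function) can be sketched with C11 atomics. This is an illustration only: `struct ctx`, `ref_inc_not_zero()`, `ref_dec_and_test()`, `ctx_put()` and `device_down()` are hypothetical stand-ins, not the kernel's refcount API or the TLS code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical context object; "released" stands in for actual freeing. */
struct ctx {
	atomic_int refcount;
	bool released;
};

/* Take a reference unless the count has already dropped to zero. */
bool ref_inc_not_zero(atomic_int *r)
{
	int old = atomic_load(r);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(r, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if it was the last one. */
bool ref_dec_and_test(atomic_int *r)
{
	return atomic_fetch_sub(r, 1) == 1;
}

void ctx_put(struct ctx *c)
{
	if (ref_dec_and_test(&c->refcount))
		c->released = true;
}

/* Device-down handler: the temporary reference is taken and dropped here. */
void device_down(struct ctx *c)
{
	if (!ref_inc_not_zero(&c->refcount))
		return; /* the context is already going away */

	/* ... switch the context over to the software fallback ... */

	/* Without this put, the count can never reach zero again. */
	ctx_put(c);
}
```

If the final `ctx_put()` in `device_down()` is omitted, every context that survives the handler keeps one extra reference and is never released, which is the leak described above.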
    • net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe() · 1fa89ffb
      Taehee Yoo authored
      In the NIC ->probe() callback, the ->mtd_probe() callback is called.
      If the NIC has 2 ports, ->probe() is called twice, and so is
      ->mtd_probe(). ->mtd_probe(), which is efx_ef10_mtd_probe(),
      allocates and initializes the mtd partition.
      But the mtd partition for sfc is shared data, so the mtd partition
      data allocated by the last call to efx_ef10_mtd_probe() will not be
      used and must therefore be freed.
      However, efx_ef10_mtd_probe() does not free the unused mtd partition
      data.
      
      kmemleak reports:
      unreferenced object 0xffff88811ddb0000 (size 63168):
        comm "systemd-udevd", pid 265, jiffies 4294681048 (age 348.586s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffffa3767749>] kmalloc_order_trace+0x19/0x120
          [<ffffffffa3873f0e>] __kmalloc+0x20e/0x250
          [<ffffffffc041389f>] efx_ef10_mtd_probe+0x11f/0x270 [sfc]
          [<ffffffffc0484c8a>] efx_pci_probe.cold.17+0x3df/0x53d [sfc]
          [<ffffffffa414192c>] local_pci_probe+0xdc/0x170
          [<ffffffffa4145df5>] pci_device_probe+0x235/0x680
          [<ffffffffa443dd52>] really_probe+0x1c2/0x8f0
          [<ffffffffa443e72b>] __driver_probe_device+0x2ab/0x460
          [<ffffffffa443e92a>] driver_probe_device+0x4a/0x120
          [<ffffffffa443f2ae>] __driver_attach+0x16e/0x320
          [<ffffffffa4437a90>] bus_for_each_dev+0x110/0x190
          [<ffffffffa443b75e>] bus_add_driver+0x39e/0x560
          [<ffffffffa4440b1e>] driver_register+0x18e/0x310
          [<ffffffffc02e2055>] 0xffffffffc02e2055
          [<ffffffffa3001af3>] do_one_initcall+0xc3/0x450
          [<ffffffffa33ca574>] do_init_module+0x1b4/0x700
      Acked-by: Martin Habets <habetsm.xilinx@gmail.com>
      Fixes: 8127d661 ("sfc: Add support for Solarflare SFC9100 family")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Link: https://lore.kernel.org/r/20220512054709.12513-1-ap420073@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending · f3c46e41
      Guangguan Wang authored
      Non-blocking sendmsg() returns -EAGAIN when a signal is pending and
      no send space is left, while non-blocking recvmsg() returns -EINTR
      when a signal is pending and no data has been received. This may be
      confusing, as TCP returns -EAGAIN in the conditions described above.
      Align the behavior of smc with TCP.
      
      Fixes: 846e344e ("net/smc: add receive timeout check")
      Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
      Reviewed-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Karsten Graul <kgraul@linux.ibm.com>
      Link: https://lore.kernel.org/r/20220512030820.73848-1-guangguan.wang@linux.alibaba.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down() · b7be130c
      Florian Fainelli authored
      After commit 2d1f90f9 ("net: dsa/bcm_sf2: fix incorrect usage of
      state->link") the interface suspend path would call our mac_link_down()
      callback which would forcibly set the link down, thus preventing
      Wake-on-LAN packets from reaching our management port.
      
      Fix this by looking at whether the port is enabled for Wake-on-LAN and
      not clearing the link status in that case to let packets go through.
      
      Fixes: 2d1f90f9 ("net: dsa/bcm_sf2: fix incorrect usage of state->link")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Link: https://lore.kernel.org/r/20220512021731.2494261-1-f.fainelli@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • mlxsw: Avoid warning during ip6gre device removal · 810c2f0a
      Amit Cohen authored
      IPv6 addresses which are used for tunnels are stored in a hash table
      with reference counting. When a new GRE tunnel is configured, the driver
      is notified and configures it in hardware.
      
      Currently, any change in the tunnel is not applied in the driver. It
      means that if the remote address is changed, the driver is not aware of
      this change and the first address will be used.
      
      This behavior results in a warning [1] in scenarios such as the
      following:
      
       # ip link add name gre1 type ip6gre local 2000::3 remote 2000::fffe tos inherit ttl inherit
       # ip link set name gre1 type ip6gre local 2000::3 remote 2000::ffff ttl inherit
       # ip link delete gre1
      
      The change of the address is not applied in the driver. Currently, the
      driver uses the remote address which is stored in the 'parms' of the
      overlay device. When the tunnel is removed, the new IPv6 address is
      used: the driver tries to release it but, as it is not aware of the
      change, this address is not configured, and it warns about releasing
      a non-existent IPv6 address.
      
      Fix it by using the IPv6 address which is cached in the IPIP entry.
      This address is the last one that the driver used, so even in cases
      such as the above, the first address will be released without any
      warning.
      
      [1]:
      
      WARNING: CPU: 1 PID: 2197 at drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2920 mlxsw_sp_ipv6_addr_put+0x146/0x220 [mlxsw_spectrum]
      ...
      CPU: 1 PID: 2197 Comm: ip Not tainted 5.17.0-rc8-custom-95062-gc1e5ded51a9a #84
      Hardware name: Mellanox Technologies Ltd. MSN4700/VMOD0010, BIOS 5.11 07/12/2021
      RIP: 0010:mlxsw_sp_ipv6_addr_put+0x146/0x220 [mlxsw_spectrum]
      ...
      Call Trace:
       <TASK>
       mlxsw_sp2_ipip_rem_addr_unset_gre6+0xf1/0x120 [mlxsw_spectrum]
       mlxsw_sp_netdevice_ipip_ol_event+0xdb/0x640 [mlxsw_spectrum]
       mlxsw_sp_netdevice_event+0xc4/0x850 [mlxsw_spectrum]
       raw_notifier_call_chain+0x3c/0x50
       call_netdevice_notifiers_info+0x2f/0x80
       unregister_netdevice_many+0x311/0x6d0
       rtnl_dellink+0x136/0x360
       rtnetlink_rcv_msg+0x12f/0x380
       netlink_rcv_skb+0x49/0xf0
       netlink_unicast+0x233/0x340
       netlink_sendmsg+0x202/0x440
       ____sys_sendmsg+0x1f3/0x220
       ___sys_sendmsg+0x70/0xb0
       __sys_sendmsg+0x54/0xa0
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: e846efe2 ("mlxsw: spectrum: Add hash table for IPv6 address mapping")
      Reported-by: Maksym Yaremchuk <maksymy@nvidia.com>
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20220511115747.238602-1-idosch@nvidia.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Merge branch 'nfp-vf-rate-limit-support' · b33177f1
      Paolo Abeni authored
      Simon Horman says:
      
      ====================
      nfp: VF rate limit support
      
      This short series adds VF rate limiting to the NFP driver.
      
      The first patch, as suggested by Jakub Kicinski, adds a helper
      to check that ndo_set_vf_rate() rate parameters are sane.
      It also provides a place for further parameter checking to live,
      if needed in future.
      
      The second patch adds VF rate limit support to the NFP driver.
      It addresses several comments made on v1, including removing
      the parameter check that is now provided by the helper added
      in the first patch.
      ====================
      
      Link: https://lore.kernel.org/r/20220511113932.92114-1-simon.horman@corigine.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • nfp: VF rate limit support · e0d0e1fd
      Bin Chen authored
      Add VF rate limit feature
      
      This patch enhances the NFP driver to support assignment of
      both max_tx_rate and min_tx_rate to VFs.
      
      All of the configuration templates below are supported,
      e.g.:
       # ip link set $DEV vf $VF_NUM max_tx_rate $RATE_VALUE
       # ip link set $DEV vf $VF_NUM min_tx_rate $RATE_VALUE
       # ip link set $DEV vf $VF_NUM max_tx_rate $RATE_VALUE \
      			       min_tx_rate $RATE_VALUE
       # ip link set $DEV vf $VF_NUM min_tx_rate $RATE_VALUE \
      			       max_tx_rate $RATE_VALUE
      
      The max RATE_VALUE is limited to 0xFFFF, which is about
      63Gbps (using 1024 for 1G).
      Signed-off-by: Bin Chen <bin.chen@corigine.com>
      Signed-off-by: Louis Peens <louis.peens@corigine.com>
      Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
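The "about 63Gbps" figure follows from simple arithmetic, sketched below under the assumption that RATE_VALUE is expressed in Mbit/s (the unit ip-link documents for VF rates). The macro and function names are hypothetical and do not come from the NFP driver.

```c
/* Limit quoted in the commit message; the unit (Mbit/s) is an assumption. */
#define RATE_MAX_MBPS 0xFFFFu
/* The commit message counts 1024 Mbit/s per Gbit/s. */
#define MBPS_PER_GBPS 1024u

/* Whole Gbit/s contained in a rate given in Mbit/s (rounds down). */
unsigned int rate_whole_gbps(unsigned int mbps)
{
	return mbps / MBPS_PER_GBPS;
}

/* Non-zero if the rate fits in the 16-bit limit. */
int rate_fits(unsigned int mbps)
{
	return mbps <= RATE_MAX_MBPS;
}
```

0xFFFF is 65535, and 65535 / 1024 rounds down to 63, so the largest accepted value sits just under 64 whole units of 1024 Mbit/s.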
    • rtnetlink: verify rate parameters for calls to ndo_set_vf_rate · a14857c2
      Bin Chen authored
      When calling ndo_set_vf_rate() the max_tx_rate parameter may be zero,
      in which case the setting is cleared, or it must be greater than or
      equal to min_tx_rate.
      
      Enforce this requirement on all calls to ndo_set_vf_rate via a wrapper
      which also only calls ndo_set_vf_rate() if defined by the driver.
      
      Based on work by Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Bin Chen <bin.chen@corigine.com>
      Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com>
      Signed-off-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
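The rule stated in this commit (max_tx_rate is either zero or at least min_tx_rate, and the driver callback is only called if it exists) can be sketched as a small wrapper. This is a user-space illustration, not the rtnetlink code: `struct vf_dev`, `set_vf_rate_checked()` and `record_rate()` are hypothetical, and the specific error codes are assumptions of the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical device with an optional rate-setting callback. */
struct vf_dev {
	int (*set_vf_rate)(struct vf_dev *dev, int vf,
			   int min_tx_rate, int max_tx_rate);
	int last_min; /* recorded by the demo callback below */
	int last_max;
};

/* Validate the rates, then call the driver only if it has the callback. */
int set_vf_rate_checked(struct vf_dev *dev, int vf,
			int min_tx_rate, int max_tx_rate)
{
	if (!dev->set_vf_rate)
		return -EOPNOTSUPP; /* error code is an assumption */

	/* max_tx_rate == 0 clears the setting; otherwise it must be >= min. */
	if (max_tx_rate && max_tx_rate < min_tx_rate)
		return -EINVAL; /* error code is an assumption */

	return dev->set_vf_rate(dev, vf, min_tx_rate, max_tx_rate);
}

/* Demo callback: remembers the rates it was asked to program. */
int record_rate(struct vf_dev *dev, int vf, int min_tx_rate, int max_tx_rate)
{
	(void)vf;
	dev->last_min = min_tx_rate;
	dev->last_max = max_tx_rate;
	return 0;
}
```

Centralizing the check this way means individual drivers no longer need their own copy of the comparison, which matches the motivation given in the series cover letter.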
    • Colin Ian King's avatar
      982c97ee
    • net: enetc: kill PHY-less mode for PFs · 0f84d403
      Vladimir Oltean authored
      Right now, a PHY-less port (no phy-mode, no fixed-link, no phy-handle)
      doesn't register with phylink, but calls netif_carrier_on() from
      enetc_start().
      
      This makes sense for a VF, but for a PF, this is braindead, because we
      never call enetc_mac_enable() so the MAC is left inoperational.
      Furthermore, commit 71b77a7a ("enetc: Migrate to PHYLINK and
      PCS_LYNX") put the nail in the coffin because it removed the initial
      netif_carrier_off() call done right after register_netdev().
      
      Without that call, netif_carrier_on() does not call
      linkwatch_fire_event(), so the operstate remains IF_OPER_UNKNOWN.
      
      Just deny the broken configuration by requiring that a phy-mode is
      present, and always register a PF with phylink.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Link: https://lore.kernel.org/r/20220511094200.558502-1-vladimir.oltean@nxp.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    • Kees Cook's avatar
      fortify: Provide a memcpy trap door for sharp corners · 43213dae
      Kees Cook authored
      As we continue to narrow the scope of what the FORTIFY memcpy() will
      accept and build alternative APIs that give the compiler appropriate
      visibility into more complex memcpy scenarios, there is a need for
      "unfortified" memcpy use in rare cases. These are cases where
      combinations of compiler behaviors, source code layout, etc., mean
      that the stricter memcpy checks need to be bypassed until appropriate
      solutions can be developed (e.g. compiler bug fixes, code refactoring,
      a new API). The intention is for this to be used only if there's no
      other reasonable solution, for its use to include a justification that
      can be used to assess future solutions, and for it to be temporary.
      
      Example usage included, based on analysis and discussion from:
      https://lore.kernel.org/netdev/CANn89iLS_2cshtuXPyNUGDPaic=sJiYfvTb_wNLgWrZRyBxZ_g@mail.gmail.com
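
A userspace sketch of what such a trap door can look like. The macro name `unsafe_memcpy` and its four-argument shape (with a trailing justification) are assumed here, since this listing does not show the patch itself. `struct hdr_with_payload` and `fill_packet()` are hypothetical, and plain `memcpy()` stands in for the unchecked kernel copy.

```c
#include <string.h>

/*
 * Stand-in for the trap door: forwards to memcpy() and discards the
 * justification, which exists only for human reviewers.
 */
#define unsafe_memcpy(dst, src, bytes, justification) \
	memcpy(dst, src, bytes)

/* Hypothetical layout: two adjacent byte arrays with no padding. */
struct hdr_with_payload {
	unsigned char hdr[4];
	unsigned char payload[8];
};

/* Fill both members with a single copy that starts at the first one. */
static void fill_packet(struct hdr_with_payload *p, const unsigned char *src)
{
	unsafe_memcpy(p->hdr, src, sizeof(p->hdr) + sizeof(p->payload),
		      /* Intentional cross-member write; members are contiguous. */);
}
```

A member-bounded memcpy() check would reject this copy because it runs past `hdr`. The comment passed as the last argument records why the bypass is acceptable, so the call site can be revisited once a better API exists.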
      
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Coco Li <lixiaoyan@google.com>
      Cc: Tariq Toukan <tariqt@nvidia.com>
      Cc: Saeed Mahameed <saeedm@nvidia.com>
      Cc: Leon Romanovsky <leon@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-hardening@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220511025301.3636666-1-keescook@chromium.org
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      43213dae
    • Florian Fainelli's avatar
      net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral · 6b77c066
      Florian Fainelli authored
      The interrupt controller supplying the Wake-on-LAN interrupt line may be
      modular on some platforms (irq-bcm7038-l1.c) and might be probed at a
      later time than the GENET driver. We need to specifically check for
      -EPROBE_DEFER and propagate that error to ensure that we eventually
      fetch the interrupt descriptor.
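
The deferral check can be sketched in userspace as follows. `struct demo_priv` and `demo_setup_wol_irq()` are hypothetical, the IRQ lookup is reduced to an integer argument, and treating every other error as "no Wake-on-LAN IRQ" is an assumption of this sketch.

```c
#include <errno.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517	/* kernel-internal errno, absent from userspace headers */
#endif

/* Hypothetical probe state. */
struct demo_priv {
	int wol_irq;	/* > 0 when a Wake-on-LAN IRQ is available, else 0 */
};

/*
 * irq is the result of the (stubbed) optional IRQ lookup: a positive IRQ
 * number, -EPROBE_DEFER, or another negative errno.
 */
static int demo_setup_wol_irq(struct demo_priv *priv, int irq)
{
	if (irq == -EPROBE_DEFER)
		return irq;	/* propagate so the whole probe is retried later */

	/* Assumption: any other failure just means "no WoL IRQ". */
	priv->wol_irq = irq > 0 ? irq : 0;
	return 0;
}
```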
      
      Fixes: 9deb48b5 ("bcmgenet: add WOL IRQ check")
      Fixes: 5b1f0e62 ("net: bcmgenet: Avoid touching non-existent interrupt")
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Reviewed-by: Stefan Wahren <stefan.wahren@i2se.com>
      Link: https://lore.kernel.org/r/20220511031752.2245566-1-f.fainelli@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      6b77c066
    • Yang Yingliang's avatar
      net: ethernet: mediatek: ppe: fix wrong size passed to memset() · 00832b1d
      Yang Yingliang authored
      'foe_table' is a pointer, so the real size of struct mtk_foe_entry
      should be passed to memset().
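
The class of bug can be illustrated with a short sketch. `struct demo_entry` and `demo_clear_table()` are hypothetical and do not reproduce the driver's code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry. */
struct demo_entry {
	unsigned int words[16];
};

/* Zero a table of n entries that is reached through a pointer. */
static void demo_clear_table(struct demo_entry *table, size_t n)
{
	/*
	 * sizeof(table) would be the size of the pointer itself;
	 * sizeof(*table) is the size of one entry.
	 */
	memset(table, 0, n * sizeof(*table));
}
```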
      
      Fixes: ba37b7ca ("net: ethernet: mtk_eth_soc: add support for initializing the PPE")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Acked-by: Felix Fietkau <nbd@nbd.name>
      Link: https://lore.kernel.org/r/20220511030829.3308094-1-yangyingliang@huawei.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      00832b1d
    • Jakub Kicinski's avatar
      Merge tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth · a48ab883
      Jakub Kicinski authored
      Luiz Augusto von Dentz says:
      
      ====================
      bluetooth pull request for net:
      
       - Fix the creation of hdev->name when index is greater than 9999
      
      * tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
        Bluetooth: Fix the creation of hdev->name
      ====================
      
      Link: https://lore.kernel.org/r/20220512002901.823647-1-luiz.dentz@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      a48ab883
    • Jakub Kicinski's avatar
      Merge tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless · 8bf6008c
      Jakub Kicinski authored
      Kalle Valo says:
      
      ====================
      wireless fixes for v5.18
      
      Second set of fixes for v5.18 and hopefully the last one. We have a
      new iwlwifi maintainer, a fix to rfkill ioctl interface and important
      fixes to both stack and two drivers.
      
      * tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
        rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
        nl80211: fix locking in nl80211_set_tx_bitrate_mask()
        mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
        mac80211_hwsim: fix RCU protected chanctx access
        mailmap: update Kalle Valo's email
        mac80211: Reset MBSSID parameters upon connection
        cfg80211: retrieve S1G operating channel number
        nl80211: validate S1G channel width
        mac80211: fix rx reordering with non explicit / psmp ack policy
        ath11k: reduce the wait time of 11d scan and hw scan while add interface
        MAINTAINERS: update iwlwifi driver maintainer
        iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
      ====================
      
      Link: https://lore.kernel.org/r/20220511154535.A1A12C340EE@smtp.kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8bf6008c
    • Itay Iellin's avatar
      Bluetooth: Fix the creation of hdev->name · 103a2f32
      Itay Iellin authored
      Limit the buffer written to "hdev->name" to 8 bytes, including the
      terminating null byte, as the size of "hdev->name" is 8 bytes. If an
      id value greater than 9999 is allocated, the
      "snprintf(hdev->name, sizeof(hdev->name), "hci%d", id)" function call
      leads to a truncation of the id value in decimal notation.
      
      Set an explicit maximum id parameter in the id allocation function call.
      The id allocation function defines the maximum allocated id value as the
      maximum id parameter value minus one. Therefore, HCI_MAX_ID is defined
      as 10000.
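
The truncation is easy to reproduce in userspace. In this sketch `DEMO_NAME_LEN` and `demo_format_name()` are hypothetical; only the 8-byte buffer size and the "hci%d" format come from the text above.

```c
#include <stdio.h>

#define DEMO_NAME_LEN 8	/* 8-byte name buffer, including the terminating NUL */

/* Returns 1 if "hci<id>" fit in the buffer, 0 if snprintf() truncated it. */
static int demo_format_name(char name[DEMO_NAME_LEN], int id)
{
	int n = snprintf(name, DEMO_NAME_LEN, "hci%d", id);

	/* snprintf() returns the length it would have written without the limit. */
	return n >= 0 && n < DEMO_NAME_LEN;
}
```

With id 9999 the buffer holds "hci9999" (7 characters plus the NUL). With id 10000 the name is cut to "hci1000", which collides with the name of device 1000.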
      Signed-off-by: Itay Iellin <ieitayie@gmail.com>
      Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
      103a2f32
  2. 11 May, 2022 8 commits
    • Jakub Kicinski's avatar
      Merge branch 'count-tc-taprio-window-drops-in-enetc-driver' · bb709987
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Count tc-taprio window drops in enetc driver
      
      This series includes a patch from Po Liu (no longer with NXP) which
      counts frames dropped by the tc-taprio offload in ethtool -S and in
      ndo_get_stats64. It also contains a preparation patch from myself.
      ====================
      
      Link: https://lore.kernel.org/r/20220510163615.6096-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      bb709987
    • Po Liu's avatar
      net: enetc: count the tc-taprio window drops · 285e8ded
      Po Liu authored
      The enetc scheduler for IEEE 802.1Qbv has 2 options (depending on
      PTGCR[TG_DROP_DISABLE]) when we attempt to send an oversized packet
      which will never fit in its allotted time slot for its traffic class:
      either block the entire port due to head-of-line blocking, or drop the
      packet and set a bit in the writeback format of the transmit buffer
      descriptor, allowing other packets to be sent.
      
      We obviously choose the second option in the driver, but we do not
      detect the drop condition, so from the perspective of the network stack,
      the packet is sent and no error counter is incremented.
      
      This change checks the writeback of the TX BD when tc-taprio is enabled,
      and increments a specific ethtool statistics counter and a generic
      "tx_dropped" counter in ndo_get_stats64.
      Signed-off-by: Po Liu <Po.Liu@nxp.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      285e8ded
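
The window-drop accounting described in commit 285e8ded can be sketched as below. The flag bit `DEMO_TXBD_WIN_DROP`, `struct demo_tx_stats` and `demo_tx_complete()` are hypothetical. Not counting a dropped frame as a transmitted packet is an assumption of this sketch, not something the commit message states.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TXBD_WIN_DROP	(1u << 0)	/* hypothetical writeback flag bit */

struct demo_tx_stats {
	uint64_t tx_packets;
	uint64_t tx_dropped;	/* generic drop counter */
	uint64_t win_drop;	/* tc-taprio specific counter */
};

/* Account for one completed TX buffer descriptor. */
static void demo_tx_complete(struct demo_tx_stats *s, uint32_t wb_flags,
			     bool taprio_enabled)
{
	/* The drop bit is only meaningful while tc-taprio is enabled. */
	if (taprio_enabled && (wb_flags & DEMO_TXBD_WIN_DROP)) {
		s->win_drop++;
		s->tx_dropped++;
		return;
	}
	s->tx_packets++;
}
```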
    • Vladimir Oltean's avatar
      net: enetc: manage ENETC_F_QBV in priv->active_offloads only when enabled · 32bf8e1f
      Vladimir Oltean authored
      Future work in this driver would like to look at priv->active_offloads &
      ENETC_F_QBV to determine whether a tc-taprio qdisc offload was
      installed, but this does not produce the intended effect.
      
      All the other flags in priv->active_offloads are managed dynamically,
      except ENETC_F_QBV which is set statically based on the probed SI capability.
      
      This change makes priv->active_offloads & ENETC_F_QBV really track the
      presence of a tc-taprio schedule on the port.
      
      Some existing users, like the enetc_sched_speed_set() call from
      phylink_mac_link_up(), are best kept using the old logic: the tc-taprio
      offload does not re-trigger another link mode resolve, so the scheduler
      needs to be functional from the get go, as long as Qbv is supported at
      all on the port. So to preserve functionality there, look at the static
      station interface capability from pf->si->hw_features instead.
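
A small sketch of the distinction between a static capability and a dynamically tracked offload. `DEMO_F_QBV`, `struct demo_port` and the helper functions are hypothetical names, and the -1 error return is an assumption.

```c
#include <stdbool.h>

#define DEMO_F_QBV	(1u << 0)	/* hypothetical feature bit */

struct demo_port {
	unsigned int hw_features;	/* static: probed capability */
	unsigned int active_offloads;	/* dynamic: what is installed right now */
};

/* Install or remove a schedule; the flag follows the schedule's presence. */
static int demo_taprio_set(struct demo_port *p, bool enable)
{
	if (!(p->hw_features & DEMO_F_QBV))
		return -1;	/* not supported on this port */
	if (enable)
		p->active_offloads |= DEMO_F_QBV;
	else
		p->active_offloads &= ~DEMO_F_QBV;
	return 0;
}

/* Callers that need "supported at all" look at the static capability. */
static bool demo_qbv_supported(const struct demo_port *p)
{
	return (p->hw_features & DEMO_F_QBV) != 0;
}

/* Callers that need "a schedule is installed" look at the dynamic flag. */
static bool demo_qbv_active(const struct demo_port *p)
{
	return (p->active_offloads & DEMO_F_QBV) != 0;
}
```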
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      32bf8e1f
    • Jakub Kicinski's avatar
      Merge branch 'macb-napi-improvements' · d7722973
      Jakub Kicinski authored
      Robert Hancock says:
      
      ====================
      MACB NAPI improvements
      
      Simplify the logic in the Cadence MACB/GEM driver for determining
      when to reschedule NAPI processing, and update it to use NAPI for the
      TX path as well as the RX path.
      
      Changes since v1: Changed to use separate TX and RX NAPI instances and
      poll functions to avoid unnecessary checks of the other ring (TX/RX)
      states during polling and to use budget handling for both RX and TX.
      Fixed locking to protect against concurrent access to TX ring on
      TX transmit and TX poll paths.
      ====================
      
      Link: https://lore.kernel.org/r/20220509194635.3094080-1-robert.hancock@calian.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      d7722973
    • Robert Hancock's avatar
      net: macb: use NAPI for TX completion path · 138badbc
      Robert Hancock authored
      This driver was using the TX IRQ handler to perform all TX completion
      tasks. Under heavy TX network load, this can cause significant irqs-off
      latencies (found to be in the hundreds of microseconds using ftrace).
      This can cause other issues, such as overrunning serial UART FIFOs when
      using high baud rates with limited UART FIFO sizes.
      
      Switch to using a NAPI poll handler to perform the TX completion work
      to get this out of hard IRQ context and avoid the IRQ latency impact. A
      separate NAPI instance is used for TX and RX to avoid checking the other
      ring's state unnecessarily when doing the poll, and so that the NAPI
      budget handling can work for both TX and RX packets.
      
      A new per-queue tx_ptr_lock spinlock has been added to avoid using the
      main device lock (with IRQs needing to be disabled) across the entire TX
      mapping operation, and also to protect the TX queue pointers from
      concurrent access between the TX start and TX poll operations.
      
      The TX Used Bit Read interrupt (TXUBR) handling also needs to be moved into
      the TX NAPI poll handler to maintain the proper order of operations. A flag
      is used to notify the poll handler that a UBR condition needs to be
      handled. The macb_tx_restart handler has had some locking added for global
      register access, since this could now potentially happen concurrently on
      different queues.
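
The budget handling that a poll-based TX completion path relies on can be sketched as follows. `struct demo_tx_queue` and `demo_tx_poll()` are hypothetical, and the locking and TXUBR handling described above are deliberately left out.

```c
/* Hypothetical TX queue with free-running indices. */
struct demo_tx_queue {
	unsigned int head;	/* next slot the transmit path will fill */
	unsigned int tail;	/* next slot waiting for completion */
};

/*
 * Complete at most `budget` transmitted frames and return how many were
 * handled. A return value below the budget means the queue is drained.
 */
static int demo_tx_poll(struct demo_tx_queue *q, int budget)
{
	int done = 0;

	while (done < budget && q->tail != q->head) {
		q->tail++;	/* a real driver would unmap and free the buffer here */
		done++;
	}
	return done;
}
```

Capping the work per call is what moves the bulk of completion processing out of hard IRQ context without letting one busy queue monopolise the CPU.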
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      138badbc
    • Robert Hancock's avatar
      net: macb: simplify/cleanup NAPI reschedule checking · 1900e30d
      Robert Hancock authored
      Previously the macb_poll method was checking the RSR register after
      completing its RX receive work to see if additional packets had been
      received since IRQs were disabled, since this controller does not
      maintain the pending IRQ status across IRQ disable. It also had to
      double-check the register after re-enabling IRQs to detect if packets
      were received after the first check but before IRQs were enabled.
      
      Using the RSR register for this purpose is problematic since it reflects
      the global device state rather than the per-queue state, so if packets
      are being received on multiple queues it may end up retriggering receive
      on a queue where the packets did not actually arrive and not on the one
      where they did arrive. This will also cause problems with an upcoming
      change to use NAPI for the TX path where use of multiple queues is more
      likely.
      
      Add a macb_rx_pending function to check the RX ring to see if more
      packets have arrived in the queue, and use that to check if NAPI should
      be rescheduled rather than the RSR register. By doing this, we can just
      ignore the global RSR register entirely, and thus save some extra device
      register accesses at the same time.
      
      This also makes the previous first check for pending packets rather
      redundant, since it would be checking the RX ring state which was just
      checked in the receive work function. Therefore we can get rid of it and
      just check after enabling interrupts whether packets are already
      pending.
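
A per-queue pending check of the kind described above might look like this in a userspace sketch. The ring layout, the `DEMO_RX_USED` bit and the function names are hypothetical and do not reproduce the driver's descriptor format.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RING_SIZE	8u		/* must be a power of two */
#define DEMO_RX_USED	(1u << 0)	/* hypothetical "descriptor filled" bit */

struct demo_rx_queue {
	uint32_t desc[DEMO_RING_SIZE];
	unsigned int tail;	/* next descriptor the driver will process */
};

/* True if the next descriptor of THIS queue already holds a packet. */
static bool demo_rx_pending(const struct demo_rx_queue *q)
{
	return (q->desc[q->tail & (DEMO_RING_SIZE - 1)] & DEMO_RX_USED) != 0;
}

/* Consume one descriptor and hand it back to the (simulated) hardware. */
static void demo_rx_consume(struct demo_rx_queue *q)
{
	q->desc[q->tail & (DEMO_RING_SIZE - 1)] &= ~DEMO_RX_USED;
	q->tail++;
}
```

Because the check reads only the queue's own ring, a packet arriving on another queue cannot trigger a spurious reschedule. A poll routine would call `demo_rx_pending()` once, right after re-enabling interrupts.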
      Signed-off-by: Robert Hancock <robert.hancock@calian.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      1900e30d
    • Vladimir Oltean's avatar
      net: dsa: ocelot: accept 1000base-X for VSC9959 and VSC9953 · 11ecf341
      Vladimir Oltean authored
      Switches using the Lynx PCS driver support 1000base-X optical SFP
      modules. Accept this interface type on a port.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Link: https://lore.kernel.org/r/20220510164320.10313-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      11ecf341
    • Xiaomeng Tong's avatar
      i40e: i40e_main: fix a missing check on list iterator · 3f95a747
      Xiaomeng Tong authored
      The bug is here:
      	ret = i40e_add_macvlan_filter(hw, ch->seid, vdev->dev_addr, &aq_err);
      
      The list iterator 'ch' will point to a bogus position containing
      HEAD if the list is empty or no element is found. This case must
      be checked before any use of the iterator, otherwise it will
      lead to an invalid memory access.
      
      To fix this bug, use a new variable 'iter' as the list iterator,
      while using the original variable 'ch' as a dedicated pointer to
      point to the found element.
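
The iterator/result split can be sketched with a plain singly linked list. `struct demo_channel` and `demo_find_channel()` are hypothetical. A NULL-terminated list does not have the bogus-HEAD problem of the kernel's circular lists, so this only illustrates the pattern of keeping the result pointer NULL until a match is found.

```c
#include <stddef.h>

struct demo_channel {
	int id;
	struct demo_channel *next;
};

/* Return the matching element, or NULL if the list is empty or has no match. */
static struct demo_channel *demo_find_channel(struct demo_channel *head, int id)
{
	struct demo_channel *iter;	/* used only for walking the list */
	struct demo_channel *ch = NULL;	/* set only when a match is found */

	for (iter = head; iter; iter = iter->next) {
		if (iter->id == id) {
			ch = iter;
			break;
		}
	}
	return ch;
}
```

Callers then test the result against NULL before dereferencing it, instead of trusting the iterator's value after the loop.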
      
      Cc: stable@vger.kernel.org
      Fixes: 1d8d80b4 ("i40e: Add macvlan support on i40e")
      Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com>
      Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel)
      Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
      Link: https://lore.kernel.org/r/20220510204846.2166999-1-anthony.l.nguyen@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      3f95a747