1. 18 Aug, 2022 12 commits
    • Merge branch 'fixes-for-ocelot-driver-statistics' · 5b6a0729
      Jakub Kicinski authored
      Vladimir Oltean says:
      
      ====================
      Fixes for Ocelot driver statistics
      
      This series contains bug fixes for the ocelot drivers (both switchdev
      and DSA). Some concern the counters exposed to ethtool -S, and others to
      the counters exposed to ifconfig. I'm aware that the changes are fairly
      large, but I wanted to prioritize on a proper approach to addressing the
      issues rather than a quick hack.
      
      Some of the noticed problems:
      - bad register offsets for some counters
      - unhandled concurrency leading to corrupted counters
      - unhandled 32-bit wraparound of ifconfig counters
      
      The issues on the ocelot switchdev driver were noticed through code
      inspection, I do not have the hardware to test.
      
      This patch set necessarily converts ocelot->stats_lock from a mutex to a
      spinlock. I know this affects Colin Foster's development with the SPI
      controlled VSC7512. I have other changes prepared for net-next that
      convert this back into a mutex (along with other changes in this area).
      ====================
      
      Link: https://lore.kernel.org/r/20220816135352.1431497-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: report ndo_get_stats64 from the wraparound-resistant ocelot->stats · e780e319
      Vladimir Oltean authored
      Rather than reading the stats64 counters directly from the 32-bit
      hardware, it's better to rely on the output produced by the periodic
      ocelot_port_update_stats().
      
      It would be even better to call ocelot_port_update_stats() right from
      ocelot_get_stats64() to make sure we report the current values rather
      than the ones from 2 seconds ago. But we need to export
      ocelot_port_update_stats() from the switch lib towards the switchdev
      driver for that, and future work will largely undo that.
      
      There are more ocelot-based drivers waiting to be introduced, an example
      of which is the SPI-controlled VSC7512. In that driver's case, it will
      be impossible to call ocelot_port_update_stats() from ndo_get_stats64
      context, since the latter is atomic, and reading the stats over SPI is
      sleepable. So the compromise taken here, which will also hold going
      forward, is to report 64-bit counters to stats64, which are not 100% up
      to date.
      
      Fixes: a556c76a ("net: mscc: Add initial Ocelot switch support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: keep ocelot_stat_layout by reg address, not offset · d4c36765
      Vladimir Oltean authored
      With so many counter addresses recently discovered as being wrong, it is
      desirable to at least have a central database of information, rather
      than two: one through the SYS_COUNT_* registers (used for
      ndo_get_stats64), and the other through the offset field of struct
      ocelot_stat_layout elements (used for ethtool -S).
      
      The strategy will be to keep the SYS_COUNT_* definitions as the single
      source of truth, but for that we need to expand our current definitions
      to cover all registers. Then we need to convert the ocelot region
      creation logic, and stats worker, to the read semantics imposed by going
      through SYS_COUNT_* absolute register addresses, rather than offsets
      of 32-bit words relative to SYS_COUNT_RX_OCTETS (which should have been
      SYS_CNT, by the way).
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: make struct ocelot_stat_layout array indexable · 91904600
      Vladimir Oltean authored
      The ocelot counters are 32-bit and require periodic reading, every 2
      seconds, by ocelot_port_update_stats(), so that wraparounds are
      detected.
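The wraparound handling described above can be sketched in plain C. This is a minimal illustration, not the driver's code: `struct soft_counter` and `soft_counter_update()` are hypothetical names, and the sketch assumes the counter is polled often enough that it wraps at most once between two reads.

```c
#include <stdint.h>

/* Hypothetical software state: the 64-bit running total for one counter. */
struct soft_counter {
	uint64_t total;
};

/*
 * Fold a fresh 32-bit hardware reading into the 64-bit total.
 * The low 32 bits of total mirror the previous hardware value, so a
 * smaller new reading means the hardware counter wrapped since the
 * last poll (assumed to happen at most once per polling interval).
 */
static void soft_counter_update(struct soft_counter *c, uint32_t hw_val)
{
	uint32_t last = (uint32_t)c->total;

	if (hw_val < last)
		c->total += (uint64_t)1 << 32;	/* account for one wraparound */

	/* Keep the upper 32 bits, replace the lower 32 with the new reading. */
	c->total = (c->total & ~(uint64_t)UINT32_MAX) | hw_val;
}
```

If the polling interval were longer than the time the counter needs to wrap, a second wraparound would go unnoticed, which is why the read has to be periodic.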
      
      Currently, the counters reported by ocelot_get_stats64() come from the
      32-bit hardware counters directly, rather than from the 64-bit
      accumulated ocelot->stats, and this is a problem for their integrity.
      
      The strategy is to make ocelot_get_stats64() able to cherry-pick
      individual stats from ocelot->stats the way in which it currently reads
      them out from SYS_COUNT_* registers. But currently it can't, because
      ocelot->stats is an opaque u64 array that's used only to feed data into
      ethtool -S.
      
      To solve that problem, we need to make ocelot->stats indexable, and
      associate each element with an element of struct ocelot_stat_layout used
      by ethtool -S.
      
      This makes ocelot_stat_layout a fat (and possibly sparse) array, so we
      need to change the way in which we access it. We no longer need
      OCELOT_STAT_END as a sentinel, because we know the array's size
      (OCELOT_NUM_STATS). We just need to skip the array elements that were
      left unpopulated for the switch revision (ocelot, felix, seville).
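A sparse, enum-indexed layout array of the kind described above might look roughly like the following sketch. All `demo_*` and `DEMO_*` names are hypothetical, and marking an unpopulated slot with a NULL name is an assumption made for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stat indices; a real enum would be much larger. */
enum demo_stat {
	DEMO_STAT_RX_OCTETS,
	DEMO_STAT_RX_PAUSE,
	DEMO_STAT_TX_OCTETS,
	DEMO_NUM_STATS,		/* array size, replaces an end sentinel */
};

struct demo_stat_layout {
	uint32_t reg;		/* register address of the counter */
	const char *name;	/* NULL marks a slot left unpopulated */
};

/* This (made-up) switch revision does not populate RX_PAUSE. */
static const struct demo_stat_layout demo_layout[DEMO_NUM_STATS] = {
	[DEMO_STAT_RX_OCTETS] = { .reg = 0x00, .name = "rx_octets" },
	[DEMO_STAT_TX_OCTETS] = { .reg = 0x40, .name = "tx_octets" },
};

/* Walk the whole array by size and skip the unpopulated elements. */
static size_t demo_count_populated(const struct demo_stat_layout *layout)
{
	size_t i, n = 0;

	for (i = 0; i < DEMO_NUM_STATS; i++) {
		if (!layout[i].name)
			continue;
		n++;
	}
	return n;
}
```

Because every slot has a fixed index, a caller can pick out one specific counter directly, for example `demo_layout[DEMO_STAT_TX_OCTETS]`, instead of scanning for it.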
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: fix race between ndo_get_stats64 and ocelot_check_stats_work · 18d8e67d
      Vladimir Oltean authored
      The 2 methods can run concurrently, and one will change the window of
      counters (SYS_STAT_CFG_STAT_VIEW) that the other sees. The fix is
      similar to what commit 7fbf6795 ("net: mscc: ocelot: fix mutex lock
      error during ethtool stats read") has done for ethtool -S.
      
      Fixes: a556c76a ("net: mscc: Add initial Ocelot switch support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: turn stats_lock into a spinlock · 22d842e3
      Vladimir Oltean authored
      ocelot_get_stats64() currently runs unlocked and therefore may collide
      with ocelot_port_update_stats() which indirectly accesses the same
      counters. However, ocelot_get_stats64() runs in atomic context, and we
      cannot simply take the sleepable ocelot->stats_lock mutex. We need to
      convert it to an atomic spinlock first. Do that as a preparatory change.
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: fix address of SYS_COUNT_TX_AGING counter · 173ca866
      Vladimir Oltean authored
      This register, used as part of stats->tx_dropped in
      ocelot_get_stats64(), has a wrong address. At the address currently
      given, there is actually the c_tx_green_prio_6 counter.
      
      Fixes: a556c76a ("net: mscc: Add initial Ocelot switch support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: mscc: ocelot: fix incorrect ndo_get_stats64 packet counters · 5152de7b
      Vladimir Oltean authored
      Reading stats using the SYS_COUNT_* register definitions is only used by
      ocelot_get_stats64() from the ocelot switchdev driver, however,
      currently the bucket definitions are incorrect.
      
      Separately, on both RX and TX, we have the following problems:
      - a 256-1023 bucket which actually tracks the 256-511 packets
      - the 1024-1526 bucket actually tracks the 512-1023 packets
      - the 1527-max bucket actually tracks the 1024-1526 packets
      
      => nobody tracks the packets from the real 1527-max bucket
      
      Additionally, the RX_PAUSE, RX_CONTROL, RX_LONGS and RX_CLASSIFIED_DROPS
      all track the wrong thing. However this doesn't seem to have any
      consequence, since ocelot_get_stats64() doesn't use these.
      
      Even though this problem only manifests itself for the switchdev driver,
      we cannot split the fix for ocelot and for DSA, since it requires fixing
      the bucket definitions from enum ocelot_reg, which makes us necessarily
      adapt the structures from felix and seville as well.
      
      Fixes: 84705fc1 ("net: dsa: felix: introduce support for Seville VSC9953 switch")
      Fixes: 56051948 ("net: dsa: ocelot: add driver for Felix switch family")
      Fixes: a556c76a ("net: mscc: Add initial Ocelot switch support")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: felix: fix ethtool 256-511 and 512-1023 TX packet counters · 40d21c45
      Vladimir Oltean authored
      What the driver actually reports as 256-511 is in fact 512-1023, and the
      TX packets in the 256-511 bucket are not reported. Fix that.
      
      Fixes: 56051948 ("net: dsa: ocelot: add driver for Felix switch family")
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: don't warn in dsa_port_set_state_now() when driver doesn't support it · 211987f3
      Vladimir Oltean authored
      ds->ops->port_stp_state_set() is, like most DSA methods, optional, and
      if absent, the port is supposed to remain in the forwarding state (as
      standalone). Such is the case with the mv88e6060 driver, which does not
      offload the bridge layer. DSA warns that the STP state can't be changed
      to FORWARDING as part of dsa_port_enable_rt(), when in fact it should not.
      
      The error message is also not up to modern standards, so take the
      opportunity to make it more descriptive.
      
      Fixes: fd364541 ("net: dsa: change scope of STP state setter")
      Reported-by: Sergei Antonov <saproj@gmail.com>
      Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
      Reviewed-by: Sergei Antonov <saproj@gmail.com>
      Link: https://lore.kernel.org/r/20220816201445.1809483-1-vladimir.oltean@nxp.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: sja1105: fix buffer overflow in sja1105_setup_devlink_regions() · fd8e899c
      Rustam Subkhankulov authored
      If an error occurs in dsa_devlink_region_create(), then 'priv->regions'
      array will be accessed by negative index '-1'.
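The general shape of this kind of bug can be illustrated with a small error-unwinding loop. The sketch below is not the sja1105 code: the `demo_*` helpers and the `fail_at` parameter are hypothetical stand-ins used only to make the off-by-one visible.

```c
#include <stddef.h>

#define DEMO_NUM_REGIONS 4

/* Hypothetical bookkeeping: 1 while a region exists, 0 otherwise. */
static int demo_live[DEMO_NUM_REGIONS];

/* Hypothetical create helper that fails for one chosen index. */
static int demo_region_create(int i, int fail_at)
{
	if (i == fail_at)
		return -1;
	demo_live[i] = 1;
	return 0;
}

static void demo_region_destroy(int i)
{
	demo_live[i] = 0;
}

static int demo_setup_regions(int fail_at)
{
	int i;

	for (i = 0; i < DEMO_NUM_REGIONS; i++) {
		if (demo_region_create(i, fail_at) < 0) {
			/*
			 * Pre-decrement visits i-1 .. 0 and stops there.
			 * A loop written as `while (i-- >= 0)` would run one
			 * extra iteration with i == -1 and index the array
			 * out of bounds.
			 */
			while (--i >= 0)
				demo_region_destroy(i);
			return -1;
		}
	}
	return 0;
}
```

With the pre-decrement form, a failure on the very first element destroys nothing, since `--i` yields -1 before the condition is evaluated.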
      
      Found by Linux Verification Center (linuxtesting.org) with SVACE.
      Signed-off-by: Rustam Subkhankulov <subkhankulov@ispras.ru>
      Fixes: bf425b82 ("net: dsa: sja1105: expose static config as devlink region")
      Link: https://lore.kernel.org/r/20220817003845.389644-1-subkhankulov@ispras.ru
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf · bec13ba9
      Jakub Kicinski authored
      Florian Westphal says:
      
      ====================
      netfilter: conntrack and nf_tables bug fixes
      
      The following patchset contains netfilter fixes for net.
      
      Broken since 5.19:
        A few ancient connection tracking helpers assume TCP packets cannot
        exceed 64kb in size, but this isn't the case anymore with 5.19 when
        BIG TCP got merged, from myself.
      
      Regressions since 5.19:
        1. 'conntrack -E expect' won't display anything because nfnetlink failed
           to enable events for expectations, only for normal conntrack events.
      
        2. partially revert change that added resched calls to a function that can
           be in atomic context.  Both broken and fixed up by myself.
      
      Broken for several releases (up to original merge of nf_tables):
        Several fixes for nf_tables control plane, from Pablo.
        This fixes up resource leaks in error paths and adds more sanity
        checks for mutually exclusive attributes/flags.
      
      Kconfig:
        NF_CONNTRACK_PROCFS is very old and doesn't provide all info provided
        via ctnetlink, so it should not default to y. From Geert Uytterhoeven.
      
      Selftests:
        rework nft_flowtable.sh: it frequently indicated failure; the way it
        tried to detect an offload failure did not work reliably.
      
      * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
        testing: selftests: nft_flowtable.sh: rework test to detect offload failure
        testing: selftests: nft_flowtable.sh: use random netns names
        netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y
        netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified
        netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and NFT_SET_ELEM_INTERVAL_END
        netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags
        netfilter: nf_tables: validate NFTA_SET_ELEM_OBJREF based on NFT_SET_OBJECT flag
        netfilter: nf_tables: really skip inactive sets when allocating name
        netfilter: nfnetlink: re-enable conntrack expectation events
        netfilter: nf_tables: fix scheduling-while-atomic splat
        netfilter: nf_ct_irc: cap packet search space to 4k
        netfilter: nf_ct_ftp: prefer skb_linearize
        netfilter: nf_ct_h323: cap packet size at 64k
        netfilter: nf_ct_sane: remove pseudo skb linearization
        netfilter: nf_tables: possible module reference underflow in error path
        netfilter: nf_tables: disallow NFTA_SET_ELEM_KEY_END with NFT_SET_ELEM_INTERVAL_END flag
        netfilter: nf_tables: use READ_ONCE and WRITE_ONCE for shared generation id access
      ====================
      
      Link: https://lore.kernel.org/r/20220817140015.25843-1-fw@strlen.de
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 17 Aug, 2022 8 commits
    • net: Fix suspicious RCU usage in bpf_sk_reuseport_detach() · fc4aaf9f
      David Howells authored
      bpf_sk_reuseport_detach() calls __rcu_dereference_sk_user_data_with_flags()
      to obtain the value of sk->sk_user_data, but that function is only usable
      if the RCU read lock is held, and neither that function nor any of its
      callers hold it.
      
      Fix this by adding a new helper, __locked_read_sk_user_data_with_flags()
      that checks to see if sk->sk_callback_lock() is held and use that here
      instead.
      
      Alternatively, making __rcu_dereference_sk_user_data_with_flags() use
      rcu_dereference_checked() might suffice.
      
      Without this, the following warning can be occasionally observed:
      
      =============================
      WARNING: suspicious RCU usage
      6.0.0-rc1-build2+ #563 Not tainted
      -----------------------------
      include/net/sock.h:592 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      5 locks held by locktest/29873:
       #0: ffff88812734b550 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x77/0x121
       #1: ffff88812f5621b0 (sk_lock-AF_INET){+.+.}-{0:0}, at: tcp_close+0x1c/0x70
       #2: ffff88810312f5c8 (&h->lhash2[i].lock){+.+.}-{2:2}, at: inet_unhash+0x76/0x1c0
       #3: ffffffff83768bb8 (reuseport_lock){+...}-{2:2}, at: reuseport_detach_sock+0x18/0xdd
       #4: ffff88812f562438 (clock-AF_INET){++..}-{2:2}, at: bpf_sk_reuseport_detach+0x24/0xa4
      
      stack backtrace:
      CPU: 1 PID: 29873 Comm: locktest Not tainted 6.0.0-rc1-build2+ #563
      Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4c/0x5f
       bpf_sk_reuseport_detach+0x6d/0xa4
       reuseport_detach_sock+0x75/0xdd
       inet_unhash+0xa5/0x1c0
       tcp_set_state+0x169/0x20f
       ? lockdep_sock_is_held+0x3a/0x3a
       ? __lock_release.isra.0+0x13e/0x220
       ? reacquire_held_locks+0x1bb/0x1bb
       ? hlock_class+0x31/0x96
       ? mark_lock+0x9e/0x1af
       __tcp_close+0x50/0x4b6
       tcp_close+0x28/0x70
       inet_release+0x8e/0xa7
       __sock_release+0x95/0x121
       sock_close+0x14/0x17
       __fput+0x20f/0x36a
       task_work_run+0xa3/0xcc
       exit_to_user_mode_prepare+0x9c/0x14d
       syscall_exit_to_user_mode+0x18/0x44
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: cf8c1e96 ("net: refactor bpf_sk_reuseport_detach()")
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hawkins Jiawei <yin31149@gmail.com>
      Link: https://lore.kernel.org/r/166064248071.3502205.10036394558814861778.stgit@warthog.procyon.org.uk
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: dsa: microchip: ksz9477: fix fdb_dump last invalid entry · 36c0d935
      Arun Ramadoss authored
      In ksz9477_fdb_dump(), the driver reads the ALU control register and
      exits the timeout loop when either a valid entry is found or the search
      is complete. After exiting the loop, it reads the ALU entry and reports
      it to user space regardless of whether the entry is valid. This works
      as long as the entries are valid. But if the loop exited because the
      search is complete, reading the ALU table returns all ones, and that is
      reported to user space as well. As a result, 'bridge fdb show' gives
      ff:ff:ff:ff:ff:ff as the last entry for every port. To fix this, report
      the entry after exiting the loop only if it is valid.
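The control flow of the fix can be sketched with a simulated sequence of status words. The bit names and `demo_fdb_dump()` below are hypothetical and do not correspond to the real KSZ9477 register layout; the sketch only shows that an entry is counted when its status carries the valid bit, and that a status signalling the end of the search without the valid bit reports nothing.

```c
#include <stdint.h>

/* Hypothetical status bits, not the real register definitions. */
#define DEMO_ALU_VALID      (1u << 0)
#define DEMO_ALU_SEARCH_END (1u << 1)

/*
 * Walk a simulated sequence of status words, one per polling result.
 * Returns how many entries would be reported to user space.
 */
static int demo_fdb_dump(const uint32_t *status, int n)
{
	int reported = 0;

	for (int i = 0; i < n; i++) {
		/* Report only when the hardware marked the entry valid. */
		if (status[i] & DEMO_ALU_VALID)
			reported++;

		/* Search finished: stop without reading another entry. */
		if (status[i] & DEMO_ALU_SEARCH_END)
			break;
	}
	return reported;
}
```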
      
      Fixes: b987e98e ("dsa: add DSA switch driver for Microchip KSZ9477")
      Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
      Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
      Link: https://lore.kernel.org/r/20220816105516.18350-1-arun.ramadoss@microchip.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • testing: selftests: nft_flowtable.sh: rework test to detect offload failure · c8550b90
      Florian Westphal authored
      This test fails on current kernel releases because the flowtable path
      now calls dst_check from packet path and will then remove the offload.
      
      Test script has two purposes:
      1. check that file (random content) can be sent to other netns (and vv)
      2. check that the flow is offloaded (rather than handled by classic
         forwarding path).
      
      Since dst_check is in place, 2) fails because the nftables ruleset in
      router namespace 1 intentionally blocks traffic under the assumption
      that packets are not passed via classic path at all.
      
      Rework this: Instead of blocking traffic, create two named counters, one
      for original and one for reverse direction.
      
      The first three test cases are handled by classic forwarding path
      (path mtu discovery is disabled and packets exceed MTU).
      
      But all other tests enable PMTUD, so the originator and responder are
      expected to lower packet size and flowtable is expected to do the packet
      forwarding.
      
      For those tests, check that the packet counters (which are only
      incremented for packets that are passed up to classic forward path)
      are significantly lower than the file size transferred.
      
      I've tested that the counter-checks fail as expected when the 'flow add'
      statement is removed from the ruleset.
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · ed16d19c
      David S. Miller authored
      ====================
      Intel Wired LAN Driver Updates 2022-08-16)
      
      This series contains updates to i40e driver only.
      
      Przemyslaw fixes issue with checksum offload on VXLAN tunnels.
      
      Alan disables VSI for Tx timeout when all recovery methods have failed.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tls: rx: react to strparser initialization errors · 849f16bb
      Jakub Kicinski authored
      Even though the normal strparser's init function has a return
      value we got away with ignoring errors until now, as it only
      validates the parameters and we were passing correct parameters.
      
      tls_strp can fail to init on memory allocation errors, which
      syzbot duly induced and reported.
      
      Reported-by: syzbot+abd45eb849b05194b1b6@syzkaller.appspotmail.com
      Fixes: 84c61fe1 ("tls: rx: do not use the standard strparser")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • testing: selftests: nft_flowtable.sh: use random netns names · b71b7bfe
      Florian Westphal authored
      "ns1" is too generic a name; use a random suffix to avoid
      errors when such a netns exists.  This also allows running multiple
      instances of the script in parallel.
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • netfilter: conntrack: NF_CONNTRACK_PROCFS should no longer default to y · aa5762c3
      Geert Uytterhoeven authored
      NF_CONNTRACK_PROCFS was marked obsolete in commit 54b07dca
      ("netfilter: provide config option to disable ancient procfs parts") in
      v3.3.
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
    • net: sched: fix misuse of qcpu->backlog in gnet_stats_add_queue_cpu · de64b6b6
      Zhengchao Shao authored
      In the gnet_stats_add_queue_cpu function, the qstats->qlen statistics
      are incorrectly set to qcpu->backlog.
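The mix-up is easy to picture with a simplified per-CPU summation. The sketch below is only an illustration under assumed names: `struct demo_queue_stats` and `demo_add_queue_cpu()` are hypothetical, and a plain array stands in for the kernel's per-CPU storage.

```c
#include <stdint.h>

/* Hypothetical, reduced version of a queue statistics record. */
struct demo_queue_stats {
	uint32_t qlen;		/* packets currently queued */
	uint32_t backlog;	/* bytes currently queued */
};

/* Sum the per-CPU records into one total, field by field. */
static void demo_add_queue_cpu(struct demo_queue_stats *sum,
			       const struct demo_queue_stats *percpu,
			       int ncpus)
{
	for (int i = 0; i < ncpus; i++) {
		/* qlen must accumulate qlen; adding backlog here is the bug. */
		sum->qlen += percpu[i].qlen;
		sum->backlog += percpu[i].backlog;
	}
}
```

Since `qlen` counts packets and `backlog` counts bytes, feeding one into the other yields a queue length that is wrong by roughly the average packet size.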
      
      Fixes: 448e163f ("gen_stats: Add gnet_stats_add_queue()")
      Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
      Link: https://lore.kernel.org/r/20220815030848.276746-1-shaozhengchao@huawei.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 16 Aug, 2022 5 commits
  4. 15 Aug, 2022 14 commits
    • netfilter: nf_tables: check NFT_SET_CONCAT flag if field_count is specified · 1b6345d4
      Pablo Neira Ayuso authored
      Since f3a2181e ("netfilter: nf_tables: Support for sets with
      multiple ranged fields"), it is possible to combine intervals and
      concatenations. Later on, ef516e86 ("netfilter: nf_tables:
      reintroduce the NFT_SET_CONCAT flag") provides the NFT_SET_CONCAT flag
      for userspace to report that the set stores a concatenation.
      
      Make sure NFT_SET_CONCAT is set if field_count is specified, for
      consistency. Otherwise, if NFT_SET_CONCAT is specified with no
      field_count, bail out with EINVAL.
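The two-way consistency rule can be written as a small validation helper. This is a sketch under stated assumptions: `DEMO_SET_CONCAT` is a hypothetical bit value rather than the real UAPI constant, and a `field_count` of 0 is taken to mean "not specified", which may not match the threshold the kernel actually uses.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_SET_CONCAT (1u << 0)	/* hypothetical flag bit */

/*
 * Return 0 when the flag and field_count agree, -EINVAL otherwise.
 * Assumption: field_count == 0 means userspace did not specify it.
 */
static int demo_check_concat(uint32_t flags, unsigned int field_count)
{
	int has_concat = (flags & DEMO_SET_CONCAT) != 0;

	if (field_count > 0 && !has_concat)
		return -EINVAL;		/* fields given, flag missing */
	if (field_count == 0 && has_concat)
		return -EINVAL;		/* flag given, no fields */
	return 0;
}
```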
      
      Fixes: ef516e86 ("netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: disallow NFT_SET_ELEM_CATCHALL and NFT_SET_ELEM_INTERVAL_END · fc0ae524
      Pablo Neira Ayuso authored
      These flags are mutually exclusive, report EINVAL in this case.
      
      Fixes: aaa31047 ("netfilter: nftables: add catch-all set element support")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: NFTA_SET_ELEM_KEY_END requires concat and interval flags · 88cccd90
      Pablo Neira Ayuso authored
      If the NFT_SET_CONCAT|NFT_SET_INTERVAL flags are set, then the
      netlink attribute NFTA_SET_ELEM_KEY_END must be specified. Otherwise,
      NFTA_SET_ELEM_KEY_END should not be present.
      
      For a catch-all element, NFTA_SET_ELEM_KEY_END should not be present.
      The NFT_SET_ELEM_INTERVAL_END flag is never used with this combination
      of set flags.
      
      Fixes: 7b225d0b ("netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute")
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • Merge branch 'mlxsw-fixes' · 5061e34c
      David S. Miller authored
      Petr Machata says:
      
      ====================
      mlxsw: Fixes for PTP support
      
      This set fixes several issues in mlxsw PTP code.
      
      - Patch #1 fixes compilation warnings.
      
      - Patch #2 adjusts the order of operation during cleanup, thereby
        closing the window after PTP state was already cleaned in the ASIC
        for the given port, but before the port is removed, when the user
        could still in theory make changes to the configuration.
      
      - Patch #3 protects the PTP configuration with a custom mutex, instead
        of relying on RTNL, which is not held in all access paths.
      
      - Patch #4 forbids enablement of PTP only in RX or only in TX. The
        driver implicitly assumed this would be the case, but neglected to
        sanitize the configuration.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_ptp: Forbid PTP enablement only in RX or in TX · e01885c3
      Amit Cohen authored
      Currently mlxsw driver configures one global PTP configuration for all
      ports. The reason is that the switch behaves like a transparent clock
      between CPU port and front-panel ports. When time stamp is enabled in
      any port, the hardware is configured to update the correction field. The
      fact that the configuration of CPU port affects all the ports, makes the
      correction field update to be global for all ports. Otherwise, user will
      see odd values in the correction field, as the switch will update the
      correction field in the CPU port, but not in all the front-panel ports.
      
      The CPU port is relevant in both RX and TX, so to avoid problematic
      configuration, forbid PTP enablement only in one direction, i.e., only in
      RX or TX.
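The rule amounts to requiring that RX and TX time stamping are either both enabled or both disabled. A minimal sketch of such a check follows; `demo_ptp_check()` is a hypothetical helper, and reducing the requested configuration to two booleans is an assumption made for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Accept only symmetric configurations: both directions on, or both off.
 * Returns 0 on success and -EINVAL for an RX-only or TX-only request.
 */
static int demo_ptp_check(bool rx_enabled, bool tx_enabled)
{
	if (rx_enabled != tx_enabled)
		return -EINVAL;
	return 0;
}
```

An RX-only request, like the `hwstamp_ctl` invocation in this commit message, would be rejected by such a check with "Invalid argument".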
      
      Without the change:
      $ hwstamp_ctl -i swp1 -r 12 -t 0
      current settings:
      tx_type 0
      rx_filter 0
      new settings:
      tx_type 0
      rx_filter 2
      $ echo $?
      0
      
      With the change:
      $ hwstamp_ctl -i swp1 -r 12 -t 0
      current settings:
      tx_type 1
      rx_filter 2
      SIOCSHWTSTAMP failed: Invalid argument
      
      Fixes: 08ef8bc8 ("mlxsw: spectrum_ptp: Support SIOCGHWTSTAMP, SIOCSHWTSTAMP ioctls")
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_ptp: Protect PTP configuration with a mutex · d72fdef2
      Amit Cohen authored
      Currently the functions mlxsw_sp2_ptp_{configure, deconfigure}_port()
      assume that they are called when RTNL is locked and they warn otherwise.
      
      The deconfigure function can be called when port is removed, for example
      as part of device reload, then there is no locked RTNL and the function
      warns [1].
      
      To avoid such case, do not assume that RTNL protects this code, add a
      dedicated mutex instead. The mutex protects 'ptp_state->config' which
      stores the existing global configuration in hardware. Use this mutex also
      to protect the code which configures the hardware. Then, there will be
      only one configuration at any time, which will be updated in 'ptp_state',
      and a race will be avoided.
      
      [1]:
      RTNL: assertion failed at drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c (1600)
      WARNING: CPU: 1 PID: 1583493 at drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c:1600 mlxsw_sp2_ptp_hwtstamp_set+0x2d3/0x300 [mlxsw_spectrum]
      [...]
      CPU: 1 PID: 1583493 Comm: devlink Not tainted 5.19.0-rc8-custom-127022-gb371dffda095 #789
      Hardware name: Mellanox Technologies Ltd. MSN3420/VMOD0005, BIOS 5.11 01/06/2019
      RIP: 0010:mlxsw_sp2_ptp_hwtstamp_set+0x2d3/0x300 [mlxsw_spectrum]
      [...]
      Call Trace:
       <TASK>
       mlxsw_sp_port_remove+0x7e/0x190 [mlxsw_spectrum]
       mlxsw_sp_fini+0xd1/0x270 [mlxsw_spectrum]
       mlxsw_core_bus_device_unregister+0x55/0x280 [mlxsw_core]
       mlxsw_devlink_core_bus_device_reload_down+0x1c/0x30 [mlxsw_core]
       devlink_reload+0x1ee/0x230
       devlink_nl_cmd_reload+0x4de/0x580
       genl_family_rcv_msg_doit+0xdc/0x140
       genl_rcv_msg+0xd7/0x1d0
       netlink_rcv_skb+0x49/0xf0
       genl_rcv+0x1f/0x30
       netlink_unicast+0x22f/0x350
       netlink_sendmsg+0x208/0x440
       __sys_sendto+0xf0/0x140
       __x64_sys_sendto+0x1b/0x20
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Fixes: 08ef8bc8 ("mlxsw: spectrum_ptp: Support SIOCGHWTSTAMP, SIOCSHWTSTAMP ioctls")
      Reported-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum: Clear PTP configuration after unregistering the netdevice · a159e986
      Amit Cohen authored
      Currently as part of removing port, PTP API is called to clear the
      existing configuration and set the 'rx_filter' and 'tx_type' to zero.
      The clearing is done before unregistering the netdevice, which means that
      there is a window of time in which the user can reconfigure PTP in the
      port, and this configuration will not be cleared.
      
      Reorder the operations, clear PTP configuration after unregistering the
      netdevice.
      
      Fixes: 87486427 ("mlxsw: spectrum: PTP: Support SIOCGHWTSTAMP, SIOCSHWTSTAMP ioctls")
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • mlxsw: spectrum_ptp: Fix compilation warnings · 12e09138
      Amit Cohen authored
      When 'CONFIG_PTP_1588_CLOCK' is not enabled in the config file,
      'spectrum_ptp.h' provides stub implementations of the functions
      mlxsw_{sp,sp2}_ptp_txhdr_construct(). In this case, they should be
      defined as 'static', as they are not supposed to be used outside the
      including file. Make the functions 'static'; otherwise the following
      warnings are emitted:
      
      "warning: no previous prototype for 'mlxsw_sp_ptp_txhdr_construct'"
      "warning: no previous prototype for 'mlxsw_sp2_ptp_txhdr_construct'"
      
      In addition, make the functions 'inline': if 'spectrum_ptp.h' is
      included by a file that does not call them, plain 'static' functions
      would trigger compilation warnings about unused static functions.
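
The header pattern described above can be sketched as follows. This is only an illustration: `HAVE_PTP` is a hypothetical macro standing in for 'CONFIG_PTP_1588_CLOCK', and the function name, signature and return value are simplified assumptions, not the real mlxsw prototypes.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for CONFIG_PTP_1588_CLOCK. It is left undefined
 * here, so the stub branch below is the one that gets compiled. */
#ifdef HAVE_PTP
/* Feature enabled: only a prototype is exposed, the body lives in a .c file. */
int ptp_txhdr_construct(void *skb);
#else
/* Feature disabled: the stub is defined in the header itself.
 * 'static' keeps it local to each including file, which avoids the
 * "no previous prototype" warning. 'inline' avoids "defined but not used"
 * warnings in files that include the header without calling the stub. */
static inline int ptp_txhdr_construct(void *skb)
{
	(void)skb;              /* unused in the stub */
	return -EOPNOTSUPP;     /* illustrative error code (assumption) */
}
#endif
```

With the feature macro undefined, any file can include this header and either call the stub or ignore it without triggering either warning.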
      
      Fixes: 24157bc6 ("mlxsw: Send PTP packets as data packets to overcome a limitation")
      Reported-by: kernel test robot <lkp@intel.com>
      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      12e09138
    • net_sched: cls_route: disallow handle of 0 · 02799571
      Jamal Hadi Salim authored
      Follows up on:
      https://lore.kernel.org/all/20220809170518.164662-1-cascardo@canonical.com/
      
      A handle of 0 implies a from/to of the universe realm, which is not very
      sensible.
      
      Let's see what this patch will do:
      $sudo tc qdisc add dev $DEV root handle 1:0 prio
      
      //let's manufacture a way to insert handle of 0
      $sudo tc filter add dev $DEV parent 1:0 protocol ip prio 100 \
      route to 0 from 0 classid 1:10 action ok
      
      //gets rejected...
      Error: handle of 0 is not valid.
      We have an error talking to the kernel, -1
      
      //let's create a legit entry..
      $sudo tc filter add dev $DEV parent 1:0 protocol ip prio 100 route from 10 \
      classid 1:10 action ok
      
      //what did the kernel insert?
      $sudo tc filter ls dev $DEV parent 1:0
      filter protocol ip pref 100 route chain 0
      filter protocol ip pref 100 route chain 0 fh 0x000a8000 flowid 1:10 from 10
      	action order 1: gact action pass
      	 random type none pass val 0
      	 index 1 ref 1 bind 1
      
      //Let's try to replace that legit entry with a handle of 0
      $ sudo tc filter replace dev $DEV parent 1:0 protocol ip prio 100 \
      handle 0x000a8000 route to 0 from 0 classid 1:10 action drop
      
      Error: Replacing with handle of 0 is invalid.
      We have an error talking to the kernel, -1
      
      And last, let's run Cascardo's POC:
      $ ./poc
      0
      0
      -22
      -22
      -22
      Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02799571
    • net: fix potential refcount leak in ndisc_router_discovery() · 7396ba87
      Xin Xiong authored
      The issue happens on specific paths in the function. After both
      objects `rt` and `neigh` are grabbed successfully, when `lifetime` is
      nonzero but the metric needs to change, the function just deletes the
      route and sets `rt` to NULL. Then, it may try grabbing `rt` and `neigh`
      again if the above conditions hold. The function simply overwrites
      `neigh` on success, or returns on failure, without decrementing the
      reference count of the previous `neigh`. This may result in memory leaks.
      
      Fix it by decrementing the reference count of `neigh` in place.
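
The leak and its fix can be modelled in user space with a plain counter. This is a sketch under stated assumptions: `struct neigh`, `neigh_get()`, `neigh_put()` and `lookup_twice()` are hypothetical stand-ins written for this illustration, not the kernel's neighbour API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted neighbour entry. */
struct neigh {
	int refcnt;
};

/* Take a reference and return the same pointer. */
static struct neigh *neigh_get(struct neigh *n)
{
	if (n)
		n->refcnt++;
	return n;
}

/* Drop a reference; a NULL pointer is ignored. */
static void neigh_put(struct neigh *n)
{
	if (!n)
		return;
	assert(n->refcnt > 0);
	n->refcnt--;
}

/* Grabs the entry twice, as happens when the route is deleted and looked
 * up again. Returns the reference count left over after the single put on
 * the exit path; anything above zero is a leaked reference. */
static int lookup_twice(struct neigh *entry, int fix_applied)
{
	struct neigh *neigh = neigh_get(entry);  /* first grab */

	/* ... the route is deleted here because the metric changed ... */
	if (fix_applied) {
		neigh_put(neigh);                /* drop before grabbing again */
		neigh = NULL;
	}

	neigh = neigh_get(entry);                /* second grab overwrites 'neigh' */

	neigh_put(neigh);                        /* exit path drops one reference */
	return entry->refcnt;
}
```

Calling `lookup_twice()` with `fix_applied` set to 0 leaves one reference behind, while the fixed ordering returns the counter to zero.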
      
      Fixes: 6b2e04bc ("net: allow user to set metric on default route learned via Router Advertisement")
      Signed-off-by: Xin Xiong <xiongx18@fudan.edu.cn>
      Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7396ba87
    • Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue · 27b8d4d7
      David S. Miller authored
      
      Tony Nguyen says:
      
      ====================
      Intel Wired LAN Driver Updates 2022-08-11 (ice)
      
      This series contains updates to ice driver only.
      
      Benjamin corrects a misplaced parenthesis for a WARN_ON check.
      
      Michal removes WARN_ON from a check as it's recoverable and does not
      warrant a call trace.
      ====================
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27b8d4d7
    • neighbour: make proxy_queue.qlen limit per-device · 0ff4eb3d
      Alexander Mikhalitsyn authored
      Right now we have a neigh_param, PROXY_QLEN, which specifies the maximum
      length of neigh_table->proxy_queue. In practice, this limit doesn't work
      well because the check looks like:
      tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)
      
      The problem is that p (struct neigh_parms) is per-device,
      but tbl (struct neigh_table) is system-wide.
      
      It seems reasonable to make the proxy_queue limit per-device.
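
The mismatch between the global counter and the per-device limit can be illustrated with a small model. The structures and function names below are simplified, hypothetical stand-ins for struct neigh_table and struct neigh_parms; keeping a per-device counter next to the limit is an assumption about how such a check could look, not a copy of the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct table {
	int qlen;        /* system-wide number of queued proxy requests */
};

struct parms {
	int qlen;        /* requests queued for this device only */
	int proxy_qlen;  /* per-device limit (PROXY_QLEN) */
};

/* Old style of check: a global counter compared with a per-device limit.
 * One busy device can push every other device over its limit. */
static bool over_limit_global(const struct table *t, const struct parms *p)
{
	assert(p->proxy_qlen >= 0);
	return t->qlen > p->proxy_qlen;
}

/* Per-device check: the counter and the limit belong to the same device. */
static bool over_limit_per_dev(const struct parms *p)
{
	assert(p->proxy_qlen >= 0);
	return p->qlen > p->proxy_qlen;
}
```

A device with one queued request and a limit of 64 is rejected by the global check as soon as the table as a whole holds more than 64 requests, but it passes the per-device check.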
      
      v2:
      	- nothing changed in this patch
      v3:
      	- rebase to net tree
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Yajun Deng <yajun.deng@linux.dev>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
      Cc: kernel@openvz.org
      Cc: devel@openvz.org
      Suggested-by: Denis V. Lunev <den@openvz.org>
      Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Reviewed-by: Denis V. Lunev <den@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ff4eb3d
    • neigh: fix possible DoS due to net iface start/stop loop · 66ba215c
      Denis V. Lunev authored
      Normal processing of an ARP request (usually an Ethernet broadcast
      packet) arriving at the host looks like this:
      * the packet comes to arp_process() call and is passed through routing
        procedure
      * the request is put into the queue using pneigh_enqueue() if
        corresponding ARP record is not local (common case for container
        records on the host)
      * the request is processed by timer (within 80 jiffies by default) and
        ARP reply is sent from the same arp_process() using
        NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
        pneigh_enqueue())
      
      And here the problem arises. The Linux kernel calls pneigh_queue_purge(),
      which destroys the whole queue of ARP requests, on ANY network interface
      start/stop event through __neigh_ifdown().
      
      This was not a problem originally, as network interface start/stop was
      accessible only to the host 'root', which could do more destructive
      things anyway. But the world has changed and there are Linux containers
      now. A container's 'root' has access to this API and should be
      considered an untrusted user in the hosting (container's) world.
      
      Thus there is an attack vector against other containers on the node: a
      container's root can endlessly start/stop interfaces. We have observed
      a similar situation on a real production node, where a docker container
      was doing exactly that and other containers on the node became
      inaccessible.
      
      The proposed patch does a very simple thing. In pneigh_queue_purge(), it
      drops only the packets from the namespace in which the network interface
      state change is detected. This is enough to prevent the problem for the
      whole node while preserving the original semantics of the code.
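
The selective purge can be sketched on a singly linked queue. This is a user-space model under stated assumptions: `struct pkt`, its integer `ns` field and `queue_purge_ns()` are hypothetical names for this illustration, while the kernel works on sk_buff queues and struct net pointers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued ARP request. 'ns' plays the role of
 * the network namespace that the packet's device belongs to. */
struct pkt {
	int ns;
	struct pkt *next;
};

/* Unlink only the packets that belong to namespace 'ns' and leave the
 * rest of the queue untouched. Returns the number of packets removed.
 * Freeing the unlinked packets is left to the caller in this sketch. */
static int queue_purge_ns(struct pkt **head, int ns)
{
	int removed = 0;

	assert(head != NULL);
	while (*head) {
		if ((*head)->ns == ns) {
			*head = (*head)->next;  /* skip over the matching packet */
			removed++;
		} else {
			head = &(*head)->next;  /* keep it and move on */
		}
	}
	return removed;
}
```

Purging namespace 2 from a queue that holds packets from namespaces 2, 1 and 2 removes two packets and leaves the namespace 1 request queued, so an interface flap in one container no longer discards requests that belong to another.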
      
      v2:
      	- do del_timer_sync() if queue is empty after pneigh_queue_purge()
      v3:
      	- rebase to net tree
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Yajun Deng <yajun.deng@linux.dev>
      Cc: Roopa Prabhu <roopa@nvidia.com>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
      Cc: kernel@openvz.org
      Cc: devel@openvz.org
      Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
      Signed-off-by: Denis V. Lunev <den@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66ba215c
    • net: qrtr: start MHI channel after endpoint creation · 68a838b8
      Maxim Kochetkov authored
      An MHI channel may generate an event/interrupt right after it is
      enabled. This can lead to two race conditions.
      
      1)
      Such an event may be dropped by qcom_mhi_qrtr_dl_callback() at the check:
      
      	if (!qdev || mhi_res->transaction_status)
      		return;
      
      because dev_set_drvdata(&mhi_dev->dev, qdev) may not have been performed
      yet. In this situation qrtr-ns will be unable to enumerate the
      services on the device.
      
      2)
      Such an event may arrive after dev_set_drvdata() and
      before qrtr_endpoint_register(). In this case the kernel will panic on
      a bad pointer access in qcom_mhi_qrtr_dl_callback():
      
      	rc = qrtr_endpoint_post(&qdev->ep, mhi_res->buf_addr,
      				mhi_res->bytes_xferd);
      
      because the endpoint has not been created yet.
      
      So move mhi_prepare_for_transfer_autoqueue() after endpoint creation
      to fix both races.
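
The ordering argument can be modelled in user space. Everything in this sketch is hypothetical (`struct dev`, `dl_callback()`, `probe_fixed()`); it only mirrors the sequence described above, in which the driver data and the endpoint must both be in place before the channel is allowed to deliver events.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the device state touched during probe. */
struct dev {
	void *drvdata;        /* set by the dev_set_drvdata() step */
	bool endpoint_ready;  /* set by the endpoint registration step */
	int delivered;        /* events that reached the endpoint */
	int dropped;          /* events discarded by the callback */
};

/* Mirrors the callback's guard: without driver data the event is dropped
 * (race 1). Posting to an unregistered endpoint is the crash case
 * (race 2), modelled here as an assertion. */
static void dl_callback(struct dev *d)
{
	if (!d->drvdata) {
		d->dropped++;
		return;
	}
	assert(d->endpoint_ready);
	d->delivered++;
}

/* Fixed ordering: publish the driver data, register the endpoint, and only
 * then start the channel, which may raise an event immediately. */
static void probe_fixed(struct dev *d, void *priv)
{
	d->drvdata = priv;
	d->endpoint_ready = true;
	dl_callback(d);       /* models an event right after the channel starts */
}
```

With this ordering, the first event after the channel starts finds both the driver data and the endpoint, so it is delivered rather than dropped.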
      
      Fixes: a2e2cc0d ("net: qrtr: Start MHI channels during init")
      Signed-off-by: Maxim Kochetkov <fido_max@inbox.ru>
      Reviewed-by: Hemant Kumar <quic_hemantk@quicinc.com>
      Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
      Reviewed-by: Loic Poulain <loic.poulain@linaro.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      68a838b8
  5. 13 Aug, 2022 1 commit